Thread: Again, sorry, caching.
I traded a couple of emails with a guy using one of my open source projects. To make a long story short, he is moving to the new version of MySQL for his website because of its new query caching feature. He is convinced that it will speed up his web site, and he is probably right.

On a web site, a few specific queries get executed, unchanged, repeatedly. Think about an ecommerce site: most of the time it is just a handful of basic queries, usually against pretty large product tables. A caching mechanism would make these queries pretty lightweight.

The arguments against caching:

"It is an application issue"
This is completely wrong. Caching cannot be done against a database without knowledge of the database, i.e. when the data changes.

"If it is mostly static data, why not just make it a static page?"
Because a static page is a maintenance nightmare. One uses a database in a web site to allow content to be changed and upgraded dynamically and with a minimum of work.

"It isn't very useful"
I disagree completely. A cache of the most frequently used queries, or specific ones, could make for REALLY good performance in some very specific, but very common, applications. Any system that has a hierarchical "drill down" interface to a data set (ecommerce, libraries, document management systems, etc.) will greatly benefit from a query cache.

I was thinking that it could be implemented as a keyword or comment in a query, such as:

select * from table where column = 'foo' cacheable
or
select * from table where column = 'bar' /* cacheable */

Either way, it would speed up a lot of common application types. It would even be very cool if you could cache the results of subqueries, such as:

select * from (select * from table where col1 = 'foo' cacheable) as subset where subset.col2 = 'bar';

which would mean that the subquery gets cached, but the outer select need not be. The cache could be like a global temp table.
Perhaps the user could even name the cache entry:

select * from table where column = 'foo' cache on foo

where one could also do:

select * from cache_foo

Using a keyword is probably the better idea: it can be picked up by the parser and instruct PostgreSQL to use the cache; otherwise there will be no additional overhead. Having caching within PostgreSQL will also be good for data integrity. Application caches can't tell when an update/delete/insert happens, so they often have to use a time-out mechanism.

OK, let me have it. Tell me how terrible an idea this is; tell me how wrong I am.
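The "cache could be like a global temp table" idea above can be sketched without any server changes. This is only an illustration, using the standard-library sqlite3 module in place of PostgreSQL; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("foo", "bar"), ("foo", "baz"), ("other", "bar")])

# Materialize the "cacheable" subquery once, like a named global temp table.
conn.execute(
    "CREATE TEMP TABLE cache_foo AS SELECT * FROM t WHERE col1 = 'foo'")

# The outer query now scans only the cached subset, not the base table.
rows = conn.execute("SELECT * FROM cache_foo WHERE col2 = 'bar'").fetchall()
print(rows)  # [('foo', 'bar')]
```

The difference, of course, is that the server would have to invalidate cache_foo when t changes, which is the hard part the proposal is about.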
I previously replied to you, vaguely describing a way you could implement this by using a combination of client side caching and database tables and triggers to allow you to determine if your cache is still valid. Someone came right behind me, Tom maybe??, and indicated that the proper/ideal way to do this would be to use Postgres' asynchronous database notification mechanism (listen/notify, I believe, were the semantics) to alert your application that your cache has become invalid. Basically, a couple of triggers and the use of the listen/notify model, and you should be all set.

Done properly, a client side cache which is asynchronously notified by the database when its contents become invalid should be faster than relying on MySQL's database caching scheme. Basically, a strong client side cache is going to prevent your database from even having to return a cached result set, while a database side cache is always going to return a result set. Of course, one of the extra cool things you can do is to cache a gzip'd copy of the data contents, which would further act as an optimization, preventing the client or web server (in case they are different) from having to recompress every result set.

In the long run, again, if properly done, you should be able to beat MySQL's implementation without too much extra effort. Why? Because a client side cache can be much smarter in the way that it uses its cached contents, much in the same way an application is able to cache its data better than the file system can. This is why a client side cache should be preferred over a database result set cache.

Greg

References:
http://www.postgresql.org/idocs/index.php?sql-notify.html
http://www.postgresql.org/idocs/index.php?sql-listen.html

On Sat, 2002-03-16 at 08:01, mlw wrote:
> I traded a couple emails with a guy using one of my open source projects.
> [snip]
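The client-side cache Greg describes can be sketched in a few lines. This is a minimal model, not real code: in practice the application would issue LISTEN on a channel and a trigger's NOTIFY would deliver the invalidation; here the notify() call stands in for that asynchronous message, and all names are illustrative.

```python
class QueryCache:
    """Client-side result cache invalidated by table-change notifications."""

    def __init__(self, execute):
        self.execute = execute  # function that really runs the SQL
        self.cache = {}         # query text -> cached result set
        self.watch = {}         # table name -> queries reading that table

    def query(self, sql, tables):
        if sql not in self.cache:
            self.cache[sql] = self.execute(sql)
            for t in tables:
                self.watch.setdefault(t, set()).add(sql)
        return self.cache[sql]

    def notify(self, table):
        # Called when the backend reports (via NOTIFY) that `table` changed.
        for sql in self.watch.pop(table, set()):
            self.cache.pop(sql, None)

calls = []
def run(sql):
    calls.append(sql)
    return [("foo",)]

c = QueryCache(run)
c.query("select * from t where c='foo'", ["t"])
c.query("select * from t where c='foo'", ["t"])   # served from the cache
assert len(calls) == 1
c.notify("t")                                     # a trigger fired NOTIFY
c.query("select * from t where c='foo'", ["t"])   # re-executed
assert len(calls) == 2
```

Note that a cache hit here never touches the database at all, which is the point of Greg's argument that a strong client cache beats a server-side result cache.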
Triggers and asynchronous notification are not substitutes for real, hard, ACID-compliant caching. The way you suggest implies only one access model. Take the notion of a library: they have both web and application access, and both should be able to use the cache.

Also, your suggestion does not address the sub-select case, which I think is a much bigger win, performance wise, and more efficient than MySQL's cache.

This whole discussion could be moot, and this could be developed as an extension, if there were a function API that could return sets of whole rows.

Greg Copeland wrote:
> I previously replied to you vaguely describing a way you could implement
> this by using a combination of client side caching and database tables
> and triggers to allow you to determine if your cache is still valid.
> [snip]
On Sat, 2002-03-16 at 08:01, mlw wrote:
[snip]
> "If it is mostly static data, why not just make it a static page?"
> Because a static page is a maintenance nightmare. One uses a database in a web
> site to allow content to be changed and upgraded dynamically and with a minimum
> of work.

Oh yeah, I forgot to reply to that part. I think you are forgetting that you can use a database to generate a static page; that is, only regenerate the static page when the data within the database changes. Again, this is another example of efficient application caching. If you have an application which listens for your cache invalidation event, you can then recreate your static page. Again, database result set caching is not required, and this should be significantly faster than MySQL's result set caching.

It is also worth noting that you could then gzip your static page (keeping both static pages -- compressed and uncompressed), resulting in yet another optimization for most web servers and browsers.

Greg
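The regenerate-on-invalidation idea sketched, again only as an illustration: keep both a plain and a gzip'd copy of the rendered page, and rebuild them only when the invalidation event for the underlying data arrives. The render() function and the fetch callable are stand-ins for whatever templating and query layer the site actually uses.

```python
import gzip

def render(rows):
    # Stand-in for the site's real templating.
    return "<ul>" + "".join(f"<li>{r}</li>" for r in rows) + "</ul>"

class StaticPage:
    def __init__(self, fetch):
        self.fetch = fetch      # runs the query against the database
        self.rebuild()

    def rebuild(self):
        # Called from the cache-invalidation handler, not on every request.
        self.html = render(self.fetch())
        self.html_gz = gzip.compress(self.html.encode())

page = StaticPage(lambda: ["a", "b"])
# Requests are served from page.html or page.html_gz with no database hit.
assert gzip.decompress(page.html_gz).decode() == page.html
```

Serving page.html_gz directly to clients that accept gzip is the "never recompress" optimization Greg mentions.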
On Sat, 2002-03-16 at 08:36, mlw wrote:
> Triggers and asynchronous notification are not substitutes for real hard ACID
> compliant caching. The way you suggest implies only one access model. Take the
> notion of a library, they have both web and application access. These should
> both be able to use the cache.

Well, obviously, you'd need to re-implement the client side cache in each implementation of the client. That is a down side, and I certainly won't argue that. As for the "no substitute" comment, I guess I'll plead ignorance, because I'm not sure what I'm missing here. What am I missing that would not be properly covered by that model?

> Also, your suggestion does not address the sub-select case, which I think is
> much bigger, performance wise, and more efficient than MySQL's cache.

I'm really not sure what you mean by that. Doesn't address it, but is more efficient? Maybe it's because I've not had my morning coffee yet... ;)

> This whole discussion could be moot, and this could be developed as an
> extension, if there were a function API that could return sets of whole rows.

Maybe... but you did ask for feedback. :)

Greg
> I was thinking that it could be implemented as a keyword or comment in a query.
> Such as:
>
> select * from table where column = 'foo' cacheable
> or
> select * from table where column = 'bar' /* cacheable */
>
> Having caching within PostgreSQL will be good for data integrity. Application
> caches can't tell when an update/delete/insert happens, they often have to use
> a time-out mechanism.
>
> OK, let me have it, tell me how terrible an idea this is. tell me how wrong I
> am.

I don't think it's a bad idea, but a cache that takes a query string (or subquery string) and looks for a match based on it is flawed without special consideration for non-cacheable functions and constructs (CURRENT_USER, things that depend on the timezone, things that depend on the datestyle). We'd also need to work out an appropriate mechanism for cache invalidation and for adding things to the cache.
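The point about non-cacheable constructs can be made concrete with a small guard that rejects query text containing volatile constructs before it is ever considered for the cache. The pattern list here is illustrative and deliberately incomplete; a real implementation would work from the parser's knowledge of function volatility, not from the query string.

```python
import re

# Constructs whose value depends on session or time, so two identical
# query strings can legitimately return different results.
VOLATILE = re.compile(
    r"\b(current_user|current_timestamp|current_date|now\(\)|random\(\))",
    re.IGNORECASE)

def is_cacheable(sql):
    """True if the query text contains none of the known volatile constructs."""
    return VOLATILE.search(sql) is None

assert is_cacheable("select * from t where c = 'foo'")
assert not is_cacheable("select CURRENT_USER, c from t")
assert not is_cacheable("select * from log where ts > now()")
```

String matching also misses user-defined volatile functions entirely, which is exactly why the check belongs in the backend's parser rather than in a text-level cache.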
Andrew Sullivan wrote:
> On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:
> > "If it is mostly static data, why not just make it a static page?"
> > Because a static page is a maintenance nightmare. One uses a
> > database in a web site to allow content to be changed and upgraded
> > dynamically and with a minimum of work.
>
> This seems wrong to me. Why not build an extra bit of functionality
> so that when the admin makes a static-data change, the new static
> data gets pushed into the static files?
>
> I was originally intrigued by the suggestion you made, but the more I
> thought about it (and read the arguments of others) the more
> convinced I became that the MySQL approach is a mistake. It's
> probably worth it for their users, who seem not to care that much
> about ACID anyway. But I think for a system that really wants to
> play in the big leagues, the cache is a big feature that requires a
> lot of development, but which is not adequately useful for most
> cases. If we had infinite developer resources, it might be worth it.
> In the actual case, I think it's too low a priority.

Again, I can't speak to priority, but I can name a few common applications where caching would be a great benefit. The more I think about it, the more I like the idea of a 'cacheable' keyword in the select statement.

My big problem with putting the cache outside of the database is that it is now incumbent on the applications programmer to write a cache. A database should manage the data; the application should handle how the data is presented. Forcing the application to implement a cache feels wrong.
Greg Copeland wrote:
> On Sat, 2002-03-16 at 08:36, mlw wrote:
> > Triggers and asynchronous notification are not substitutes for real hard ACID
> > compliant caching. The way you suggest implies only one access model. Take the
> > notion of a library, they have both web and application access. These should
> > both be able to use the cache.
>
> Well, obviously, you'd need to re-implement the client side cache in
> each implementation of the client. That is a down side and I certainly
> won't argue that. As for the "no substitute" comment, I guess I'll
> plead ignorance because I'm not sure what I'm missing here. What am I
> missing that would not be properly covered by that model?

It would not be guaranteed to be up to date with the state of the database. By implementing the cache within the database, PostgreSQL could maintain the consistency.

> > Also, your suggestion does not address the sub-select case, which I think is
> > much bigger, performance wise, and more efficient than MySQL's cache.
>
> I'm really not sure what you mean by that. Doesn't address it but is
> more efficient? Maybe it's because I've not had my morning coffee
> yet... ;)

If an internal caching system can be implemented within PostgreSQL -- and trust me, I understand what a hairball it would be with multiversion concurrency -- complex queries such as:

select * from (select * from mytable where foo = 'bar' cacheable) as subset where subset.col = 'value'

could benefit. The 'cacheable' keyword applied to the subquery would mean that PostgreSQL could keep that result set handy for later use. If that subselect always does a table scan of mytable, no one can argue that this subquery caching could be a huge win.

As a side note, I REALLY like the idea of a keyword for caching as opposed to automated caching. It would allow the DBA or developer more control over PostgreSQL's behavior, and potentially make the feature easier to implement.
> > This whole discussion could be moot, and this could be developed as an
> > extension, if there were a function API that could return sets of whole rows.

Currently a function can only return one value, or a 'setof' a single type, implemented as one function call for each entry in the set. If there were a function interface which could return a whole row, and multiple rows similar to the 'setof' return, that would be very cool. That way caching could be implemented as:

select * from pgcache('select * from mytable where foo=''bar''') as subset where subset.col = 'value';
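The pgcache() idea, expressed in client code: a wrapper that memoizes a set-returning query function by its SQL text and hands back whole rows, not single values. Everything here is hypothetical (pgcache and run_query are made-up names); it only illustrates the calling convention the proposed function API would enable.

```python
def make_pgcache(run_query):
    """Wrap a query runner with memoization keyed on the query text."""
    results = {}
    def pgcache(sql):
        if sql not in results:
            results[sql] = run_query(sql)
        return results[sql]
    return pgcache

hits = []
def run_query(sql):
    hits.append(sql)
    return [(1, "foo"), (2, "foo")]   # whole rows, like a setof-record return

pgcache = make_pgcache(run_query)
pgcache("select * from mytable where foo='bar'")
pgcache("select * from mytable where foo='bar'")  # second call never runs SQL
assert len(hits) == 1
```

Inside the backend, the same shape would let caching ship as an extension function rather than a new keyword, which is the point mlw is making.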
I think the notion that data should be managed outside of the database is bogus. Query caching can improve performance in some specific, but popular, scenarios. Saying it does not belong within the database and is the job of the application is like saying file caching is not the job of the file system but the job of the application.

This is functionality many users want, and it can be justified by some very specific, but very common, scenarios. It is not for me to say whether it is worth the work, or whether it should be done. But from the perspective of the user, having this capability within the database is an important feature; that is the argument I want to make.

Greg Copeland wrote:
> I previously replied to you vaguely describing a way you could implement
> this by using a combination of client side caching and database tables
> and triggers to allow you to determine if your cache is still valid.
> [snip]
On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:
> "If it is mostly static data, why not just make it a static page?"
> Because a static page is a maintenance nightmare. One uses a database in a web
> site to allow content to be changed and upgraded dynamically and with a minimum
> of work.

That's a weak argument for a DB cache. Why not generate the web page after a data change, and the next time serve it as a static page?

> I was thinking that it could be implemented as a keyword or comment in a query.
> Such as:
>
> select * from table where column = 'foo' cacheable

You can insert the "mostly static data" into a temp table and use this temp table in subsequent queries. After an update/delete/insert, your application can rebuild the temp table (or a trigger can?).

Karel

--
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
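Karel's temp-table-plus-trigger suggestion can be sketched end to end. This is only an illustration using the standard-library sqlite3 module (whose trigger syntax differs from PostgreSQL's); a trigger marks the cache table stale, and the application rebuilds it lazily on the next read. All table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (col TEXT, val TEXT);
    CREATE TABLE cache_foo (col TEXT, val TEXT);   -- the "temp table" cache
    CREATE TABLE cache_state (dirty INTEGER);
    INSERT INTO cache_state VALUES (1);
    CREATE TRIGGER t_changed AFTER INSERT ON t
        BEGIN UPDATE cache_state SET dirty = 1; END;
""")

def query_foo():
    # Rebuild the cache table only if a write has marked it dirty.
    if conn.execute("SELECT dirty FROM cache_state").fetchone()[0]:
        conn.execute("DELETE FROM cache_foo")
        conn.execute("INSERT INTO cache_foo SELECT * FROM t WHERE col='foo'")
        conn.execute("UPDATE cache_state SET dirty = 0")
    return conn.execute("SELECT * FROM cache_foo").fetchall()

conn.execute("INSERT INTO t VALUES ('foo', 'a')")
assert query_foo() == [("foo", "a")]
conn.execute("INSERT INTO t VALUES ('foo', 'b')")  # trigger sets dirty
assert len(query_foo()) == 2
```

A full version would also need UPDATE and DELETE triggers; this only covers the insert path.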
Karel Zak wrote:
> On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:
> > "If it is mostly static data, why not just make it a static page?"
> > Because a static page is a maintenance nightmare. One uses a database in a web
> > site to allow content to be changed and upgraded dynamically and with a minimum
> > of work.
>
> That's a weak argument for a DB cache. Why not generate the web page after a
> data change, and the next time serve it as a static page?
>
> > I was thinking that it could be implemented as a keyword or comment in a query.
> > Such as:
> >
> > select * from table where column = 'foo' cacheable
>
> You can insert the "mostly static data" into a temp table and use this temp
> table in subsequent queries. After an update/delete/insert, your application
> can rebuild the temp table (or a trigger can?).

Yes, I could, as could most of the guys reading these messages. I am thinking about a feature in PostgreSQL that would make that easier for the average DBA or web producer.

Let's face it, MySQL wins a lot of people because they put in features that people want. All the ways people have suggested to "compete" with MySQL's caching have been ugly kludges. I understand that there is an amount of work involved in doing caching, and the value of caching is debatable to some; however, it is demonstrable that caching can improve a very common, albeit specific, set of deployments. Also, managing data is the job of the database, not the application. It does belong in PostgreSQL; if someone is forced to write a caching scheme around PostgreSQL, it is because PostgreSQL lacks that feature.
On Monday 18 March 2002 13:23, mlw wrote:
> Let's face it, MySQL wins a lot of people because they put in features that
> people want.

MySQL is very interested in benchmarks. It does not really care about data consistency.

Cheers,
Jean-Michel POURE
Yes. EVERY person I've ever known who runs MySQL runs it for two very simple reasons. First, they believe it to be wicked fast. Second, they don't understand what ACID is, what a transaction is, or why running a single session against a database to perform a benchmark is a completely bogus concept. In case it's not obvious, these are usually people who are trying to take a step up from Access. While I do believe MySQL, from a performance perspective, is a step up from Access, I always tell my clients: if you wouldn't use an Access database for this project, you shouldn't use MySQL either.

To me, this means we need better advertising, PR, and education rather than a result set cache. :P

Speaking of which, I'm wondering if there are any design patterns we can look at which would address client side caching... well, at least make it easier to implement, as well as implement it in a consistent manner.

Greg

On Mon, 2002-03-18 at 07:32, Jean-Michel POURE wrote:
> On Monday 18 March 2002 13:23, mlw wrote:
> > Let's face it, MySQL wins a lot of people because they put in features that
> > people want.
>
> MySQL is very interested in benchmarks.
> It does not really care about data consistency.
>
> Cheers, Jean-Michel POURE
Jean-Michel POURE wrote:
> On Monday 18 March 2002 13:23, mlw wrote:
> > Let's face it, MySQL wins a lot of people because they put in features that
> > people want.
>
> MySQL is very interested in benchmarks.
> It does not really care about data consistency.

In no way am I suggesting we avoid ACID compliance. In no way am I suggesting that PostgreSQL change. All I am suggesting is that queries against tables which change infrequently can and should be cacheable.

select * from table where foo = 'bar'

need not be executed twice if the table has not changed.

select * from table1, (select * from table2 where foo='bar' cacheable) as subset where subset.col1 = table1.col1;

In the above query, if table2 changes 4 times a day, and is queried a couple of times a minute or second, caching the subset could save a huge amount of disk I/O. This sort of query could improve many catalog based implementations, from music, to movies, to books. A library could implement a SQL query for book lookups like this:

select * from authors, (select * from books where genre = 'scifi' cacheable) as subset where authors.id = subset.authorid and authors.id in (....)

Yes, it is arguable that index scans may work better and, obviously, summary tables may help, etc., but imagine a more complex join which produces fewer records but is executed frequently. Caching could help the performance of PostgreSQL in some very real applications. MySQL's quest for benchmarking numbers, I agree, is shameful, because they create numbers which are not really applicable in the real world. This time, however, I think they may be on to something.

(1) PostgreSQL uses a "cacheable" or "iscacheable" keyword.
(2) If the query uses functions which are not marked as "iscacheable," then it is not cached.
(3) If any table contained within the cacheable portion of the query is modified, the cache is marked as dirty.
(4) No provisions are made to recreate the cache after an insert/update/delete.
(5) The first query marked as "iscacheable" that encounters a "dirty" flag in a table does an exhaustive search on the cache and removes all entries that are affected.

As far as I can see, if the above parameters are used to define caching, it could improve performance on sites where a high number of transactions are made but there is also a large amount of static data, i.e. an ecommerce site, library, etc. If the "iscacheable" keyword is not used, PostgreSQL will not incur any performance degradation. If the "iscacheable" keyword is used, the performance cost could very well be made up by the benefits of caching.
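The five rules above can be modeled in a few lines. This is a toy in-memory model, not a server design: entries record which tables they read, writes mark tables dirty (rule 3), nothing is rebuilt eagerly (rule 4), and the next cacheable query sweeps out entries touching dirty tables (rule 5).

```python
class CacheModel:
    def __init__(self):
        self.entries = {}   # sql -> (tables read, result)
        self.dirty = set()  # tables modified since the last sweep

    def select(self, sql, tables, run):
        if self.dirty:
            # Rule 5: lazy exhaustive sweep, dropping affected entries only.
            self.entries = {q: (ts, r) for q, (ts, r) in self.entries.items()
                            if not (set(ts) & self.dirty)}
            self.dirty.clear()
        if sql not in self.entries:
            self.entries[sql] = (tables, run())
        return self.entries[sql][1]

    def write(self, table):
        # Rule 3: any insert/update/delete just marks the table dirty.
        self.dirty.add(table)

m = CacheModel()
m.select("q1", ["books"], lambda: [1])
m.write("books")
m.select("q2", ["authors"], lambda: [2])   # this query triggers the sweep
assert "q1" not in m.entries and "q2" in m.entries
```

Rule 2 (function cacheability) is omitted here; in the backend it would be a check at parse time before the entry is ever created.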
"Matthew T. O'Connor" wrote:
> > My big problem with putting the cache outside of the database is that it is
> > now incumbent on the applications programmer to write a cache. A database
> > should manage the data, the application should handle how the data is
> > presented. Forcing the application to implement a cache feels wrong.
>
> I believe someone suggested a possible solution that was in the pg client
> using NOTIFY and triggers. The argument given against it was that
> it would not be ACID compliant. I say, who cares. I would think that
> "select cacheable" would only be allowed for simple selects; it would not be
> used for select for update or anything else. Any time you are given the
> result of a simple select, you are not guaranteed that the data won't change
> underneath you.

Not true: if you begin a transaction, you can be isolated from changes made to the database.

> The primary use that you have suggested is for web sites,
> and they certainly won't mind if the cache is 0.3 seconds out of date.

Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL is a far better system. Making PostgreSQL less accurate, less "correct," takes away, IMHO, the very reasons to use it.
On Mon, 2002-03-18 at 08:15, mlw wrote:
> "Matthew T. O'Connor" wrote:
> [snip]
> > The primary use that you have suggested is for web sites,
> > and they certainly won't mind if the cache is 0.3 seconds out of date.
>
> Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL
> is a far better system. Making PostgreSQL less accurate, less "correct" takes
> away, IMHO, the very reasons to use it.

If you are using a web site and you need real time data within 0.3s, you've implemented on the wrong platform. It's as simple as that. In the web world, there are few applications where a 0.3s window is notable. After all, that 0.3s window can be anywhere within the system, including the web server, the network, any front end caches, DNS resolution, etc.

I tend to agree with Matthew. Granted, there are some application domains where this can be critical; generally speaking, web serving isn't one of them.

That's why all of the solutions I offered were pointedly addressing a web server scenario and not a generalized database cache. I completely agree with you on that. In a generalized situation, the database should be managing and caching the data (which it already does).

Greg
Greg Copeland wrote: > > On Mon, 2002-03-18 at 08:15, mlw wrote: > > "Mattew T. O'Connor" wrote: > > > > [snip] > > > > > >The primary use that you have suggested is for web sites, > > > and they certainly won't mind of the cache is 0.3seconds out of date. > > > > Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL > > is a far better system. Making PostgreSQL less accurate, less "correct" takes > > away, IMHO, the very reasons to use it. > > > > If you are using a web site and you need real time data within 0.3s, > you've implemented on the wrong platform. It's as simple as that. In > the web world, there are few applications where a "0.3s" of a window is > notable. After all, that "0.3s" of a window can be anywhere within the > system, including the web server, network, any front end caches, dns > resolutions, etc. This is totally wrong! An out of date cache can cause errors by returning results that are no longer valid, thus causing lookup issues. That is what ACID compliance is all about. > > I tend to agree with Mettew. Granted, there are some application > domains where this can be critical...generally speaking, web serving > isn't one of them. > > That's why all of the solutions I offered were pointedly addressing a > web server scenario and not a generalized database cache. I completely > agree with you on that. In a generalized situation, the database should > be managing and caching the data (which it already does). But it does not cache a query. An expensive query which does an index range scan and filters by a where clause could invalidate a good number of buffers in the buffer cache. If this or a number of queries like it are frequently repeated, verbatim, in a seldom changed table, why not cache them within PostgreSQL? It would improve overall performance by preserving more blocks in the buffer cache and eliminate a number of queries being executed. I don't see how caching can be an argument of applicability. 
I can understand it from a time/work point of view, but to debate that it is a useful feature seems ludicrous.
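The verbatim-query cache being argued for above can be sketched in a few lines. This is purely illustrative Python, not backend code: results are keyed on the exact query text, and any write to a table purges every cached query that touched it. The table list per query is supplied by hand here; in the backend the parser would know it.

```python
class QueryCache:
    """Toy model of a verbatim-text query result cache."""

    def __init__(self):
        self._cache = {}     # sql text -> (result rows, tables touched)
        self._by_table = {}  # table name -> set of cached sql texts

    def get(self, sql):
        entry = self._cache.get(sql)
        return entry[0] if entry else None

    def put(self, sql, result, tables):
        self._cache[sql] = (result, frozenset(tables))
        for t in tables:
            self._by_table.setdefault(t, set()).add(sql)

    def invalidate_table(self, table):
        # Called on any INSERT/UPDATE/DELETE against `table`.
        for sql in self._by_table.pop(table, set()):
            self._cache.pop(sql, None)
```

The point of the model is the invalidation path: because the cache lives with the database, a write to `products` can drop every dependent entry immediately, which a purely application-side cache cannot do without being told.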
On Mon, 2002-03-18 at 10:08, mlw wrote:
> Greg Copeland wrote:
> > On Mon, 2002-03-18 at 08:15, mlw wrote:
> > > "Matthew T. O'Connor" wrote:
> > > > [snip]
> >
> > If you are using a web site and you need real-time data within 0.3s,
> > you've implemented on the wrong platform. It's as simple as that. In
> > the web world, there are few applications where a 0.3s window is
> > notable. After all, that 0.3s window can be anywhere within the system,
> > including the web server, network, any front-end caches, DNS
> > resolution, etc.
>
> This is totally wrong! An out-of-date cache can cause errors by returning
> results that are no longer valid, thus causing lookup issues. That is
> what ACID compliance is all about.

I understand what ACID is about. Question: was the result set valid when it was cached? Yes. So will it be valid when it's returned as a cached result set? Yes. Might it be an out-of-date view? Sure, with a horribly small window for becoming "out of date". Will it cause lookup problems? It might, but no more than what you are proposing. In the meantime, the FE-cached result set, performance-wise, is beating the pants off of the database-cached solution on both a specific workload and overall system performance.

I should point out that once the FE cache has been notified that its cache is invalid, the FE would no longer return the invalidated result set. I consider that to be a given; however, from some of your comments I get the impression that you think the invalid result set would continue to be served. Another way of thinking about it: from a validity perspective, the notification is really no different from the returned result set itself.
That is, if the notification had been the returned result set from the database, it would be accurate (which in this case means the FE cache is now dirty and treated as such); if the query is refreshed because it is now invalid, the result set is once again accurate and reflective of the database.

Example:

Database cache:
  - Query result set
  - Result set returned (cached on database)
  - Local change to database (result set cache invalid)
  - New query based on out-of-date queried result set

Application cache:
  - Query result set (cached)
  - Result set returned
  - Local change to database (app cache invalid and signaled)
  - New query based on out-of-date queried result set

Both have that problem, since transactional boundaries are hard to keep across HTTP requests. This, again, is why for web applications a FE cache is perfectly acceptable for *most* needs. Also notice that your margin for error is more or less the same.

[snip]

> I don't see how caching can be an argument of applicability. I can
> understand it from a time/work point of view, but to debate that it is a
> useful feature seems ludicrous.

I don't think I'm arguing whether it's applicable or useful. Rather, I'm saying that faster results can be yielded by implementing it in the client with far less effort than it would take to implement in the BE. I am arguing that its impact on overall system performance (though I really didn't do more than just touch on this topic) is questionable; granted, it may greatly enhance specific workloads, at the expense of others. Which shouldn't be too surprising, as trade-offs of some type are pretty common.

At this point in time, I think we've both pretty well beaten this topic up. Obviously there are two primary ways of viewing the situation. I don't think anyone is saying it's a bad idea; I think everyone is saying that it's easier to address elsewhere and that overall, the net returns may be at the expense of some other workloads.
So, unless there are new pearls to be shared and gleaned, I think the topic's been fairly well addressed. Does more need to be said?

Greg
On Sat, 2002-03-16 at 09:01, mlw wrote:
> On a web site, a few specific queries get executed, unchanged,
> repeatedly. Think about an ecommerce site: most of the time it is just a
> handful of basic queries. These basic queries are usually against pretty
> large product tables. A caching mechanism would make these queries pretty
> lightweight.
>
> The arguments against caching:
>
> "It is an application issue"
> This is completely wrong. Caching cannot be done against a database
> without knowledge of the database, i.e. when the data changes.

But can't this be achieved by using a LISTEN/NOTIFY model, with user-created rules to NOTIFY the appropriate listener when a table changes? With a good notification scheme like this, you don't need to continually poll the DB for changes. You don't need to teach your cache a lot of things about the database, since most of that knowledge is encapsulated inside the rules and supporting tables.

My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the press that it deserves. If this model isn't widely used because of some deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be better spent fixing those problems than implementing the proposed caching scheme.

If we're looking to provide a "quick and easy" caching scheme for users attracted to MySQL's query cache, why not provide this functionality through another application? I'm thinking about a generic "caching layer" that would sit in between Postgres and the database client. It could speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY to allow it to efficiently be aware of database changes; and it would create the necessary rules for the user, providing a simple interface to enable query caching for a table or a set of tables.

What does everyone think?

> OK, let me have it, tell me how terrible an idea this is. Tell me how
> wrong I am.
I think your goals are laudable (and I also appreciate the effort that you and everyone else puts into Postgres); I just think we could get most of the benefits without needing to implement potentially complex changes to Postgres internals. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
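The caching-layer idea sketched above can be modeled in miniature: a rule on each table issues a NOTIFY when the table changes, and the layer LISTENs and drops the affected entries. The in-process pub/sub below merely stands in for real LISTEN/NOTIFY over a backend connection, and the channel name is invented for illustration.

```python
# channel name -> list of invalidation callbacks (stand-in for LISTEN)
listeners = {}

def listen(channel, callback):
    """Register interest in a channel, as LISTEN would."""
    listeners.setdefault(channel, []).append(callback)

def notify(channel):
    """Deliver a notification, as a table's ON UPDATE rule firing NOTIFY would."""
    for cb in listeners.get(channel, []):
        cb()

# The caching layer keeps result sets keyed by query text...
cache = {"select * from products": [("widget", 9.99)]}

# ...and drops the entry whenever the hypothetical channel fires.
listen("products_changed", lambda: cache.pop("select * from products", None))

notify("products_changed")  # a write to `products` happened in the database
```

The appeal of this split is exactly the one argued in the thread: the backend only announces that something changed, and all cache policy stays outside it.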
On Mon, 2002-03-18 at 20:35, Neil Conway wrote: [snip] > My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the > press that it deserves. If this model isn't widely used because of some > deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be > better spent fixing those problems than implementing the proposed > caching scheme. > > If we're looking to provide a "quick and easy" caching scheme for users > attracted to MySQL's query cache, why not provide this functionality > through another application? I'm thinking about a generic "caching > layer" that would sit in between Postgres and the database client. It > could speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY > to allow it to efficiently be aware of database changes; it would create > the necessary rules for the user, providing a simple interface to > enabling query caching for a table or a set of tables? > > What does everyone think? > Yes...I was thinking that a generic library interface with a nice design pattern might meet this need rather well. Done properly, I think we can make it where all that, more or less, would be needed is application hooks which accept the result set to be cached and a mechanism to signal invalidation of the current cache....obviously that's not an exhaustive list... :) I haven't spent much time on this, but I'm fairly sure some library routines can be put together which would greatly reduce the effort of application coders to support fe-data caches and still be portable for even the Win32 port. Greg
> Yes...I was thinking that a generic library interface with a nice design
> pattern might meet this need rather well. Done properly, I think we can
> make it where all that, more or less, would be needed is application
> hooks which accept the result set to be cached and a mechanism to signal
> invalidation of the current cache....obviously that's not an exhaustive
> list... :)

A library implies that the application is running long enough to actually hear the notification. Web apps start up, read from the database, and before any cache is needed they're done and the next one starts up, reading again from the database. Only currently open connections receive the notification.

I think that you do need an entire layer... but that's not necessarily a bad thing. Have a daemon that stays connected for a long time, and when a notification arrives, rewrite the cache (or mark it dirty). Other applications can read data from static files or shared memory, or even another communication socket with your daemon.

There may be some way around running a daemon, so if you have a better solution please let me know.

I think I am in favor of client caching in general, but "mlw" (sorry, I can't find your real name in the emails at hand) makes some good points. The most important one is that we don't want to change application architectures on everyone. It's easy if you just have to add "iscachable" to a query; it's hard if you have to start writing against a different set of routines (to grab from your client cache rather than the database).

However, I am perfectly happy writing a client-side cache or using temp tables to store a result set. I also don't care that much if someone chooses PostgreSQL for their website (unless I'm responsible for its success in some way :) ). That's me personally; if you want to attract more users from MySQL, "iscachable" is very likely an attractive feature.

Regards,
Jeff
Jeff Davis wrote: > > > Yes...I was thinking that a generic library interface with a nice design > > pattern might meet this need rather well. Done properly, I think we can > > make it where all that, more or less, would be needed is application > > hooks which accept the result set to be cached and a mechanism to signal > > invalidation of the current cache....obviously that's not an exhaustive > > list... :) > > A library implies that the application is running long enough to actually > hear the notofication. Web apps start up, read from the database, and before > any cache is needed they're done and the next one starts up, reading again > from the database. Only currently open connections receive the notification. > > I think that you do need an entire layer... but that's not a bad thing > necessarily. Have a daemon that stays connected for a long time and when a > notification arrives, rewrite the cache (or mark it dirty). Other > applications can read data from static files or shared memory, or even > another communication socket with your daemon. > > There may be some way around running a daemon, so if you have a better > solution please let me know. > > I think I am in favor of client caching in general, but "mlw" (sorry, I can't > find your real name in the emails at hand) makes some good points. The most > important one is that we don't want to change application architectures on > everyone. It's easy if you just have to add "iscachable" to a query, it's > hard if you have to start writing against a different set of routines (to > grab from your client cache rather than a database). > > However, I am perfectly happy writing a client-side cache or using temp > tables to store a result set. I also don't care that much if someone chooses > PostgreSQL for their website (unless I'm responsible for it's success in some > way :) ). That's me personally, if you want to attract more users from mysql, > "iscachable" is very likely an attractive feature. 
I was thinking about this. There seems to be a consensus that caching means no ACID compliance, and everyone seems to think it needs to be limited. We can implement a non-ACID cache as a contrib function with some work to the function manager.

Right now, the function manager can only return one value, or one set of values for a column. It should be possible, though it would require a lot of research, to enable the function manager to return a set of rows. If we could get that working, it could be fairly trivial to implement a cache as a contrib project. It would work something like this:

select querycache("select * from mytable where foo='bar'");

This does two things that I would like to see: the ability to cache subselects independent of the full query, and the ability to control which queries get cached.

If we can get this row functionality into the function manager for 7.3, we could then implement MANY MANY enterprise-level functionalities (remote queries, query caching, external tables, etc.) as contrib projects rather than full-blown modifications to PostgreSQL.
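The contrib function described above would behave roughly like this memoizing wrapper. This is a Python model only: `run_query` is a stand-in for handing the text to the real executor, and a real version would also need an invalidation path, which this sketch deliberately omits.

```python
_cache = {}  # verbatim query text -> cached result rows

def run_query(sql):
    # Stand-in for actually executing the query in the backend.
    return [("row1",), ("row2",)]

def querycache(sql):
    """Return cached rows for `sql`, executing it only on the first call."""
    if sql not in _cache:
        _cache[sql] = run_query(sql)
    return _cache[sql]
```

Because the caller wraps an arbitrary query string, the same mechanism covers the subselect case from the proposal: only the inner `querycache('...')` text is memoized, while the enclosing query still runs normally.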
On a side note, is it possible that we could add "iscachable", which for now would give cache bias? That is, allow for a mechanism to indicate that the pages required for this query will be frequently needed.

I've not looked at the buffer cache implementation. Is it possible to somehow weight the corresponding pages in the cache so that it works something like this? Each query that does not use the biased pages ages the biased pages. Once the age threshold has been hit, allow the pages to be flushed per the normal page-replacement strategy. Every time the biased pages get a hit, renew the bias on the pages.

I'm not sure this holds water, but I'm thinking it would at least help ensure that the pages in question are quickly available without having to constantly re-read them from disk. What ya think? Does the cache already work like this? Doable?

On Tue, 2002-03-19 at 07:17, Jeff Davis wrote:
> > Yes...I was thinking that a generic library interface with a nice
> > design pattern might meet this need rather well. Done properly, I think
> > we can make it where all that, more or less, would be needed is
> > application hooks which accept the result set to be cached and a
> > mechanism to signal invalidation of the current cache....obviously
> > that's not an exhaustive list... :)
>
> A library implies that the application is running long enough to actually
> hear the notification. Web apps start up, read from the database, and
> before any cache is needed they're done and the next one starts up,
> reading again from the database. Only currently open connections receive
> the notification.

I think you misunderstood me. My intention was the creation of a pattern library whereby the creation of your "layer", without regard for the implementation requirements, can more easily be accomplished.
In other words, every time you need to implement this "layer" for various applications which address various problem domains, the library would serve as the heart of it, reducing the amount of common code that would otherwise have to be put in place in one form or another. Thus my reference to a design pattern. It should also be noted that some fast and slick web application architectures often support some form of context-shared persistence, which would allow for caching to be implemented even in web application space.

> I think that you do need an entire layer... but that's not necessarily a
> bad thing. Have a daemon that stays connected for a long time, and when a
> notification arrives, rewrite the cache (or mark it dirty). Other
> applications can read data from static files or shared memory, or even
> another communication socket with your daemon.

Exactly. All of which, I'm thinking, can be encompassed within a design pattern, greatly reducing the effort required for a new "layer" application requirement.

> There may be some way around running a daemon, so if you have a better
> solution please let me know.

I hadn't spent enough time thinking about it. My initial thought was to provide the support functionality and let the coder determine his own route to achieve his goal. Does he need a new daemon, or can it be built into his application? This is why a library is appealing.

> I think I am in favor of client caching in general, but "mlw" (sorry, I
> can't find your real name in the emails at hand) makes some good points.
> The most important one is that we don't want to change application
> architectures on everyone. It's easy if you just have to add "iscachable"
> to a query; it's hard if you have to start writing against a different
> set of routines (to grab from your client cache rather than the
> database).

Yes. I completely agree with that.

> However, I am perfectly happy writing a client-side cache or using temp
> tables to store a result set.
> I also don't care that much if someone chooses PostgreSQL for their
> website (unless I'm responsible for its success in some way :) ). That's
> me personally; if you want to attract more users from MySQL, "iscachable"
> is very likely an attractive feature.

And/or provide language bindings to a caching library which greatly helps facilitate this. Granted, the "iscachable" concept is certainly a powerful one.

Greg
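The page-bias idea floated above (age on miss, renew on hit, evict only once the credit is spent) can be modeled like so. The credit value of 3 and the class name are arbitrary choices for the sketch; the real buffer manager would track something like this per buffer header rather than in a side dict.

```python
BIAS_CREDIT = 3  # arbitrary age threshold for the sketch

class BufferBias:
    """Toy model of biased pages in a buffer cache replacement policy."""

    def __init__(self):
        self.bias = {}  # page id -> remaining credit

    def mark_biased(self, page):
        # The hypothetical "iscachable" hint lands here.
        self.bias[page] = BIAS_CREDIT

    def note_query(self, pages_used):
        # Renew bias on pages the query touched; age all the others.
        for page in list(self.bias):
            if page in pages_used:
                self.bias[page] = BIAS_CREDIT
            else:
                self.bias[page] -= 1
                if self.bias[page] <= 0:
                    del self.bias[page]  # falls back to normal replacement

    def evictable(self, page):
        # Pages with remaining credit are exempt from the normal LRU sweep.
        return page not in self.bias
```

A biased page thus survives a burst of unrelated queries (such as one large sequential scan) but is not pinned forever: once enough queries pass without touching it, it ages out and competes normally again.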
On Tue, 2002-03-19 at 07:46, mlw wrote:
[snip]
> Right now, the function manager can only return one value, or one set of
> values for a column. It should be possible, but require a lot of
> research, to enable the function manager to return a set of rows. If we
> could get that working, it could be fairly trivial to implement a cache
> as a contrib project. It would work something like this:
>
> select querycache("select * from mytable where foo='bar'");

Interesting concept... but how would you know when the cache has become dirty? That would give you a set of rows, but I don't understand what would let you know your result set is invalid?

Perhaps: select querycache( foobar_event, "select * from mytable where foo='bar'" ); would automatically create a listen for you??

> This does two things that I would like to see: the ability to cache
> subselects independent of the full query, and the ability to control
> which queries get cached.
>
> If we can get this row functionality into the function manager for 7.3,
> we could then implement MANY MANY enterprise-level functionalities
> (remote queries, query caching, external tables, etc.) as contrib
> projects rather than full-blown modifications to PostgreSQL.

Correct me if I'm wrong, but this concept would also be applicable to some clustering/distributed query needs too (is that what you meant by remote queries)?

Greg
On Tue, 2002-03-19 at 07:46, mlw wrote:
> I was thinking about this. There seems to be a consensus that caching
> means no ACID compliance, and everyone seems to think it needs to be
> limited. We can implement a non-ACID cache as a contrib function with
> some work to the function manager.

Until now, I hadn't really thought about it; I just took it for granted since it was asserted. However, what isn't ACID about the approach that I offered?

A - Not affected. It's read-only and provided directly from the database; thus, it's still a function of the database. Any atomic change is notified to the cache, whereby it is repopulated.

C - Not affected. The database is still responsible for keeping consistency, and the cache is still read-only. State is ensured, as invalidation is notified by the database, and the data set should be returned consistent by the database, or the database is broken.

I - Again, the database is still performing this task and notifies the cache when updates need to take place. Isolation isn't an issue because the cache is still read-only.

D - Durability isn't a question either, as, again, the database is still doing this. In the event of cache failure, the cache would be repopulated from the database, so it would be as durable as the database.

Please help me understand.

Thanks,
Greg
Jeff Davis <list-pgsql-hackers@dynworks.com> writes:
> A library implies that the application is running long enough to actually
> hear the notification. Web apps start up, read from the database, and
> before any cache is needed they're done and the next one starts up,
> reading again from the database. Only currently open connections receive
> the notification.

If your web app works this way then you already don't care about performance. People doing scalable web apps these days use connection pooling and session data kept in memory, so you already have a persistent layer running (whether it's your JVM, an Apache process for mod_perl or PHP, or whatever). Really big apps definitely have a long-running daemon process that handles caching, session management (so you can have multiple webservers), etc.

-Doug

--
Doug McNaught  Wireboard Industries  http://www.wireboard.com/
Custom software development, systems and network consulting.
Java PostgreSQL Enhydra Python Zope Perl Apache Linux BSD...
On 19 Mar 2002, Greg Copeland wrote:
> On Tue, 2002-03-19 at 07:46, mlw wrote:
> [snip]
> > Right now, the function manager can only return one value, or one set
> > of values for a column. It should be possible, but require a lot of
> > research, to enable the function manager to return a set of rows. If we
> > could get that working, it could be fairly trivial to implement a cache
> > as a contrib project. It would work something like this:
> >
> > select querycache("select * from mytable where foo='bar'");
>
> Interesting concept... but how would you know when the cache has become
> dirty? That would give you a set of rows, but I don't understand what
> would let you know your result set is invalid?
>
> Perhaps: select querycache( foobar_event, "select * from mytable where
> foo='bar'" ); would automatically create a listen for you??

Personally, I think this method of providing query caching is very messy. Why not just implement it alongside the system relation cache? This may be slightly more time-consuming, but it will perform better and will be able to take advantage of Postgres's current MVCC. There would be three times when the cache would be interacted with:

1) Add a new result set. ExecRetrieve() would need to be modified to handle a prepare-for-cache-update kind of feature. This would involve adding the tuple table slot data into a linked list. At the end of processing/transaction, and if the query was successful, the prepare-for-cache-update list could be processed by AtCommit_Cache() (called from CommitTransaction()) and the shared cache updated.

2) Attempt to get the result set from cache. Before planning in postgres.c, test whether the query will produce an already-cached result set. If so, send the data off from cache.
3) Modification of the underlying heap. Like (1), produce a list inside the executor (ExecAppend(), ExecDelete(), ExecReplace() -> RelationInvalidateHeapTuple() -> PrepareForTupleInvalidation()) which gets processed by AtEOXactInvalidationMessages(). This results in the affected entries being purged.

---

I'm not sure that cached results are a direction postgres needs to move in. But if it does, I think this is a better way to do it (given that I may have overlooked something) than modifying the function manager (argh!).

Gavin
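The three interaction points listed above amount to transaction-scoped bookkeeping. The sketch below models only that bookkeeping in Python; the method names mirror the backend routines named in the proposal, but the bodies are a toy, not backend code, and the per-query table lists are supplied by hand.

```python
class XactCache:
    """Toy model of a shared query cache driven by transaction commit."""

    def __init__(self):
        self.shared = {}        # committed cache: sql -> (rows, tables)
        self.pending_add = []   # result sets queued by the executor (1)
        self.pending_inval = [] # tables touched by writes this xact (3)

    def lookup(self, sql):
        # (2) Before planning: serve from the shared cache if present.
        entry = self.shared.get(sql)
        return entry[0] if entry else None

    def prepare_cache_update(self, sql, rows, tables):
        # (1) ExecRetrieve() queues the finished result set.
        self.pending_add.append((sql, rows, frozenset(tables)))

    def heap_modified(self, table):
        # (3) ExecAppend/ExecDelete/ExecReplace record the invalidation.
        self.pending_inval.append(table)

    def at_commit(self):
        # AtCommit_Cache(): install queued results, then purge entries
        # whose underlying tables were modified in this transaction.
        for sql, rows, tables in self.pending_add:
            self.shared[sql] = (rows, tables)
        for table in self.pending_inval:
            stale = [s for s, (_, ts) in self.shared.items() if table in ts]
            for sql in stale:
                del self.shared[sql]
        self.pending_add, self.pending_inval = [], []
```

Deferring both installs and purges to commit is what keeps the scheme compatible with MVCC as argued above: an aborted transaction leaves the shared cache untouched.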
Neil Conway <nconway@klamath.dyndns.org> writes: > My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get > the press that it deserves. If this model isn't widely used because > of some deficiencies in the LISTEN/NOTIFY implementation, IMHO our > time would be better spent fixing those problems than implementing > the proposed caching scheme. I would have to say I think a large part of the problem is lack of press---I've been hanging around pgsql-hackers for two or three years now, and until all this discussion, had never heard of LISTEN/NOTIFY. That doesn't mean I didn't miss prior mentions, but it certainly doesn't seem to come up often or get a lot of discussion when it does. Now that I know about it, well, it looks like it would be trivial to use it to implement cache invalidation logic in my HTML::Mason-based application---I need only have a long-lived process running on the web server(s) that uses the perl Pg interface, and sits listening, and when it sees notifies on given conditions, flush the appropriate local caches. I'd actually been contemplating cramming logic to do this down into the library I use for implementing the system logic, but had resisted doing it because it would make the library too tied to the web---this would be much easier. I won't even have to re-grab the results from the DB and reformat and all that crap, I can just spew the output from the last time the page was assembled---sounds better to me than what MySQL provides. Of course, I get a lot of this for free as a result of the tools I'm using, but surely this sort of thing shouldn't be all that hard to implement in other systems. Mike.
Gavin Sherry wrote:
> I'm not sure that cached results are a direction postgres needs to move
> in. But if it does, I think this is a better way to do it (given that I
> may have overlooked something) than modifying the function manager
> (argh!).

I actually had an ulterior motive. Your comment about caching not being a direction in which PostgreSQL needs to move says it all. The general rank and file seem to agree. I think caching could speed up a number of things, certainly some of the stuff I have been working on. I think it would be more likely to get some sort of caching from a contrib project than to sway the core team.

IMHO, modifying the function manager to allow the return of a full row, and a "set of" full rows, answers a LOT of issues I have seen over the years with PostgreSQL extensibility. With a full-row function API we can implement:

(1) Remote queries: select remotequery(hostname, port, 'select * from foo');
(2) External queries: select mysqlquery(hostname, port, 'select * from foo');
(3) Cached queries: select cachedquery('select * from foo');
(4) Full text search: select ftssquery(hostname, port, 'word1 and word2 and word3 not word4');

Again, with full-row functions, we could prototype/implement many advanced features in PostgreSQL as contrib projects.
On Mon, Mar 18, 2002 at 09:35:51PM -0500, Neil Conway wrote:
> > "It is an application issue"
> > This is completely wrong. Caching cannot be done against a database
> > without knowledge of the database, i.e. when the data changes.
>
> But can't this be achieved by using a LISTEN/NOTIFY model, with
> user-created rules to NOTIFY the appropriate listener when a table
> changes? With a good notification scheme like this, you don't need to
> continually poll the DB for changes. You don't need to teach your cache a
> lot of things about the database, since most of that knowledge is
> encapsulated inside the rules and supporting tables.
>
> My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the
> press that it deserves. If this model isn't widely used because of some
> deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be
> better spent fixing those problems than implementing the proposed caching
> scheme.
>
> If we're looking to provide a "quick and easy" caching scheme for users
> attracted to MySQL's query cache, why not provide this functionality
> through another application? I'm thinking about a generic "caching layer"
> that would sit in between Postgres and the database client. It could
> speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY to
> allow it to efficiently be aware of database changes; and it would create
> the necessary rules for the user, providing a simple interface to enable
> query caching for a table or a set of tables.
>
> What does everyone think?

Neil, this sounds like exactly the approach to follow up on: the one part of caching that _is_ the backend's domain is knowing about invalidation events. And LISTEN/NOTIFY has _exactly_ the right behavior for that; you don't get out-of-transaction NOTIFYs, for example. As it stands, the application developer has to have intimate knowledge of the schema to set up the correct NOTIFY triggers for any given query.
This works against developing a generic middleware solution, since one would have to parse the SQL to guess at the affected tables. How about an extension that autocreates INSERT/UPDATE/DELETE triggers that send NOTIFYs, based on all tables accessed by a given SELECT?

As an example, I could see extending the Zope PsycoPG database adaptor (which already tries some simple timeout-based caching) to tack on something like:

SELECT foo,bar FROM baz CACHENOTIFY <notifyname>

whenever it creates a cache for a given query, then setting up the correct LISTEN to invalidate that cache. Alternatively, the LISTEN could be automatic. The name might be autogenerated as well, to avoid collision problems. Or perhaps _allow_ collisions to extend the notification set? (I could see setting _all_ the queries that generate one web page to NOTIFY together, since the entire page needs to be regenerated on cache invalidation.)

Then, the existing interface to SQL queries would allow the app developer to set logical caching policies for each query, independently. The backend does only the part that it alone can do: determine all the tables touched by a given query. The middleware and app developer are then free to cache at the appropriate level (SQL result set, fully formatted web page, etc.)

This clearly is only useful in a connection-pooling environment, so the long-lived backends are around to receive the NOTIFYs. Hmm, no, I think it would be possible with this to have a separate process do the LISTEN and cache invalidation, while a pool of other backends are used for general access, no?

Seems like a win all around. Anyone else have comments? How insane would the auto trigger creation get? It seems to me that this would be similar in spirit to the referential integrity work, but more dynamic, since simple SELECTs would be creating backend triggers. There's potential for DoS attacks, for example, but not much worse I suppose than firing off big nasty Cartesian cross-product queries.

Ross
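Once the backend has told the middleware which tables a SELECT touches, generating the invalidation rules is mechanical. The helper below is a sketch under that assumption: the CACHENOTIFY spelling, the rule names, and the channel name are all hypothetical, and it emits one rule per table per write operation, all firing the same channel so the whole cached page invalidates together.

```python
def cachenotify_ddl(tables, channel):
    """Emit hypothetical CREATE RULE statements that NOTIFY `channel`
    whenever any of `tables` is written to."""
    stmts = []
    for t in tables:
        for op in ("INSERT", "UPDATE", "DELETE"):
            stmts.append(
                f"CREATE RULE {t}_{channel}_{op.lower()} AS "
                f"ON {op} TO {t} DO NOTIFY {channel};"
            )
    return stmts
```

For the `SELECT foo,bar FROM baz CACHENOTIFY pagecache` example above, `cachenotify_ddl(["baz"], "pagecache")` would yield three rules, and the caching layer would pair them with a single `LISTEN pagecache`.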
mlw wrote: > [...] > > IMHO modifying the function manager to allow the return of a full row, and a > "set of" full rows, answers a LOT of issues I have seen over the years with > PostgreSQL extensibility. Sure. Actually I think you'll have an easy project with this one, because all the work has been done by Tom already. The function manager isn't the problem any more. It is that you cannot have such a "set of" function in the rangetable. So you have no mechanism to USE the result. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Jan Wieck wrote: > > mlw wrote: > > [...] > > > > IMHO modifying the function manager to allow the return of a full row, and a > > "set of" full rows, answers a LOT of issues I have seen over the years with > > PostgreSQL extensibility. > > Sure. Actually I think you'll have an easy project with this > one, because all the work has been done by Tom already. > > The function manager isn't the problem any more. It is that > you cannot have such a "set of" function in the rangetable. > So you have no mechanism to USE the result. I'm not sure I follow you. OK, maybe I identified the wrong portion of code. The idea is that the first return value could return an array of varlenas, one for each column, then a set of varlenas, one for each column. Is there a way to return this to PostgreSQL?
On Tue, 19 Mar 2002 12:12:52 CST, "Ross J. Reedstrom" wrote: > On Mon, Mar 18, 2002 at 09:35:51PM -0500, Neil Conway wrote: > > > > > > "It is an application issue" > > > This is completely wrong. Caching can not be done against a database without > > > knowledge of the database, i.e. when the data changes. > > ... > > > > If we're looking to provide a "quick and easy" caching scheme for users > > attracted to MySQL's query cache, why not provide this functionality > > through another application? > > ... > > > > What does everyone think? > > Neil, this sounds like exactly the approach to follow up on: > ... > > Seems like a win all around. Anyone else have comments? > ... I'm not certain the full direction of the thinking here, however, it seems to me that there are a few considerations that I would like to see/keep in mind: I feel that the caching should be SQL transparent. If it is implemented reasonably well, the performance gain should be pretty much universal. Yes, a large number of queries would never be called again, however, the results still need to be fetched into memory and "caching" them for later reuse should involve little more than a skipped free (think filesystem cache). It makes more sense to specify "non-cachable" in a query for tuning than "cacheable". This also means that just switching databases to PostgreSQL improves my performance. Also, it is very important that the caching should be transparent to the application. This means that the application should be able to connect to the database using the "standard" application interface (i.e., ODBC, PHP, Perl/DBI, etc.) This allows me to port my existing Oracle/DB2/MySQL/etc. application to pgsql through normal porting. If I have to implement a non-standard interface, I can likely gain even more performance through custom code and maintain reasonable database independence. While I am a strong believer in PostgreSQL, many of my customers have other demands/requirements. 
I still want to be able to use my existing code and libraries when using their database. Sticking with the "standards" allows me to develop best of class applications and interface to best of class databases. It also allows others to easily realize the value of PostgreSQL. Thanks, F Harvell
On Tue, 2002-03-19 at 19:20, F Harvell wrote: > I feel that the caching should be SQL transparent. If it is > implemented reasonably well, the performance gain should be pretty > much universal. Well, the simple query cache scheme that is currently being proposed would use a byte-by-byte comparison of the incoming query. I think the consensus is that for a lot of workloads, this would be a bad idea. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Neil Conway wrote: > > On Tue, 2002-03-19 at 19:20, F Harvell wrote: > > I feel that the caching should be SQL transparent. If it is > > implemented reasonably well, the performance gain should be pretty > > much universal. > > Well, the simple query cache scheme that is currently being proposed > would use a byte-by-byte comparison of the incoming query. I think the > consensus is that for a lot of workloads, this would be a bad idea. And this is what I have been trying to argue. Many SQL deployments execute a set of hard-coded queries as the majority of their workload. The dynamic queries, obviously, will not be cached, but the vast majority of the work will come out of the cache.
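The byte-by-byte scheme being debated here is easy to model. The following toy Python sketch (all names are hypothetical, and the executor is a stand-in rather than a real database driver) shows both why it wins on verbatim repeats and why it is so fragile:

```python
# Toy model of an exact-match (byte-by-byte) query cache: results are
# keyed on the literal query text, so only queries repeated verbatim
# ever hit the cache. All names are hypothetical.

class ExactMatchQueryCache:
    def __init__(self, execute):
        self.execute = execute   # fallback that really runs the query
        self.cache = {}          # query text -> result rows
        self.hits = 0
        self.misses = 0

    def query(self, sql):
        if sql in self.cache:    # byte-for-byte comparison
            self.hits += 1
            return self.cache[sql]
        self.misses += 1
        result = self.execute(sql)
        self.cache[sql] = result
        return result

    def invalidate(self):
        # any write must flush the cache, or readers see stale rows
        self.cache.clear()

# A stand-in executor instead of a real database:
def run(sql):
    return [("row-for", sql)]

cache = ExactMatchQueryCache(run)
cache.query("SELECT * FROM products WHERE category = 'books'")
cache.query("SELECT * FROM products WHERE category = 'books'")  # hit
cache.query("SELECT * FROM products WHERE category='books'")    # miss
```

Note that the third query misses even though it is semantically identical to the first: a single whitespace difference defeats an exact-match cache, which is why only workloads that repeat literally identical query strings benefit.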
> > Well, the simple query cache scheme that is currently being proposed > > would use a byte-by-byte comparison of the incoming query. I think the > > consensus is that for a lot of workloads, this would be a bad idea. > > And this is what I have been trying to argue. Many SQL > deployments execute a > set of hard coded queries as the majority of the work load. The dynamic > queries, obviously, will not be cached, but the vast majority of > work will come > out of the cache.

I feel like jumping into the fray! I think first we need to decide on some facts:

1. Implementing a byte-compatible query cache WILL improve the speed of repetitive queries over static data.
2. This can be incredibly useful for some web applications.
3. It is really hard to implement such a cache whilst keeping postgres maintainable and ACID compliant.
4. An application-layer cache can be smarter and faster than a database-layer cache, and this is currently the standard way of doing things. MySQL is bringing db-layer caches to the mainstream. In a few years' time, everyone might be doing it...
5. The main developers, or in fact the people with the ability to implement such a thing, either won't do it or can't be stuffed doing it...
6. Implementing prepared statements in postgres is a reasonable, valid and standard addition that will improve performance all over the place. This might also lead to "prepared views" - another performance improvement.
7. Improving the buffer manager's LRU policy can reduce the problem of seq. scans wiping out the cache.

So, given the above, it seems to me that:

1. The main developers are more interested in implementing prepared statements - which is cool, as this is a good performance improvement.
2. The main developers can look at replacing LRU to further improve cache use.
3.
We agree that such a query cache can be useful in some circumstances and could help postgres's performance in certain environments, but the will doesn't exist to implement it at the moment and it would also be difficult and messy. Put it on the TODO list maybe. 4. If someone happens to submit a magic patch that implements query caching in a perfectly ACID-compliant way, then it should be considered for inclusion. Why the heck not? Chris
On Tue, 2002-03-19 at 21:40, Christopher Kings-Lynne wrote: > 1. Implementing a byte-compatible query cache WILL improve the speed of > repetitive queries over static data. For some specific workloads, I think it has the potential to significantly improve performance. > 3. It is really hard to implement such a cache whilst keeping postgres > maintainable and ACID compliant. If we only consider implementations within Postgres itself, this is probably true. However, I haven't seen anyone comment that there are ACID-related concerns with the NOTIFY/LISTEN scheme that has been suggested (and therefore, with the middle-tier caching daemon I proposed). > 5. The main developers, or in fact the people with the ability to implement > such a thing, either won't do it or can't be stuffed doing it... I don't think it's a particularly good idea to implement the query cache within the database itself. As for the middle-tier caching daemon I suggested, I'm working on a design but I haven't begun implementation yet. > 3. We agree that such a query cache can be useful in some circumstances and > could help postgres's performance in certain environments, but the will > doesn't exist to implement it at the moment and it would also be difficult > and messy. Put it on the TODO list maybe. I agree that a query cache implemented inside Postgres proper would be messy and of dubious value, but I haven't heard of any show-stoppers WRT my proposal (of course, if anyone knows of one, please speak up). Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
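The middleware side of the NOTIFY/LISTEN scheme Neil refers to is straightforward to model. This Python sketch (hypothetical names; the database, triggers, and the actual LISTEN wiring are simulated, not real) shows the invalidation bookkeeping such a caching daemon would need:

```python
# Sketch of the middleware half of the NOTIFY/LISTEN compromise: each
# cached query records which tables it depends on, and an incoming
# notification (as a LISTEN on a per-table channel would deliver one)
# invalidates just the dependent entries. All names are hypothetical.

class ListeningCache:
    def __init__(self):
        self.cache = {}   # sql text -> cached rows
        self.deps = {}    # table name -> set of dependent sql texts

    def store(self, sql, tables, rows):
        self.cache[sql] = rows
        for t in tables:
            self.deps.setdefault(t, set()).add(sql)

    def lookup(self, sql):
        return self.cache.get(sql)

    def on_notify(self, table):
        # NOTIFY fires at transaction commit, so entries are only
        # dropped once the modifying transaction is actually visible
        for sql in self.deps.pop(table, set()):
            self.cache.pop(sql, None)

cache = ListeningCache()
cache.store("SELECT * FROM products", ["products"], [("widget",)])
cache.store("SELECT * FROM users", ["users"], [("alice",)])

cache.on_notify("products")   # an UPDATE on products committed
```

After the notification, the products entry is gone while the users entry survives; the per-table dependency sets are what the auto-generated triggers would have to feed.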
On Tue, Mar 19, 2002 at 08:28:19PM -0500, Neil Conway wrote: > On Tue, 2002-03-19 at 19:20, F Harvell wrote: > > I feel that the caching should be SQL transparent. If it is > > implemented reasonably well, the performance gain should be pretty > > much universal. > > Well, the simple query cache scheme that is currently being proposed > would use a byte-by-byte comparison of the incoming query. I think the > consensus is that for a lot of workloads, this would be a bad idea.

Apparently, no one actually read _my_ proposal, they just replied to it. All the arguments about whether this kind of cache is any good have been thrashed out, up-thread. Apparently Mr. Harvell didn't feel the need to go back and read them. Going over them again is not productive - the next step is to see if there is anything to actually _code_ here.

Caching is a hard problem, once you start digging into it. Going from no cache to some cache is (almost) always a win, but multiple caches in a data path can interact in non-intuitive ways. And we _already_ have several well-tuned, clever caches in place. Anything that messes with them is going to be rejected, for sure.

What I proposed is a sort of compromise: it is clear to me that the core developers are not very interested in the kind of cache Neil is talking about above, and would rather see query caching done in the app. What I proposed is extending the backend's support for client-side caching, to make it easier (or possible) for middleware to automate the task. The bare bones are: flag a query in some way so the backend auto-generates the appropriate NOTIFY triggers, so the middleware can do proper cache maintenance by LISTENing.

I think I'll go away and write up my compromise proposal a little more clearly, and post it under a new subject, later. Perhaps we can then have a productive discussion about _it_, and not rehash old arguments. Ross P.S.
HACKER sociological opinion below - feel free to skip - There are only three reasons to discuss features on HACKERS: to see if a proposed feature would be rejected, so you don't waste time implementing it; to refine a proposed implementation, so it doesn't have to be reworked; and to discuss an actual in-hand implementation. Notice that there's no way to skip step one: if the CVS committers don't like the feature, arguing for it on HACKERS won't make it magically better: providing an implementation that doesn't do bad things _might_. And you can always maintain an independent patch, or fork. So, we have a number of people who think a query cache would be a good idea. And core developers who are not convinced. I think one of the reasons is that, while it might be useful in some situations (even fairly common situations) it's neither elegant nor flexible. The PostgreSQL project has a long tradition of turning down narrow, 'good enough - it works for me' solutions, while looking for a better, more inclusive solution. Sometimes this has been a problem with missing features, but in the long run, it's been a win.
mlw wrote: > Jan Wieck wrote: > > > > mlw wrote: > > > [...] > > > > > > IMHO modifying the function manager to allow the return of a full row, and a > > > "set of" full rows, answers a LOT of issues I have seen over the years with > > > PostgreSQL extensibility. > > > > Sure. Actually I think you'll have an easy project with this > > one, because all the work has been done by Tom already. > > > > The function manager isn't the problem any more. It is that > > you cannot have such a "set of" function in the rangetable. > > So you have no mechanism to USE the result. > > I'm not sure I follow you. OK, maybe I identified the wrong portion of code. > > The idea is that the first return value could return an array of varlenas, one > for each column, then a set of varlenas, one for each column. > > Is there a way to return this to PostgreSQL?

There is a way to return anything. The problem in PostgreSQL is to actually USE it. Our idea originally was to extend the capabilities of a rangetable entry. Currently, rangetable entries can only hold a relation, which is a table or a view. After rewriting, they are down to real tables only. But basically, a rangetable entry should just be a row-source, so that a function returning a row-set could occur there too. In order to avoid multiple calls to the function because of nestloops and the like, I think when a set function occurs in an RTE, its result should be dumped into a sort-tape and that is used as the row source in the rest of the plan. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
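Jan's sort-tape suggestion amounts to materializing a set-returning function's output once, so an enclosing nested loop can rescan the rows without re-invoking the function. A loose Python analogy (not backend code; all names are hypothetical):

```python
# Loose analogy to the sort-tape idea: a set-returning function used
# as a row source is called exactly once and its output materialized,
# so a nested loop can rescan the rows freely. Names are hypothetical.

calls = {"count": 0}

def expensive_set_function():
    # stands in for a "set of rows" function in a rangetable entry
    calls["count"] += 1
    return [("a", 1), ("b", 2)]

def nested_loop_join(outer, inner_rows):
    # rescans the inner row source once per outer row
    return [(o, i) for o in outer for i in inner_rows]

# Materialize once (the "sort-tape"), then rescan three times:
materialized = list(expensive_set_function())
result = nested_loop_join(["x", "y", "z"], materialized)
```

Without the materialization step, a naive nested loop would call the function once per outer row; with it, the function runs once and the join still sees all 3 x 2 row pairs.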
Jan wrote: > There is a way to return anything. The problem in PostgreSQL > is to actually USE it. > > Our idea originally was to extend the capabilities of a > rangetable entry. Currently, rangetable entries can only > hold a relation, which is a table or a view. After rewriting, > they are down to real tables only. > > But basically, a rangetable entry should just be a row- > source, so that a function returning a row-set could occur > there too. > > In order to avoid multiple calls to the function because of > nestloops and the like, I think when a set function occurs in > a RTE, it's result should be dumped into a sort-tape and that > is used as the row source in the rest of the plan. Hmmm...now that my SET NOT NULL patch is on the list, I'm thinking about what to tackle next. This is something that would be incredibly useful to me, but sounds pretty difficult (for someone unfamiliar with the code). So, some questions: 1. Can someone give me some pointers as to whereabouts I should look in the source code, and what I should be looking for, given that I've never played in the rewriter/executor before? 2. Maybe a general plan-of-attack? i.e. What things would need to be changed and what order should I change them in... 3. Tell me it's worth me spending time on this - that it's not something a main developer could just code up in an evening? 4. What stuff has Tom done that should make it 'easy'? Cheers, Chris (Sick of returning arrays and comma-delimited lists from functions!)
Greg Copeland wrote: > At this point in time, I think we've both pretty well beat this topic > up. Obviously there are two primary ways of viewing the situation. I > don't think anyone is saying it's a bad idea...I think everyone is > saying that it's easier to address elsewhere and that overall, the net > returns may be at the expense of some other work loads. So, unless > there are new pearls to be shared and gleaned, I think the topics been > fairly well addressed. Does more need to be said?

With a PREPARE/EXECUTE patch now out for approval, can I assume we will go with that first and see how far it gets us, and then revisit the idea of cached results? In this case, we are caching the query plan. The query still executes again in the executor, so the data is always fresh. In a sense, the buffer cache and disk are the caches, which don't need separate invalidation if some data changes in the table. The plan can become invalid if it calls a non-cachable function, if the schema changes, if the constants used to generate the plan in the optimizer would generate a different plan from the constants used in a later query, or if the analyze statistics change.

The MVCC ramifications of cached results and invalidation could be quite complex. The commit of a transaction could change tuple visibility rules even if the data-modifying statement was executed much earlier in the transaction.

Also, on the NOTIFY/trigger idea: triggers are called on statement end, not transaction end, so if an UPDATE query is in a multi-statement transaction, another backend looking for the NOTIFY will receive it before the transaction commits, meaning it will not see the update. That seems like a problem. We do have deferrable constraints which will only do checking on transaction end, but I am not sure if that can be used for NOTIFY on transaction commit. Anyone?
-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
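The plan-cache vs. result-cache distinction Bruce draws above can be illustrated with a toy model: preparing does the expensive parse/plan work once, but every execute reads the current data, so results never go stale. A Python sketch (all names are hypothetical; the "planning" here is a trivial stand-in for real query planning):

```python
# Toy model of PREPARE/EXECUTE: the *plan* is cached, not the
# *results*, so execution always sees current data. All names are
# hypothetical and the planning step is a stand-in.

table = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]

class PreparedStatement:
    def __init__(self, column, op):
        # expensive parse/plan work happens once, at PREPARE time
        self.predicate = {
            "=": lambda row, v: row[column] == v,
            ">": lambda row, v: row[column] > v,
        }[op]

    def execute(self, value):
        # every EXECUTE scans the *current* table contents
        return [row for row in table if self.predicate(row, value)]

stmt = PreparedStatement("price", ">")   # PREPARE once
first = stmt.execute(15)                 # EXECUTE: matches one row
table.append({"id": 3, "price": 30})     # data changes underneath
second = stmt.execute(15)                # EXECUTE again: fresh result
```

A result cache would have had to invalidate its entry when the row was appended; the plan cache needs invalidation only when the schema, statistics, or planning constants change, which is exactly the trade-off described above.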
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Also, on the NOTIFY/trigger idea, triggers are called on statement end, > not transaction end, so if an UPDATE query is in a multi-statement > transaction, another backend looking for the NOTIFY will receive it > before the transaction commits, meaning it will not see the update. No it won't. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Also, on the NOTIFY/trigger idea, triggers are called on statement end, > > not transaction end, so if an UPDATE query is in a multi-statement > > transaction, another backend looking for the NOTIFY will receive it > > before the transaction commits, meaning it will not see the update. > > No it won't. Is this because NOTIFY is held for transaction end or because the triggers are held until transaction commit? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> No it won't. > Is this because NOTIFY is held for transaction end or because the > triggers are held until transaction commit? The former. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> No it won't. > > > Is this because NOTIFY is held for transaction end or because the > > triggers are held until transaction commit? > > The former. Thanks. I see it in the NOTIFY manual page now. Very nice. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us