Thread: Re: eWeek Poll: Which database is most critical to your
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:42 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good. There is no "prepare" in
> PostgreSQL, and therefore you will simply be re-executing the queries
> every time anyway. Also, parameter markers only work in embedded SQL,
> and that is a single-tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries, but before any real work has been done,
look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could hold off on caching the
query immediately and instead just increment a "frequency" counter
stored in the hash table. If the counter goes above a certain constant,
we decide that the query is worth caching, so we cache the full result
in the hash table.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

When processing INSERTs, UPDATEs, and DELETEs, check whether the query
would affect any of the tables for which we are maintaining this cache.
If so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries use and which columns the modifying query will affect,
but that is unlikely to be an overall win.
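[A minimal sketch of the mechanism Neil describes above, in Python for
illustration only -- the real feature would live in the backend in C,
and all class/method names here are hypothetical:]

```python
class QueryResultCache:
    """Sketch of the proposed SELECT-result cache: a frequency
    threshold decides what gets cached, and any write to a table
    flushes every cached query that reads that table."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = {}    # query text -> times seen
        self.results = {}   # query text -> cached result rows
        self.tables = {}    # query text -> set of tables the query reads

    def lookup_or_execute(self, query, read_tables, execute):
        if query in self.results:
            return self.results[query]          # cache hit: no real work
        self.counts[query] = self.counts.get(query, 0) + 1
        rows = execute(query)
        if self.counts[query] >= self.threshold:
            # Seen often enough that it is worth caching the full result.
            self.results[query] = rows
            self.tables[query] = set(read_tables)
        return rows

    def note_write(self, table):
        """On INSERT/UPDATE/DELETE, flush every cached query touching
        the affected table, so stale results are never returned."""
        stale = [q for q, ts in self.tables.items() if table in ts]
        for q in stale:
            del self.results[q], self.tables[q], self.counts[q]
```

The "optionally" part of the proposal is the `threshold`: with it set
above 1, a query is executed normally a few times before the cache
decides it is hot enough to keep.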
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert, or
delete will mean that you have to refresh. And not only that, you will
also have to track it. For sure, it will make the whole system run more
slowly rather than faster.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on
a regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these
kinds of "mostly read-only" situations are often where MySQL is chosen
over PostgreSQL. Anyway, just putting this on the table -- if the
consensus is that this isn't a very worthwhile feature, I won't bother
with it.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read-only queries, unless it is exactly the same query,
with the same parameter markers and everything, that was issued before.
That is very unusual. Normally, you won't have the entire query
hard-wired, but will allow the customer to do some sort of filtering of
the data. I do (very much) like your idea to gather statistics. Again,
this must have the complete query parsed so that you know what the
parameter markers are.
Otherwise, there will be completely different queries:

SELECT * FROM employee WHERE ssn = '777-99-1234'
SELECT * FROM employee WHERE ssn = '999-77-1234'
SELECT * FROM employee WHERE ssn = '797-97-1234'
SELECT * FROM employee WHERE ssn = '979-79-1234'
SELECT * FROM employee WHERE ssn = '779-99-1234'

when really only one query exists:

SELECT * FROM employee WHERE ssn = ?

If you do not parameterize the queries, your statistics will be
misleading at best and outright lies at worst. For example, suppose
that the above query appears 300,000 times, but with different SSNs.
And then someone runs this twice:

SELECT * FROM reports WHERE type = 'q'

The query that ran twice will be the most frequent, despite the fact
that the really frequent query happens 150,000 times as often.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
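[The normalization Dann is asking for can be sketched by replacing
literals with '?' so that queries differing only in constants share one
statistics bucket. This is a crude, purely illustrative regex version;
a real implementation would normalize from the parse tree, not the
query text:]

```python
import re

def fingerprint(query):
    """Replace string and numeric literals with '?' so the 300,000
    SSN lookups above count as one query, not 300,000 queries.
    Regex-based and deliberately naive -- a sketch only."""
    q = re.sub(r"'(?:[^']|'')*'", "?", query)   # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)    # numeric literals
    return re.sub(r"\s+", " ", q).strip()       # normalize whitespace
```

With statistics keyed on the fingerprint rather than the raw text, the
SSN query would be counted 300,000 times and the reports query twice,
as it should be.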
On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> Statistical tools are a good idea, because they can tell us where
> indexes should be added. However, you cannot simply return the result
> of the previous query, because the contents may have changed since the
> last time it was executed. It is simply invalid to do that. If some
> other system is doing that, then it isn't a relational database.

No -- as I said, any inserts, updates, or deletes that affect the table
in question will cause a full cache flush.

> How do you know whether or not someone has affected the row that you
> are reading? If you do not know, then every single update, insert or
> delete will mean that you have to refresh.

Yes, that is true.

> And not only that, you will
> also have to track it. For sure, it will make the whole system run
> more slowly rather than faster.

I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.

> Very likely, it is only my limited understanding not really grasping
> what it is that you are trying to do. Even so, I don't think it really
> helps even for read-only queries, unless it is exactly the same query
> with the same parameter markers and everything that was issued before.
> That is very unusual. Normally, you won't have the entire query
> hard-wired, but will allow the customer to do some sort of filtering
> of the data.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

(b) the feature MySQL implements is different than the one I am
describing

When I get a chance, I'll investigate further the technique used by
MySQL to see if (b) is the case.
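[Neil's claim that it is "relatively simple to determine if a query
affects a given table" can be illustrated for plain statements. This
toy sketch matches the target table off the statement text; the real
backend would read it from the parsed query tree, and -- as the thread
goes on to discuss -- this deliberately ignores rules, triggers, and
cascading foreign keys. All function names are hypothetical:]

```python
import re

def statement_target(sql):
    """Toy sketch: extract the table a simple INSERT/UPDATE/DELETE
    modifies directly. Ignores rules, triggers, and ON DELETE
    CASCADE, which can make other tables change as a side effect."""
    m = re.match(r"\s*(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+(\w+)",
                 sql, re.IGNORECASE)
    return m.group(1).lower() if m else None

def affects(sql, table):
    """Would this statement require flushing caches on `table`?"""
    return statement_target(sql) == table
```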
However, it is beginning to look like this isn't a good idea, overall.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
On Wed, 2002-02-27 at 05:23, Neil Conway wrote:
> On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> > Statistical tools are a good idea, because they can tell us where
> > indexes should be added. However, you cannot simply return the
> > result of the previous query, because the contents may have changed
> > since the last time it was executed. It is simply invalid to do
> > that. If some other system is doing that, then it isn't a
> > relational database.
>
> No -- as I said, any inserts, updates or deletes that affect the table
> in question will cause a full cache flush.
>
> > How do you know whether or not someone has affected the row that you
> > are reading? If you do not know, then every single update, insert or
> > delete will mean that you have to refresh.
>
> Yes, that is true.
>
> > And not only that, you will
> > also have to track it. For sure, it will make the whole system run
> > more slowly rather than faster.
>
> I don't think tracking changes imposes a lot of overhead -- it is
> relatively simple to determine if a query affects a given table.

Perhaps you can do it in a simple way for MySQL, which has no rules,
triggers, or foreign keys with ON DELETE CASCADE. The only way I can
think of is by putting some kind of rule or trigger on the affected
table.

> Hmmm... the more I think about it, the more unusual it would be for
> _exactly_ the same query to be repeated a lot. However, the article
> reported a significant performance gain when this feature was enabled.
> That could mean that:
>
> (a) the performance measurements/benchmarks used by the article were
> synthetic and don't reflect real database applications

I think that a Slashdot-type web application would probably benefit a
lot.

----------------
Hannu
> > (a) the performance measurements/benchmarks used by the article were
> > synthetic and don't reflect real database applications
>
> I think that a Slashdot-type web application would probably benefit a
> lot.

How's that? Slashdot serves (primarily) static pages which are
periodically generated. Anything that isn't static isn't hit very
often, or is dynamic enough that it doesn't matter (second or third
comment deep in a chat thread). It'll help generation of their front
page for those users logged in, but even then the content itself is
cached in memory on the webservers.

Their second bottleneck is using Apache with mod_perl. They'd get a lot
faster results with a custom-written webserver which serves Slashdot
pages only -- even if it was written as a daemon in Perl. Why have the
network overhead to hit a cached resource when you can have it right in
the webserver itself?

Use a table with a single row which holds a timestamp to determine when
the cache should be expired (triggers update the time on writes). That
query is done every time. That's it, though; nothing else repeats more
than 5 times unless the data changes (5, as there are 5 webservers).
But speed up that query by 2x and we can serve 2x more pages :)

create table cache_expire (lastwrite timestamp);
insert into cache_expire default values;
select lastwrite from cache_expire;

Think you can get a big boost to that?

Those phpBBs, etc. would get a much bigger boost by using a 5MB shared
memory segment to store the articles, etc. on the webserver itself, as
it avoids the network and connection congestion of the database
entirely.
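[The webserver side of the scheme described above can be sketched as
follows -- a local page cache that is dropped whenever the single
lastwrite timestamp in cache_expire (bumped by triggers on writes)
moves forward. Python for illustration; the class and parameter names
are hypothetical:]

```python
class TimestampCache:
    """Webserver-local page cache keyed on one cheap repeated query:
    'select lastwrite from cache_expire'. If the timestamp has moved
    since we last looked, something was written, so drop everything."""

    def __init__(self, fetch_lastwrite):
        self.fetch_lastwrite = fetch_lastwrite  # runs the one repeated query
        self.seen = None       # lastwrite value when cache was filled
        self.pages = {}        # page key -> rendered page

    def get(self, key, build):
        now = self.fetch_lastwrite()
        if now != self.seen:   # a write bumped the timestamp: expire all
            self.pages.clear()
            self.seen = now
        if key not in self.pages:
            self.pages[key] = build()   # rebuild from the database
        return self.pages[key]
```

This is why the single-row SELECT is the only query that runs on every
hit: everything else is served from `pages` until a write invalidates
it.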