Thread: Re: eWeek Poll: Which database is most critical to your

Re: eWeek Poll: Which database is most critical to your

From: "Dann Corbit"
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:42 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your


On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good.  There is no "prepare" in
> PostgreSQL, and therefore you will simply be re-executing the queries
> every time anyway.  Also, parameter markers only work in embedded SQL,
> and that is a single-tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:
   When processing SELECT queries, but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, rather than caching the query
immediately, we could just increment a "frequency" counter stored in the
hash table. If the counter goes above a certain threshold, we decide that
the query is worth caching, so we cache the full result in the hash
table.
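
For concreteness, a minimal toy version of what Neil is proposing might
look like the following: a chained hash table keyed on the raw query
text, with a per-query frequency counter that defers caching until the
query has repeated a few times. All names and the threshold here are
made up for illustration; none of this is PostgreSQL code.

/* Toy result cache: look the query text up in a hash table; once a
 * query has been seen CACHE_AFTER times, keep its rendered result and
 * serve that copy on later lookups. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS    256
#define CACHE_AFTER 3          /* cache once a query repeats this often */

typedef struct CacheEntry {
    char              *query;  /* exact query text, used as the key */
    int                hits;   /* how often we have seen this query */
    char              *result; /* cached result, NULL until cached */
    struct CacheEntry *next;
} CacheEntry;

static CacheEntry *buckets[NBUCKETS];

static unsigned hash_query(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

/* Return the cached result, or NULL if the caller must run the query. */
static const char *cache_lookup(const char *query)
{
    unsigned    b = hash_query(query);
    CacheEntry *e;

    for (e = buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->query, query) == 0) {
            e->hits++;
            return e->result;  /* still NULL below the threshold */
        }

    /* First sighting: remember the query, but don't cache it yet. */
    e = malloc(sizeof *e);
    e->query = strdup(query);
    e->hits = 1;
    e->result = NULL;
    e->next = buckets[b];
    buckets[b] = e;
    return NULL;
}

/* Called after executing a query, with its rendered result. */
static void cache_store(const char *query, const char *result)
{
    CacheEntry *e;

    for (e = buckets[hash_query(query)]; e != NULL; e = e->next)
        if (strcmp(e->query, query) == 0 && e->hits >= CACHE_AFTER) {
            free(e->result);
            e->result = strdup(result);
            return;
        }
}

int main(void)
{
    int i;

    for (i = 0; i < 5; i++) {
        const char *r = cache_lookup("SELECT * FROM foo");
        if (r == NULL) {
            /* pretend to execute the query here */
            cache_store("SELECT * FROM foo", "rows of foo");
            puts("executed");
        } else
            puts("served from cache");
    }
    return 0;
}
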
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Statistical tools are a good idea, because they can tell us where
indexes should be added.  However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed.  It is simply invalid to do that.  If some
other system is doing that, then it isn't a relational database.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
   When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
How do you know whether or not someone has affected the row that you
are reading?  If you do not know, then every single update, insert or
delete will mean that you have to refresh.  And not only that, you will
also have to track it.  For sure, it will make the whole system run
more slowly rather than faster.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do.  Even so, I don't think it really
helps even for read-only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual.  Normally, you won't have the entire query hard-
wired, but will allow the customer to do some sort of filtering of the
data.

I do (very much) like your idea to gather statistics.  Again, this
requires the complete query to be parsed so that you know where the
parameter markers are.  Otherwise, these will be treated as completely
different queries:

SELECT * FROM employee WHERE ssn = '777-99-1234'
SELECT * FROM employee WHERE ssn = '999-77-1234'
SELECT * FROM employee WHERE ssn = '797-97-1234'
SELECT * FROM employee WHERE ssn = '979-79-1234'
SELECT * FROM employee WHERE ssn = '779-99-1234'

when really only one query exists:
SELECT * FROM employee WHERE ssn = ?

If you do not parameterize the queries, your statistics will be
misleading at best and outright lies at worst.  For example, suppose
that the above query appears 300,000 times, but with different SSNs.
And then someone runs this twice:

SELECT * FROM reports WHERE type = 'q'

The query that ran twice will rank as the most frequent, despite the
fact that the really frequent query happens 150,000 times as often.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
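
Dann's objection is easy to demonstrate with a naive normalizer that
collapses literals into parameter markers before the query text is
hashed or counted, so that the five ssn queries above become one shape.
The function name below is made up, and real SQL lexing rules
(comments, escaped string syntax, casts, and so on) are deliberately
ignored; this is only a sketch of the idea.

#include <ctype.h>
#include <stdio.h>

/* Replace quoted strings and bare numbers in src with '?', writing the
 * normalized text into dst (which must be at least as large as src). */
static void normalize_query(const char *src, char *dst)
{
    int in_word = 0;           /* mid-identifier, so "log2" keeps its 2 */

    while (*src) {
        if (*src == '\'') {                     /* string literal */
            src++;
            while (*src) {
                if (src[0] == '\'' && src[1] == '\'')
                    src += 2;                   /* '' is an escaped quote */
                else if (*src++ == '\'')
                    break;
            }
            *dst++ = '?';
            in_word = 0;
        } else if (isdigit((unsigned char) *src) && !in_word) {
            while (isdigit((unsigned char) *src) || *src == '.')
                src++;                          /* numeric literal */
            *dst++ = '?';
        } else {
            in_word = isalnum((unsigned char) *src) || *src == '_';
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

int main(void)
{
    char buf[64];

    normalize_query("SELECT * FROM employee WHERE ssn = '777-99-1234'", buf);
    puts(buf);    /* SELECT * FROM employee WHERE ssn = ? */
    normalize_query("SELECT * FROM reports WHERE type = 'q'", buf);
    puts(buf);    /* SELECT * FROM reports WHERE type = ? */
    return 0;
}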


Re: eWeek Poll: Which database is most critical to your

From: Neil Conway
On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> Statistical tools are a good idea, because they can tell us where
> indexes should be added.  However, you cannot simply return the result
> of the previous query, because the contents may have changed since the
> last time it was executed.  It is simply invalid to do that.  If some
> other system is doing that, then it isn't a relational database.

No -- as I said, any inserts, updates or deletes that affect the table
in question will cause a full cache flush.

> How do you know whether or not someone has affected the row that you
> are reading?  If you do not know, then every single update, insert or
> delete will mean that you have to refresh.

Yes, that is true.

>  And not only that, you will
> also have to track it.  For sure, it will make the whole system run 
> more slowly rather than faster.

I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.
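
A sketch of that bookkeeping, assuming the parser has already told us
which table a statement reads or writes (inside the backend the parsed
query carries this information); all of the names below are
hypothetical, and the flush is as coarse as Neil describes:

#include <stdio.h>
#include <string.h>

#define MAX_CACHED_TABLES 64

static char cached_tables[MAX_CACHED_TABLES][64];
static int  ncached;

/* Record that some cached result set reads from this table. */
static void cache_note_read(const char *table)
{
    int i;

    for (i = 0; i < ncached; i++)
        if (strcmp(cached_tables[i], table) == 0)
            return;
    if (ncached < MAX_CACHED_TABLES)
        strncpy(cached_tables[ncached++], table, 63);
}

/* Called for every INSERT/UPDATE/DELETE with its target table:
 * flush if any cached query depends on that table. */
static void cache_note_write(const char *table)
{
    int i;

    for (i = 0; i < ncached; i++)
        if (strcmp(cached_tables[i], table) == 0) {
            printf("write to %s: flushing the result cache\n", table);
            ncached = 0;       /* coarse: drop everything */
            return;
        }
}

int main(void)
{
    cache_note_read("employee");   /* a SELECT on employee was cached */
    cache_note_write("reports");   /* unrelated write: cache survives */
    cache_note_write("employee");  /* this one flushes everything     */
    return 0;
}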

> Very likely, it is only my limited understanding not really grasping 
> what it is that you are trying to do.  Even so, I don't think it really
> helps even for read-only queries, unless it is exactly the same query
> with the same parameter markers and everything that was issued before.
> That is very unusual.  Normally, you won't have the entire query hard-
> wired, but will allow the customer to do some sort of filtering of the
> data.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:
   (a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications
   (b) the feature MySQL implements is different than the one I am
describing

When I get a chance I'll investigate further the technique used by MySQL
to see if (b) is the case. However, it is beginning to look like this
isn't a good idea overall.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC



Re: eWeek Poll: Which database is most critical to your

From: Hannu Krosing
On Wed, 2002-02-27 at 05:23, Neil Conway wrote:
> On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> > Statistical tools are a good idea, because they can tell us where
> > indexes should be added.  However, you cannot simply return the result
> > of the previous query, because the contents may have changed since the
> > last time it was executed.  It is simply invalid to do that.  If some
> > other system is doing that, then it isn't a relational database.
> 
> No -- as I said, any inserts, updates or deletes that affect the table
> in question will cause a full cache flush.
> 
> > How do you know whether or not someone has affected the row that you
> > are reading?  If you do not know, then every single update, insert or
> > delete will mean that you have to refresh.
> 
> Yes, that is true.
> 
> >  And not only that, you will
> > also have to track it.  For sure, it will make the whole system run 
> > more slowly rather than faster.
> 
> I don't think tracking changes imposes a lot of overhead -- it is
> relatively simple to determine if a query affects a given table.

Perhaps you can do it in a simple way for MySQL, which has no
rules/triggers/foreign keys with ON DELETE CASCADE.

The only way I can think of is to put some kind of rule or trigger on
the affected table.

> 
> Hmmm... the more I think about it, the more unusual it would be for
> _exactly_ the same query to be repeated a lot. However, the article
> reported a significant performance gain when this feature was enabled.
> That could mean that:
> 
>     (a) the performance measurements/benchmarks used by the article were
> synthetic and don't reflect real database applications

I think that a Slashdot-type web application would probably benefit a
lot.
----------------
Hannu



Re: eWeek Poll: Which database is most critical to your

From: "Rod Taylor"
> >     (a) the performance measurements/benchmarks used by the article
> > were synthetic and don't reflect real database applications
>
> I think that a Slashdot-type web application would probably benefit a
> lot.

How's that?  Slashdot serves (primarily) static pages which are
periodically generated.  Anything that isn't static isn't hit very
often, or is dynamic enough that it doesn't matter (a comment two or
three levels deep in a discussion thread).  It'll help generate the
front page for logged-in users, but even then the content itself is
cached in memory on the webservers.  Their second bottleneck is using
Apache with mod_perl.  They'd get a lot faster results with a custom-
written webserver which serves Slashdot pages only -- even if it was
written as a daemon in Perl.

Why have the network overhead to hit a cached resource when you can
have it right in the webserver itself?


Use a table with a single row holding a timestamp to determine when
the cache should be expired (triggers update the time on writes).
That query is run every time.  That's it, though; nothing else repeats
more than five times unless the data changes (five, as there are five
webservers).  But speed up that query by 2x and we can serve 2x more
pages :)

create table cache_expire (lastwrite timestamp);
insert into cache_expire default values;
select lastwrite from cache_expire;

Think you can get a big boost to that?

Those phpBBs, etc. would get a much bigger boost by using a 5MB
shared memory segment to store the articles on the webserver itself,
as it avoids the network and connection congestion of the database
entirely.
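
That last idea can be pictured with a minimal System V shared memory
sketch; error handling is mostly trimmed, and the key, size, and
content are arbitrary placeholders.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>
#include <stdio.h>

#define CACHE_KEY  0x5EB0               /* arbitrary well-known key */
#define CACHE_SIZE (5 * 1024 * 1024)    /* the 5MB segment */

int main(void)
{
    int   id;
    char *cache;

    /* Every worker process attaches to the same well-known segment. */
    id = shmget(CACHE_KEY, CACHE_SIZE, IPC_CREAT | 0600);
    if (id < 0)
        return 1;
    cache = shmat(id, NULL, 0);

    /* Render a page into the segment; in practice one worker writes
     * and the others read, with some locking around updates. */
    strcpy(cache, "<html>...rendered article...</html>");
    printf("serving from local memory: %s\n", cache);

    shmdt(cache);
    return 0;
}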