Re: eWeek Poll: Which database is most critical to your - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: eWeek Poll: Which database is most critical to your
Date
Msg-id D90A5A6C612A39408103E6ECDD77B82906F3F3@voyager.corporate.connx.com
Whole thread Raw
Responses Re: eWeek Poll: Which database is most critical to your
List pgsql-hackers
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:42 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to
your


On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good.  There is no "prepare" in
> Postgresql
> and therefore you will simply be reexecuting the queries every time
any
> way.  Also, parameter markers only work in embedded SQL and that is a
> single tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:
   When processing SELECT queries but before any real work has been
done, lookup the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Statistical tools are a good idea, because they can tell us where
indexes should be added.  However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed.  It is simply invalid to do that.  If some
other system is doing that, then it isn't a relational database.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
   When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaing this cache. If
so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
How do you know whether or not someone has affected the row that you
are reading?  If you do not know, then every single update, insert or
delete will mean that you have to refresh.  And not only that, you will
also have to track it.  For sure, it will make the whole system run
more slowly rather than faster.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do.  Even so,  I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual.  Normally, you won't have the entire query hard-
wired, but with allow the customer to do some sort of filtering of the
data.

I do (very much) like your idea to gather statistics.  Again, this must
have the complete query plan parsed so that you know what the parameter
markers are.  Otherwise, there will be completely different queries:

SELECT * FROM employee WHERE ssn = '777-99-1234'
SELECT * FROM employee WHERE ssn = '999-77-1234'
SELECT * FROM employee WHERE ssn = '797-97-1234'
SELECT * FROM employee WHERE ssn = '979-79-1234'
SELECT * FROM employee WHERE ssn = '779-99-1234'

when really only one query exits:
SELECT * FROM employee WHERE ssn = ?

If you do not parameterize the queries your statistics will be
misleading
at best and outright lies at worst.  For example, suppose that the above
query appears 300,000 times, but with different SSN's.  And then someone
runs this twice:

SELECT * FROM reports WHERE type = 'q'

The query that ran twice will be the most frequent, despite the fact
that
the really frequent query happens 150,000 times as often.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


pgsql-hackers by date:

Previous
From: John Gray
Date:
Subject: Re: Refactoring of command.c
Next
From: Thomas Swan
Date:
Subject: Re: eWeek Poll: Which database is most critical to your