Re: Advice: Where could I be of help? - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Advice: Where could I be of help?
Date
Msg-id DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com
Whole thread Raw
In response to Advice: Where could I be of help?  ("Curtis Faith" <curtis@galtair.com>)
Responses Re: Advice: Where could I be of help?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
tom lane wrote:
> But more globally, I think that our worst problems these days have to do
> with planner misestimations leading to bad plans.  The planner is
> usually *capable* of generating a good plan, but all too often it picks
> the wrong one.  We need work on improving the cost modeling equations
> to be closer to reality.  If that's at all close to your sphere of
> interest then I think it should be #1 priority --- it's both localized,
> which I think is important for a first project, and potentially a
> considerable win.

This seems like a very interesting problem. One of the ways that I thought
would be interesting and would solve the problem of trying to figure out the
right numbers is to have certain guesses for the actual values based on
statistics gathered during vacuum and general running and then have the
planner run the "best" plan.

Then during execution if the planner turned out to be VERY wrong about
certain assumptions the execution system could update the stats that led to
those wrong assumptions. That way the system would seek the correct values
automatically. We could also gather the stats that the system produces for
certain actual databases and then use those to make smarter initial guesses.

I've found that I can never predict costs. I always end up testing
empirically and find myself surprised at the results.

We should be able to make the executor smart enough to keep count of actual
costs (or a statistical approximation) without introducing any significant
overhead.

tom lane also wrote:
> There is no "cache flushing".  We have a shared buffer cache management
> algorithm that's straight LRU across all buffers.  There's been some
> interest in smarter cache-replacement code; I believe Neil Conway is
> messing around with an LRU-2 implementation right now.  If you've got
> better ideas we're all ears.

Hmmm, this is the area that I think could lead to huge performance gains.

Consider a simple system with a table tbl_master that gets read by each
process many times but with very infrequent inserts and that contains about
3,000 rows. The single but heavily used index for this table is contained in
a btree with a depth of three with 20 - 8K pages in the first two levels of
the btree.

Another table tbl_detail with 10 indices that gets very frequent inserts.
There are over 300,000 rows. Some queries result in index scans over the
approximatley 5,000 8K pages in the index.

There is a 40M shared cache for this system.

Everytime a query which requires the index scan runs it will blow out the
entire cache since the scan will load more blocks than the cache holds. Only
blocks that are accessed while the scan is going will survive. LRU is bad,
bad, bad!

LRU-2 might be better but it seems like it still won't give enough priority
to the most frequently used blocks. I don't see how it would do better for
the above case.

I once implemented a modified cache algorithm that was based on the clock
algorithm for VM page caches. VM paging is similar to databases in that
there is definite locality of reference and certain pages are MUCH more
likely to be requested.

The basic idea was to have a flag in each block that represented the access
time in clock intervals. Imagine a clock hand sweeping across a clock, every
access is like a tiny movement in the clock hand. Blocks that are not
accessed during a sweep are candidates for removal.

My modification was to use access counts to increase the durability of the
more accessed blocks. Each time a block is accessed it's flag is shifted
left (up to a maximum number of shifts - ShiftN ) and 1 is added to it.
Every so many cache accesses (and synchronously when the cache is full) a
pass is made over each block, right shifting the flags (a clock sweep). This
can also be done one block at a time each access so the clock is directly
linked to the cache access rate. Any blocks with 0 are placed into a doubly
linked list of candidates for removal. New cache blocks are allocated from
the list of candidates. Accesses of blocks in the candidate list just
removes them from the list.

An index root node page would likely be accessed frequently enough so that
all it's bits would be set so it would take ShiftN clock sweeps.

This algorithm increased the cache hit ratio from 40% to about 90% for the
cases I tested when compared to a simple LRU mechanism. The paging ratio is
greatly dependent on the ratio of the actual database size to the cache
size.

The bottom line that it is very important to keep blocks that are frequently
accessed in the cache. The top levels of large btrees are accessed many
hundreds (actually a power of the number of keys in each page) of times more
frequently than the leaf pages. LRU can be the worst possible algorithm for
something like an index or table scan of large tables since it flushes a
large number of potentially frequently accessed blocks in favor of ones that
are very unlikely to be retrieved again.

tom lane also wrote:
> This is an interesting area.  Keep in mind though that Postgres is a
> portable DB that tries to be agnostic about what kernel and filesystem
> it's sitting on top of --- and in any case it does not run as root, so
> has very limited ability to affect what the kernel/filesystem do.
> I'm not sure how much can be done without losing those portability
> advantages.

The kinds of things I was thinking about should be very portable. I found
that simply writing the cache in order of the file system offset results in
very greatly improved performance since it lets the head seek in smaller
increments and much more smoothly, especially with modern disks. Most of the
time the file system will create files are large sequential bytes on the
physical disks in order. It might be in a few chunks but those chunks will
be sequential and fairly large.

tom lane also wrote:
> Well, not really all that isolated.  The bottom-level index code doesn't
> know whether you're doing INSERT or UPDATE, and would have no easy
> access to the original tuple if it did know.  The original theory about
> this was that the planner could detect the situation where the index(es)
> don't overlap the set of columns being changed by the UPDATE, which
> would be nice since there'd be zero runtime overhead.  Unfortunately
> that breaks down if any BEFORE UPDATE triggers are fired that modify the
> tuple being stored.  So all in all it turns out to be a tad messy to fit
> this in :-(.  I am unconvinced that the impact would be huge anyway,
> especially as of 7.3 which has a shortcut path for dead index entries.

Well, this probably is not the right place to start then.

- Curtis



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [SQL] [GENERAL] CURRENT_TIMESTAMP
Next
From: "Curtis Faith"
Date:
Subject: Potential Large Performance Gain in WAL synching