From: Sean Chittenden
Subject: Re: 7.3.1 New install, large queries are slow
Msg-id: 20030127081745.GK15936@perrin.int.nxad.com
In response to: Re: 7.3.1 New install, large queries are slow (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance

> >> Yeah, but isn't that theory a hangover from pre-Unix operating systems?
>
> > Informix, Oracle, etc. all do raw device access, bypassing the
> > kernel's buffering.  So they need heaping gobs of memory to do the
> > same thing the kernel does.
>
> D'oh, I believe Jeff's put his finger on it.  You need lotsa RAM if you
> are trying to bypass the OS.  But Postgres would like to work with the
> OS, not bypass it.
>
> > but since they know the exact patterns of data and how things will
> > be done, they can fine-tune their buffer caches to get much better
> > performance than the kernel (15-20% in Informix's case), since the
> > kernel needs to be a "works well generally" cache.
>
> They go to all that work for 15-20% ???  Remind me not to follow that
> primrose path.  I can think of lots of places where we can buy 20% for
> less work than implementing (and maintaining) our own raw-device access
> layer.

This is related somewhat to the raw device access discussion.  Below
is a quote from Matt Dillon (FreeBSD VM guru) on the topic of disk
caches (Message-Id:
<200301270657.h0R6v2qH071774@apollo.backplane.com>), with a few bits
of my own at the end:


### Begin quote
    Mmmmm.  Basically what it comes down to is that without foreknowledge
    of the data locations being accessed, it is not possible for any
    cache algorithm to adapt to all the myriad ways data might be accessed.
    If you focus the cache on one methodology it will probably perform
    terribly when presented with some other methodology.

    What this means is that for the cases where you *KNOW* how a program
    intends to access a dataset larger than main memory, it is better
    to have the program explicitly cache/not-cache the data under program
    control rather than trying to force the system cache to adapt.

    I'll also be nice and decode some of Terry's Jargon for the rest of
    the readers.

:will result in significant failure of random page replacement to
:result in cache hits; likewise, going to 85% overage will practically
:guarantee an almost 100% failure rate, as cyclical access with random
:replacement is statistically more likely, in aggregate, to replace
:the pages which are there longer (the probability is iterative and
:additive: it's effectively a permutation).

    What Terry is saying is that if you have a dataset that is 2x
    the size of your cache, the cache hit rate on that data with random
    page replacement is NOT going to be 50%.  This is because with random
    page replacement the likelihood of a piece of data being found in
    the cache depends on how long the data has been sitting in the cache.
    The longer the data has been sitting in the cache, the less likely you
    will find it when you need it (because it is more likely to have been
    replaced by the random replacement algorithm over time).

    So, again, the best caching methodology to use in the case where
    you *know* how much data you will be accessing and how you will access
    it is to build the caching directly into your program and not depend
    on system caching at all (for datasets which are far larger than
    main memory).

    This is true of all systems, not just BSD.  This is one reason why
    databases do their own caching (another is so databases know when an
    access will require I/O for scheduling reasons, but that's a different
    story).

    The FreeBSD VM caching system does prevent one process from exhausting
    another process's cached data due to a sequential access, but the
    FreeBSD VM cache does not try to outsmart sequential data accesses to
    datasets which are larger than available cache space because it's an
    insanely difficult (impossible) problem to solve properly without
    foreknowledge of what data elements will be accessed when.

    This isn't to say that we can't improve on what we have now.
    I certainly think we can.  But there is no magic bullet that will
    deal with every situation.

                        -Matt
### End quote
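
Before going any further: Matt's claim about random replacement is
easy to sanity-check.  Below is a toy simulation I threw together (my
own model, not anything from a real kernel): a cache of N pages, a
dataset of 2N pages accessed cyclically, and a uniformly random victim
chosen on each miss.  It settles at roughly a 20% hit rate, not 50%,
which is exactly his point.

    /* Toy model of the claim above: cyclic access over a dataset 2x
     * the cache size, with random page replacement on each miss. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_PAGES 1000
    #define DATA_PAGES  (2 * CACHE_PAGES)
    #define PASSES      100

    int main(void)
    {
        int  cache[CACHE_PAGES];  /* which data page each slot holds */
        int  where[DATA_PAGES];   /* which slot holds each page, or -1 */
        long hits = 0, accesses = 0;
        int  i, p;

        for (i = 0; i < CACHE_PAGES; i++) cache[i] = -1;
        for (p = 0; p < DATA_PAGES; p++)  where[p] = -1;

        for (i = 0; i < PASSES * DATA_PAGES; i++) {
            p = i % DATA_PAGES;   /* strictly cyclic access pattern */
            accesses++;
            if (where[p] != -1) {
                hits++;           /* still cached since last pass */
            } else {
                int victim = rand() % CACHE_PAGES;
                if (cache[victim] != -1)
                    where[cache[victim]] = -1;
                cache[victim] = p;
                where[p] = victim;
            }
        }
        printf("hit rate: %.1f%%\n", 100.0 * hits / accesses);
        return 0;
    }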

So if there really is only a 15-20% performance gain to be had from
using raw disk access, that 15-20% loss comes from not being able to
tell the OS what to cache, what not to cache, and what order to have
the pages in... which only really matters if there is RAM available
for the kernel to cache with, and if the kernel is able to determine
what is valuable to cache in the course of its operations.  Predictive
caching by the OS isn't done because the OS understands PostgreSQL;
it's done because the OS applies a generic algorithm for page
hits/misses.
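
The "tell the OS what to cache" part may not stay out of reach,
though.  Some systems are growing a posix_fadvise(2) call for exactly
this kind of hint.  Here's a minimal sketch of how a backend could use
it, assuming a platform that actually implements it (the function
names are mine, not anything in the tree):

    #define _XOPEN_SOURCE 600  /* for posix_fadvise on some platforms */
    #include <fcntl.h>

    /* Before a big sequential scan: ask the kernel for aggressive
     * readahead on this range of the relation's file. */
    int scan_begin(int fd, off_t start, off_t len)
    {
        return posix_fadvise(fd, start, len, POSIX_FADV_SEQUENTIAL);
    }

    /* After the scan: mark the range disposable so it doesn't evict
     * the hot pages that small queries depend on. */
    int scan_end(int fd, off_t start, off_t len)
    {
        return posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
    }

(Note that posix_fadvise() returns 0 on success and an error number on
failure, rather than setting errno.)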

What is interesting after reading this, however, is the prospect of a
15-20% speedup on certain tables that we know are accessed frequently,
by designating a set of data to be preferred in a user-space cache.
It's impossible for the OS to cache the pages that make the biggest
impact on user-visible performance, since the OS has no understanding
of which pages those are; a userland database process, however, would.
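
As a strawman, the userland side could be as dumb as a pinned cache
for a table we know is hot: read the pages once and never evict them,
so a burst of large queries can't push them out.  Everything below is
invented for illustration (it is not PostgreSQL's actual buffer
manager), but it shows the shape of the idea:

    #include <unistd.h>

    #define BLCKSZ    8192          /* PostgreSQL's disk block size */
    #define HOT_PAGES 512           /* ~4MB pinned for the hot table */

    static char hot_cache[HOT_PAGES][BLCKSZ];
    static int  hot_loaded[HOT_PAGES];

    /* Return the cached copy of block blkno of the hot table, reading
     * it from disk on first touch.  Pages are never evicted, so a
     * flood of other I/O can't take them away from us.  Returns NULL
     * for blocks outside the pinned set or on I/O error. */
    const char *read_hot_page(int fd, int blkno)
    {
        if (blkno < 0 || blkno >= HOT_PAGES)
            return NULL;
        if (!hot_loaded[blkno]) {
            if (pread(fd, hot_cache[blkno], BLCKSZ,
                      (off_t) blkno * BLCKSZ) != BLCKSZ)
                return NULL;
            hot_loaded[blkno] = 1;
        }
        return hot_cache[blkno];
    }

In real life you'd also want the region mlock(2)'d so the VM system
can't page it out from under you, but that's the flavor of it.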

As things stand, it's entirely possible for a set of large queries to
come through and wipe out the kernel's cache that smaller queries were
using.  Once the cache misses, the kernel has to fetch the data again,
which could lower the overall number of transactions per second.  That
said, this is something that an in-database scheduler could avoid by
placing a lower priority on larger, more complex queries, the
assumption being that having the smaller queries continue to process
and get in/out is more important than shaving a few seconds off of a
larger query that would deplete the cache used by the smaller queries.
Oh to be a DBA and be able to make those decisions instead of the
kernel...

Hrm, so two ideas or questions come to mind:

1) On some of my really large read-only queries, it would be SUUUPER
   nice to be able to re-nice the process from SQL land to 5, 10, or
   even 20.  IIRC, BSD's VM system is smart enough to prevent lower
   priority jobs from monopolizing the disk cache, which would let the
   smaller, faster-turnaround queries continue to exist with their
   data in the kernel's disk cache.  (Some kind of query complexity
   threshold that results in a reduction of priority, or an explicit
   directive to run at a lower priority; see the sketch after this
   list.)

2) Is there any way of specifying that a particular set of tables
   should be kept in RAM or some kind of write-through cache?  I know
   data is selected into a backend out of the catalogs, but would it
   be possible to have them kept in memory and only re-read on change
   with some kind of semaphore?  Now that all system tables are in
   their own schemas (pg_catalog and pg_toast), would it be hard to
   set a flag on a change to those tables that would cause the
   postmaster, or its children, to re-read them instead of relying on
   their cache?  With copy-on-write forking, this could be pretty
   efficient if the postmaster did this and forked off a copy with the
   tables already in memory instead of on disk.  (A rough sketch of
   one way the flag could work also follows the list.)
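
For idea 1, here's roughly what I have in mind, sketched as a
user-defined C function.  The name and all details are invented, error
handling is minimal, and remember that a non-root process can lower
its priority but cannot raise it back up:

    /* backend_renice.c -- lower the calling backend's priority.
     * Hypothetical sketch, not anything that exists in the tree. */
    #include "postgres.h"
    #include "fmgr.h"

    #include <sys/time.h>
    #include <sys/resource.h>

    PG_FUNCTION_INFO_V1(backend_renice);

    Datum
    backend_renice(PG_FUNCTION_ARGS)
    {
        int32 prio = PG_GETARG_INT32(0);

        /* who == 0 means the calling process, i.e. this backend */
        if (setpriority(PRIO_PROCESS, 0, prio) != 0)
            elog(ERROR, "backend_renice: setpriority() failed");

        PG_RETURN_INT32(prio);
    }

Registered with something like:

    CREATE FUNCTION backend_renice(integer) RETURNS integer
        AS 'MODULE_PATHNAME', 'backend_renice' LANGUAGE 'C';

after which a big report query could lead off with SELECT
backend_renice(20); and stay out of the small queries' way.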
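
And for idea 2, the cheapest version of the "re-read on change" flag I
can come up with is a generation counter in shared memory: any backend
that modifies the catalogs bumps it, and every other backend reloads
its private copy when it notices its snapshot is stale.  Again, all of
this is invented for illustration and glosses over locking entirely:

    /* Hypothetical: catalog_gen lives in shared memory and is bumped
     * on any change to the pg_catalog tables; reload_catalog_copy()
     * stands in for whatever re-reads them from disk. */
    extern volatile unsigned *catalog_gen;    /* in shared memory */
    extern void reload_catalog_copy(void);    /* assumed to exist */

    static unsigned seen_gen;   /* this backend's last-seen counter */

    void maybe_refresh_catalogs(void)
    {
        unsigned g = *catalog_gen;

        if (g != seen_gen) {    /* someone changed the catalogs */
            reload_catalog_copy();
            seen_gen = g;
        }
    }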

Just a few ideas/ramblings, hope someone finds them interesting... the
renice function is one that I think I'll spend some time looking into
shortly, actually.  -sc

--
Sean Chittenden
