Re: cache control? - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: cache control?
Date
Msg-id 005401c3e40b$aa1f5860$5e00030a@LaptopDellXP
Whole thread Raw
In response to Re: cache control?  (Jan Wieck <JanWieck@Yahoo.com>)
Responses Re: cache control?
List pgsql-hackers
Jan,

Happy to continue the discussion...though without changing my suggestion
that we defer any further, more specialised improvements for now.

> Jan Wieck replied to...
> Simon Riggs wrote:
> > If we know ahead of time that a large scan is going to have this
> > effect, why wait for the ARC to play its course, why not take
> > exactly the same action?
> > Have large scans call StrategyHint also. (Maybe rename it...?)...of
> > course, some extra code to establish it IS a large scan...
> > ...large table lookup should wait until a shared catalog cache is
> > implemented
> 
> The problem with this is a) how to detect that something will be a
> large scan, and b) how to decide what is a large scan in the first
> place.
> 
> 

My thoughts are that we know immediately prior to execution whether or
not a plan calls for a full table scan (FTS). We also know the table and
therefore its size. A large table in this context is one that would
disrupt the cache if it made it onto T2. We can discuss an appropriate
and usefully simple rule, perhaps sizeoftable(T) > 2*C?
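The rule could be sketched as follows (a minimal illustration only; the
names table_size_blocks, cache_size_blocks and the default threshold are
invented for this sketch, not PostgreSQL code):

```python
# Hypothetical sketch of the proposed "large scan" test:
# a table is large when it exceeds some multiple of the cache size C.

def is_large_scan(table_size_blocks: int, cache_size_blocks: int,
                  threshold: float = 2.0) -> bool:
    """Return True if a full table scan of this table would be expected
    to disrupt the cache, i.e. sizeoftable(T) > 2*C."""
    return table_size_blocks > threshold * cache_size_blocks

# With C = 10000 buffers, a 25000-block table qualifies as large,
# while a 5000-block table does not.
```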

> Large sequential scans in warehousing are often part of more complex
> join operations. 

Yes, I agree. PostgreSQL is particularly prone to this currently,
because of the high number of plans that resolve to FTS. Complexity of
plan shouldn't affect the basic situation: we are reading all the
blocks of a table, putting them sequentially into T1, and then
working on them. Plan complexity may increase the time that a T1 block
stays in memory, with a subsequent increase in the probability of
promotion to T2.
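As a toy illustration of the T1/T2 behaviour under discussion (an
invented model for this email, not the actual ARC implementation): a
block's first reference lands in T1, and only a repeat reference while
it is still resident in T1 promotes it to T2, so a single sequential
scan merely cycles through T1:

```python
from collections import OrderedDict

# Minimal toy model of the T1/T2 promotion rule (a sketch, not the real
# ARC or PostgreSQL code): first reference -> T1 (recency list); a
# second reference while still in T1 -> promote to T2 (frequency list).

class ToyArc:
    def __init__(self, t1_capacity: int):
        self.t1 = OrderedDict()  # blocks referenced once
        self.t2 = OrderedDict()  # blocks referenced more than once
        self.t1_capacity = t1_capacity

    def reference(self, block):
        if block in self.t2:
            self.t2.move_to_end(block)       # refresh T2 MRU position
        elif block in self.t1:
            del self.t1[block]               # second hit: promote to T2
            self.t2[block] = True
        else:
            if len(self.t1) >= self.t1_capacity:
                self.t1.popitem(last=False)  # evict T1 LRU
            self.t1[block] = True

cache = ToyArc(t1_capacity=4)
for blk in range(10):          # one sequential scan of a 10-block table
    cache.reference(blk)
# The scan only cycles through T1; nothing is promoted to T2.
```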

> And just because something returns a large number of
> result rows doesn't mean that the input data was that much.

I agree also that overall execution time may be unrelated to whether a
"large" table is involved. The number of output rows shouldn't have any
effect on input rows and thus data blocks that need to be cached.

(Jan gives a detailed analysis...ending with)
> Honestly, I don't even know what type of application could possibly
> produce such a screwed access pattern. And I am absolutely confident
> one can find corner cases to wring down Oracle's complicated
> configuration harness more easily.

I agree with everything you say. The algorithm copes well with almost
every sequential pattern of access, and there is significant benefit in
ignoring the very, very rare cases that might give it problems.

My thoughts are about multiple concurrent accesses, specifically FTS on
large tables, rather than sequential ones.

> Buffers evicted from T1 are remembered in B1, and because of that even
> repeated sequential scans of the same large relation will only cycle
> through T1 blocks, never cause any turbulence in T2 or B2.

If we have a situation where a single backend makes repeated scans of
the same table, these will be sequential and, as you say, will cycle
through T1 with no effect on T2.
In a DW situation, you are likely to have one or more very popular large
tables (maybe think of this as the "Fact table", if you have a
dimensional design). The tables are large and therefore query execution
times will be extended (and accepted by the user). In this situation it
is very likely that:
i) a single user/app submits multiple requests from other
windows/threads, or simply
ii) multiple users access the popular table

The common effect will be concurrent, rather than sequential, access to
the popular table. Different SQL statements will have different plans
and will perform scans of the same table at different rates because of
other joins, more complex WHERE clauses etc. Like waves at a beach
moving at different rates. Every time one scan catches up with another,
it will cause T1 hits for almost the whole of the T1 list, promoting all
of these blocks to the top of the T2 MRU and thus spoiling the cache -
if it hits one it will probably hit most of them. This will not happen
ALL the time, but I don't want it to happen EVER. Even in a DW situation,
I still want to be inserting data regularly (that's how the table got
big!), so I have index blocks and other stuff that I want almost
permanently in memory. Concurrent access via an index might have the
same effect, though less dramatically.
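The catch-up effect can be sketched with a small, self-contained
simulation (all names and the T1-residency model here are invented for
illustration): a block is spuriously promoted when its second reference
arrives while it is still in T1, and that happens precisely while one
scan is close behind another - once the gap exceeds the T1 window,
promotions stop:

```python
# Two concurrent scans of the same table at different speeds ("waves at
# a beach"). The slow scan reads 1 block per tick, the fast scan
# fast_step blocks per tick. T1 residency is modelled crudely: a block
# stays in T1 for t1_window ticks after its first reference.

def spurious_promotions(table_blocks: int, fast_step: int = 2,
                        t1_window: int = 8) -> int:
    """Count blocks whose second reference lands while they are still
    in T1, i.e. blocks that would be spuriously promoted to T2."""
    first_ref = {}          # block -> tick of first reference
    promoted = set()
    slow = fast = tick = 0
    while slow < table_blocks or fast < table_blocks:
        refs = []
        if slow < table_blocks:
            refs.append(slow)
            slow += 1
        for _ in range(fast_step):
            if fast < table_blocks:
                refs.append(fast)
                fast += 1
        for blk in refs:
            if blk in first_ref:
                if tick - first_ref[blk] <= t1_window:
                    promoted.add(blk)   # re-hit while still in T1
            else:
                first_ref[blk] = tick
        tick += 1
    return len(promoted)
```

Only the blocks touched while the scans are within the T1 window of
each other are promoted; two scans moving in lockstep promote every
block they read.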

The closer the size of a table is to C, the greater the likelihood that
these spurious cache hits will occur. (Of course, it might be argued
that these are worthwhile and genuine cache hits - I argue that they are
not wanted, and this is the main basis of my discussion). Of course, if
a table does fit in memory then that is very good. If a table were, say,
2*C, then spurious cache hits would occur often and spoil the whole of
T2.

The DBT-3 workload is very similar to TPC-D/TPC-H workload. The test
consists of a power test (all queries sequentially in random order) and
a throughput test (1 or more concurrent streams, each stream executing
all queries in a random order). When this benchmark first came out most
vendors chose to perform the throughput test with only 1 stream (though
with parallel processing)...I would say one reason for this is poor
cache management...hence recent changes in various commercial products.

In summary, I believe there is a reasonably common effect in DW
situations where concurrent query access to large and popular tables
will result in undesirable cache spoiling. This effect will still occur
even after the ARC improvements are introduced - though in every other
case I can think of, the ARC code is a major improvement on earlier
strategies and should be hailed as a significant advance in automatic
performance adaptation.

There are two solution ideas:
i) change the code so that FTS on large tables use the "no cache"
strategy that has already been developed to support Vacuum.
ii) more complex: synchronise the FTS of the large table so that all
backends that want scans produce only one set of I/Os and share the
blocks many times (yet still don't put them in cache!). An FTS doesn't
start at "the beginning" every time; it starts wherever a current scan
has got to, then loops back round at the end (so the average of two
concurrent scans is 1.5 times as much I/O as a single FTS - a 25%
saving on I/O). A more detailed explanation may be required - this
technique is in commercial use within the Teradata rdbms. Implementing
it would take some doing...
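The 25% figure can be checked with a back-of-envelope calculation
(function and parameter names invented for this sketch): if the second
scan joins when the first is a fraction f of the way through, the rest
of the first pass is shared, and the second scan reads only the
wrap-around tail of f*N blocks alone, for N + f*N reads in total;
averaged over a uniform join point that tends to 1.5*N, versus 2*N for
two independent scans:

```python
def two_scan_io(table_blocks: int, join_fraction: float) -> int:
    """Physical reads for two synchronized circular scans where the
    second joins the first join_fraction of the way through the table.
    The remainder of the first pass is shared (read once); only the
    wrap-around tail is read a second time."""
    shared_pass = table_blocks                       # first scan reads every block
    solo_tail = round(join_fraction * table_blocks)  # second scan's wrap-around
    return shared_pass + solo_tail

# Averaged over a uniform join point this tends to ~1.5x a single scan,
# i.e. a 25% saving over two independent scans (2x).
```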

Best Regards

Simon



