Thread: Seq scans roadmap

Seq scans roadmap

From
Heikki Linnakangas
Date:
Here's my roadmap for the "scan-resistant buffer cache" and 
"synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the 
freelist, forcing a lot of WAL flushes. Instead, use a backend-private 
ring of shared buffers that are recycled. This is what Simon's 
"scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to 
be accessed in the near future, therefore it should be recycled. If 
vacuum doesn't dirty the page, it's best to reuse the buffer immediately 
for the next page. However, if the buffer is dirty (and not just because 
we set hint bits), we ought to delay writing it to disk until the 
corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I 
think the ring should be dynamically sized. Start with a small ring, and 
whenever you need to do a WAL flush to write a dirty buffer, increase 
the ring size. On every full iteration through the ring, decrease its 
size to trim down an unnecessarily large ring.
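
To make the grow/shrink policy concrete, here is a minimal sketch in
standalone C. The helpers buffer_needs_wal_flush() and
alloc_shared_buffer() are hypothetical stand-ins for the real buffer
manager calls, not actual PostgreSQL APIs, and the bounds are made up:

#include <stdbool.h>

#define RING_MIN 4
#define RING_MAX 256

extern bool buffer_needs_wal_flush(int buf); /* hypothetical: dirty, WAL not flushed */
extern int  alloc_shared_buffer(void);       /* hypothetical: grab a free shared buffer */

typedef struct
{
    int buffers[RING_MAX];  /* shared-buffer ids owned by this ring */
    int size;               /* current ring size */
    int next;               /* next slot to recycle */
} BufferRing;

/*
 * Pick the next buffer to reuse. Grow the ring when recycling the
 * buffer would force a WAL flush; shrink it one step on every full
 * pass so an unnecessarily large ring trims itself back down.
 */
static int
ring_get_buffer(BufferRing *ring)
{
    int slot;

    if (ring->next == 0 && ring->size > RING_MIN)
        ring->size--;                       /* trim once per full cycle */

    slot = ring->next;

    if (buffer_needs_wal_flush(ring->buffers[slot]) && ring->size < RING_MAX)
    {
        /* Widen the ring instead of flushing WAL right now; by the
         * next pass the WAL record has probably been flushed anyway. */
        ring->buffers[ring->size] = alloc_shared_buffer();
        slot = ring->size++;
    }

    ring->next = (slot + 1) % ring->size;
    return ring->buffers[slot];
}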

This only alters the behavior of vacuums, and it's pretty safe to say it 
won't get worse than what we have now. In the future, we can use the 
buffer ring for seqscans as well; more on that in step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is 
Jeff's "synchronized scans" patch. When a seq scan starts on a table 
larger than some threshold, it starts from where the previous seq scan 
is currently, or where it ended. This will synchronize the scans so that 
for two concurrent scans the total I/O is halved in the best case. There 
should be no other effect on performance.

If you have a partitioned table, or union of multiple tables or any 
other plan where multiple seq scans are performed in arbitrary order, 
this change won't change the order in which the partitions are scanned,
and therefore won't ensure that they will be synchronized.
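
For illustration, a minimal sketch of the scan-position table in
standalone C; the names, the slot count, and the start threshold are
assumptions for the sketch, not the patch's actual interface:

#include <stdint.h>

#define SYNC_SCAN_SLOTS 64
#define SYNC_SCAN_THRESHOLD_PAGES 1024  /* assumed "large table" threshold */

typedef struct
{
    uint32_t relid;    /* which table */
    uint32_t blockno;  /* last reported scan position */
} ScanPosition;

/* In reality this would live in shared memory, with some locking. */
static ScanPosition scan_positions[SYNC_SCAN_SLOTS];

/* Report where our scan currently is, so later scans can join in. */
void
ss_report_location(uint32_t relid, uint32_t blockno)
{
    scan_positions[relid % SYNC_SCAN_SLOTS] =
        (ScanPosition){ relid, blockno };
}

/* Pick a start page: piggyback on an ongoing/previous scan if the
 * table is large enough, else start at block 0 as before. */
uint32_t
ss_get_start_block(uint32_t relid, uint32_t nblocks)
{
    ScanPosition *p = &scan_positions[relid % SYNC_SCAN_SLOTS];

    if (nblocks < SYNC_SCAN_THRESHOLD_PAGES || p->relid != relid)
        return 0;
    return p->blockno % nblocks;  /* wrap around at end of table */
}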


Now that we have both pieces of the puzzle in place, it's time to 
consider what more we can do with them:


3A. To take advantage of the "cache trail" of a previous seq scan, scan 
backwards from where the previous seq scan ended, until you hit a 
buffer that's not in cache.
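
A sketch of the backward probe, assuming a hypothetical
probe_buffer_cache() that checks the buffer mapping table without doing
any I/O:

#include <stdbool.h>
#include <stdint.h>

extern bool probe_buffer_cache(uint32_t relid, uint32_t blockno); /* hypothetical */

/*
 * Walk backwards from where the previous scan ended to find the start
 * of its cached trail; the new scan then begins there and reads the
 * cached pages before it has to hit the disk.
 */
uint32_t
find_cache_trail_start(uint32_t relid, uint32_t prev_scan_end)
{
    uint32_t blk = prev_scan_end;

    while (blk > 0 && probe_buffer_cache(relid, blk - 1))
        blk--;

    return blk;  /* == prev_scan_end when nothing is cached */
}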

This will allow taking advantage of the buffer cache even if the table 
doesn't fit completely in RAM. That can make a big difference if the 
table size is just slightly bigger than RAM, and can avoid the nasty 
surprise when a table grows beyond RAM size and queries start taking 
minutes instead of seconds.

This should be a non-controversial change on its own from a performance 
point of view. No query should get slower, and some will become faster. 
But see step 3B:

3B. Currently, a sequential scan on a large table spoils the buffer cache 
by evicting other pages from the cache. In CVS HEAD, as soon as the 
table is larger than shared_buffers, the pages in the buffer won't be 
used to speed up running the same query again, and there's no reason to 
believe the pages read in would be more useful than any other page in 
the database, and in particular the pages that were in the buffer cache 
before the huge seq scan. If the table being scanned is > 5 * 
shared_buffers, the scan will evict every other page from the cache if 
there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large, 
then even with the change 3A to read the pages that are in cache first, 
using all of shared_buffers to cache the table will only speed up the 
query by 10%. We should not spoil the cache for such a small gain; we 
should use the local buffer ring strategy instead. It's better to make 
queries that are slow anyway a little bit slower than to make queries 
that are normally really fast slow.


As you may notice, 3A and 3B are at odds with each other. We can 
implement both, but you can't use both strategies in the same scan. 
Therefore we need to have decision logic of some kind to figure out 
which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1 * shared_buffers  -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers    -> start from last read page, keep in cache
> 5 * shared_buffers    -> start from last read page, use buffer ring

I'm not sure about the constants; we might need to make them GUC 
variables as Simon argued, but that would be the general approach.
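
Expressed as code, the heuristic might look like this (just a sketch;
the thresholds are the placeholder constants from above, with 0.1
approximated as an integer division):

#include <stdint.h>

typedef enum
{
    SCAN_FROM_ZERO_CACHE,  /* < 0.1 x shared_buffers: current behavior */
    SCAN_SYNC_CACHE,       /* < 5 x shared_buffers: sync start, keep in cache */
    SCAN_SYNC_RING         /* > 5 x shared_buffers: sync start, buffer ring */
} ScanStrategy;

ScanStrategy
choose_scan_strategy(uint32_t table_pages, uint32_t shared_buffers)
{
    if (table_pages < shared_buffers / 10)
        return SCAN_FROM_ZERO_CACHE;
    if (table_pages < 5 * shared_buffers)
        return SCAN_SYNC_CACHE;
    return SCAN_SYNC_RING;
}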

In the future, I'm envisioning a smarter algorithm to size the local 
buffer ring. Take all buffers with usage_count=0, plus a few with 
usage_count=1 from the clock sweep. That way, if there are a lot of buffers 
in the buffer cache that are seldom used, we'll use more buffers to 
cache the large scan, and vice versa. And no matter how large the scan, 
it wouldn't blow all buffers from the cache. But if you execute the same 
query again, the buffers left in the cache from the last scan were 
apparently useful, so we use a bigger ring this time.
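
A rough sketch of that sizing pass, with get_usage_count() and
claim_buffer() as hypothetical helpers for the clock-sweep internals:

extern int get_usage_count(int buf);  /* hypothetical clock-sweep peek */
extern int claim_buffer(int buf);     /* hypothetical: take buffer for the ring */

/*
 * Size the ring from one sweep: take every buffer with
 * usage_count == 0 plus a small quota of usage_count == 1 buffers.
 * A cold cache yields a big ring, a busy cache a small one.
 */
int
size_ring_from_sweep(int nbuffers, int *ring, int uc1_quota)
{
    int n = 0;

    for (int buf = 0; buf < nbuffers; buf++)
    {
        int uc = get_usage_count(buf);

        if (uc == 0 || (uc == 1 && uc1_quota-- > 0))
            ring[n++] = claim_buffer(buf);
    }
    return n;
}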

I'm going to do this incrementally, and we'll see how far we get for 
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up 
Simon's patch (step 1), run some performance tests with vacuum, and 
submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
Heikki,

On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching.  Experiments have shown that the benefit
of re-using the PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.

On 3B: The scenario described is "multiple readers seq scanning large
table and sharing bufcache", but in practice this is not a common
situation.  The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache".  In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.

To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB -> 2MB, which is only important
when considering a bound on the size of the ring buffer.  The effect has
been demonstrated to be significant - in the 20%+ range.  Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads.  I
would expect benefits on OLTP as well.

The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution

- Luke




Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Luke Lonergan wrote:
> On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
> implement dynamic I/O caching.  The experiments have shown that benefit
> of re-using PG buffer cache on large sequential scans is vanishingly
> small when the buffer cache size is small compared to the system memory.
> Since this is a normal and recommended situation (OS I/O cache is
> auto-tuning and easy to administer, etc), IMO the effort to optimize
> buffer cache reuse for seq scans > 1 x buffer cache size is not
> worthwhile.

That's interesting. Care to share the results of the experiments you
ran? I was thinking of running tests of my own with varying table sizes.

The main motivation here is to avoid the sudden drop in performance when
a table grows big enough not to fit in RAM. See attached diagram for
what I mean. Maybe you're right and the effect isn't that bad in practice.

I'm thinking of attacking 3B first anyway, because it seems much simpler
to implement.

> On 3B: The scenario described is "multiple readers seq scanning large
> table and sharing bufcache", but in practice this is not a common
> situation.  The common situation is "multiple queries joining several
> small tables to one or more large tables that are >> 1 x bufcache".  In
> the common scenario, the dominant factor is the ability to keep the
> small tables in bufcache (or I/O cache for that matter) while running
> the I/O bound large table scans as fast as possible.

How is that different from what I described?

> To that point - an important factor in achieving max I/O rate for large
> tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
> This is commonly in the range of 512KB -> 2MB, which is only important
> when considering a bound on the size of the ring buffer.  The effect has
> been demonstrated to be significant - in the 20%+ range.  Another thing
> to consider is the use of readahead inside the heapscan, in which case
> sizes >= 32KB are very effective.

Yeah I remember the discussion on the L2 cache a while back.

What do you mean with using readahead inside the heapscan? Starting an
async read request?

> The modifications you suggest here may not have the following
> properties:
> - don't pollute bufcache for seqscan of tables > 1 x bufcache
> - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
> minimize L2 cache pollution

So the difference is that you don't want the 3A strategy (taking 
advantage of pages already in the buffer cache) at all, and want the 
buffer ring strategy to kick in earlier instead. Am I reading you 
correctly?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
Heikki,

> That's interesting. Care to share the results of the
> experiments you ran? I was thinking of running tests of my
> own with varying table sizes.

Yah - it may take a while - you might get there faster.

There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM.  "SELECT
COUNT(*)" measures CPU consumption in the aggregation node, not scan
rate.

Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.

> The main motivation here is to avoid the sudden drop in
> performance when a table grows big enough not to fit in RAM.
> See attached diagram for what I mean. Maybe you're right and
> the effect isn't that bad in practice.

There are going to be two performance drops: first when the table
doesn't fit into PG bufcache, and second when it doesn't fit in bufcache
+ I/O cache.  The second is severe, the first is almost insignificant
(for common queries).

> How is that different from what I described?

My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the "first load" of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument.  I think this is an uncommon
workload; rather, there are normally many small tables and several large
tables such that sharing the PG bufcache is irrelevant to the query
speed.

> Yeah I remember the discussion on the L2 cache a while back.
>
> What do you mean with using readahead inside the heapscan?
> Starting an async read request?

Nope - just reading N buffers ahead for seqscans.  Subsequent calls use
previously read pages.  The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
> > The modifications you suggest here may not have the following
> > properties:
> > - don't pollute bufcache for seqscan of tables > 1 x bufcache
> > - for tables > 1 x bufcache use a ringbuffer for I/O that
> is ~ 32KB to
> > minimize L2 cache pollution
>
> So the difference is that you don't want 3A (the take
> advantage of pages already in buffer cache) strategy at all,
> and want the buffer ring strategy to kick in earlier instead.
> Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).

- Luke



Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Luke Lonergan wrote:
>> What do you mean with using readahead inside the heapscan? 
>> Starting an async read request?
> 
> Nope - just reading N buffers ahead for seqscans.  Subsequent calls use
> previously read pages.  The objective is to issue contiguous reads to
> the OS in sizes greater than the PG page size (which is much smaller
> than what is needed for fast sequential I/O).

Are you filling multiple buffers in the buffer cache with a single 
read-call? The OS should be doing readahead for us anyway, so I don't 
see how just issuing multiple ReadBuffers one after another helps.

> Yes, I think the ring buffer strategy should be used when the table size
> is > 1 x bufcache and the ring buffer should be of a fixed size smaller
> than L2 cache (32KB - 128KB seems to work well).

I think we want to let the ring grow larger than that for updating 
transactions and vacuums, though, to avoid the WAL flush problem.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> Nope - just reading N buffers ahead for seqscans.  Subsequent
> calls use previously read pages.  The objective is to issue
> contiguous reads to the OS in sizes greater than the PG page
> size (which is much smaller than what is needed for fast
> sequential I/O).

Problem here is that either you issue the large read into throwaway
private memory and hope that when you later read 8k you get the page
from the OS buffercache, or you need ScatterGather IO and a way to grab 32
buffers at once.
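
For the scatter/gather case, a sketch of what the read could look like
using preadv(2) where the platform provides it; pages[] would be the 32
buffers grabbed in advance, and the names here are illustrative only:

#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192
#define READ_CHUNK 32           /* 32 x 8k = 256k per system call */

/*
 * Fill 32 already-pinned buffer pages with one contiguous read.
 * The single 256k request reaches the disk as one large IO instead
 * of 32 page-sized ones.
 */
ssize_t
read_chunk(int fd, off_t startblock, char *pages[READ_CHUNK])
{
    struct iovec iov[READ_CHUNK];

    for (int i = 0; i < READ_CHUNK; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len  = BLCKSZ;
    }
    return preadv(fd, iov, READ_CHUNK, startblock * BLCKSZ);
}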

> Yes, I think the ring buffer strategy should be used when the
> table size is > 1 x bufcache and the ring buffer should be of
> a fixed size smaller than L2 cache (32KB - 128KB seems to work well).

How do you merge those two objectives? It seems the ring buffer needs to
be at least as large as the contiguous read size.
Thus you would need at least a 256k ring buffer. Better yet, have twice
the IO size as the ring buffer size, so two sessions can alternately take
the lead while the other session still blocks a prev page. Modern L2
cache is 8 MB, so 512k seems no problem ?

Andreas


Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> >> What do you mean with using readahead inside the heapscan?
> >> Starting an async read request?
> >
> > Nope - just reading N buffers ahead for seqscans.  Subsequent calls
> > use previously read pages.  The objective is to issue
> contiguous reads
> > to the OS in sizes greater than the PG page size (which is much
> > smaller than what is needed for fast sequential I/O).
>
> Are you filling multiple buffers in the buffer cache with a
> single read-call?

yes, needs vector or ScatterGather IO.

> The OS should be doing readahead for us
> anyway, so I don't see how just issuing multiple ReadBuffers
> one after each other helps.

Last time I looked OS readahead was only comparable to 32k blocked
reads. 256k blocked reads still perform way better. Also when the OS is
confronted with an IO storm the 256k reads perform way better than OS
readahead.

Andreas


Re: Seq scans roadmap

From
Gregory Stark
Date:
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:

>> Are you filling multiple buffers in the buffer cache with a 
>> single read-call?
>
> yes, needs vector or ScatterGather IO.

I would expect that to get only moderate improvement. To get the full benefit
I would think you would want to either fire off a separate thread to do the
read-ahead, use libaio, or funnel the read-ahead requests to a separate thread
like our bgwriter, only it would be a bgreader or something like that.
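
As a sketch of the async flavor, here's roughly what a prefetch with
plain POSIX AIO might look like; a bgreader would essentially issue the
same requests on the backends' behalf. This is illustrative, not a
proposal for the actual interface:

#include <aio.h>
#include <string.h>

#define BLCKSZ 8192

/* Kick off a read-ahead for the next chunk and return immediately,
 * so we can keep scanning the current chunk meanwhile. */
void
start_prefetch(struct aiocb *cb, int fd, void *buf, off_t blockno, int nblocks)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = (size_t) nblocks * BLCKSZ;
    cb->aio_offset = blockno * BLCKSZ;
    aio_read(cb);               /* queues the read, returns at once */
}

/* Wait for the prefetch to finish; returns bytes read, or -1 on error. */
int
finish_prefetch(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };

    aio_suspend(list, 1, NULL); /* block only if not yet complete */
    return aio_error(cb) == 0 ? (int) aio_return(cb) : -1;
}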

>> The OS should be doing readahead for us 
>> anyway, so I don't see how just issuing multiple ReadBuffers 
>> one after each other helps.
>
> Last time I looked OS readahead was only comparable to 32k blocked reads.
> 256k blocked reads still perform way better. Also when the OS is confronted
> with an IO storm the 256k reads perform way better than OS readahead.

Well that's going to depend on the OS. Last I checked Linux's readahead logic
is pretty straightforward and doesn't try to do any better than 32k readahead
and is easily fooled. However I wouldn't be surprised if that's changed. 

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Seq scans roadmap

From
Jeff Davis
Date:
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> I'm going to do this incrementally, and we'll see how far we get for 
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up 
> Simon's patch (step 1), run some performance tests with vacuum, and 
> submit a patch for that. Then I'll move to Jeff's patch (step 2).

I think it's best to postpone 3A (one aspect of my patch). There are a
couple reasons:

1) The results didn't show enough benefit yet.
2) Complex interactions with Simon's patch.

I think it's an area of research for the future, but for now I just want
to concentrate on the most important aspect of my patch: the
synchronized scanning (#2 in your list).

Regards,
Jeff Davis



Re: Seq scans roadmap

From
Jeff Davis
Date:
On Tue, 2007-05-08 at 07:47 -0400, Luke Lonergan wrote:
> Heikki,
> 
> On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
> implement dynamic I/O caching.  The experiments have shown that benefit
> of re-using PG buffer cache on large sequential scans is vanishingly
> small when the buffer cache size is small compared to the system memory.
> Since this is a normal and recommended situation (OS I/O cache is
> auto-tuning and easy to administer, etc), IMO the effort to optimize
> buffer cache reuse for seq scans > 1 x buffer cache size is not
> worthwhile.
> 

I think 3A is still an interesting idea, but I agree that it needs some
results to back it up. Let's save 3A for the next release so that we
have more time to see.

> To that point - an important factor in achieving max I/O rate for large
> tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
> This is commonly in the range of 512KB -> 2MB, which is only important
> when considering a bound on the size of the ring buffer.  The effect has
> been demonstrated to be significant - in the 20%+ range.  Another thing
> to consider is the use of readahead inside the heapscan, in which case
> sizes >= 32KB are very effective.

One thing I'd like us to keep in mind is to have a reasonable number of
buffers active for a sequential scan. If the number is too small, my
sync scans might not even work. Right now my patch only reports every 16
pages, so 32KB (=4 pages) is not going to work for sync scans (I suppose
only testing will tell).

Regards,
Jeff Davis



Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> >> Are you filling multiple buffers in the buffer cache with a single
> >> read-call?
> >
> > yes, needs vector or ScatterGather IO.
>
> I would expect that to get only moderate improvement.

The vast improvement comes from 256k blocksize.

> To get
> the full benefit I would think you would want to either fire
> off a separate thread to do the read-ahead, use libaio, or
> funnel the read-ahead requests to a separate thread like our
> bgwriter only it would be a bgreader or something like that.

I like bgreader :-) But that looks even more difficult than grabbing 32
[scattered or contiguous] buffers at once.
Especially in a situation where there is no concurrent load it would be
nice to do CPU work while waiting for the next read ahead IO. If there
is enough parallel CPU load it is actually not so important. So I'd say
that on a high-load server you get nearly all of the benefit without any
sort of aio.

> >> The OS should be doing readahead for us anyway, so I don't see how
> >> just issuing multiple ReadBuffers one after each other helps.
> >
> > Last time I looked OS readahead was only comparable to 32k blocked
> > reads.
> > 256k blocked reads still perform way better. Also when the OS is
> > confronted with an IO storm the 256k reads perform way better than
> > OS readahead.
>
> Well that's going to depend on the OS. Last I checked Linux's
> readahead logic is pretty straightforward and doesn't try to
> do any better than 32k readahead and is easily fooled.
> However I wouldn't be surprised if that's changed.

My test was on AIX, 32 or 64k seem quite common, at least as default
setting.
Also on some OS's (like HPUX) OS readahead and writebehind strategy
changes with large IO blocksizes, imho beneficially.

Andreas


Re: Seq scans roadmap

From
"Simon Riggs"
Date:
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> Here's my roadmap for the "scan-resistant buffer cache" and 
> "synchronized scans" patches.
> 
> 1. Fix the current vacuum behavior of throwing dirty buffers to the 
> freelist, forcing a lot of WAL flushes. Instead, use a backend-private 
> ring of shared buffers that are recycled. This is what Simon's 
> "scan-resistant buffer manager" did.
> 
> The theory here is that if a page is read in by vacuum, it's unlikely to 
> be accessed in the near future, therefore it should be recycled. If 
> vacuum doesn't dirty the page, it's best to reuse the buffer immediately 
> for the next page. However, if the buffer is dirty (and not just because 
> we set hint bits), we ought to delay writing it to disk until the 
> corresponding WAL record has been flushed to disk.
> 
> Simon's patch used a fixed size ring of buffers that are recycled, but I 
> think the ring should be dynamically sized. Start with a small ring, and 
> whenever you need to do a WAL flush to write a dirty buffer, increase 
> the ring size. On every full iteration through the ring, decrease its 
> size to trim down an unnecessarily large ring.
> 
> This only alters the behavior of vacuums, and it's pretty safe to say it 
> won't get worse than what we have now. 

I think that's too much code; why not just leave it as it is? Would a
dynamic buffer be substantially better? If not, why bother?

> In the future, we can use the 
> buffer ring for seqscans as well; more on that on step 3.

There was clear benefit for that. You sound like you are suggesting to
remove the behaviour for Seq Scans, which wouldn't make much sense??

> 2. Implement the list/table of last/ongoing seq scan positions. This is 
> Jeff's "synchronized scans" patch. When a seq scan starts on a table 
> larger than some threshold, it starts from where the previous seq scan 
> is currently, or where it ended. This will synchronize the scans so that 
> for two concurrent scans the total I/O is halved in the best case. There 
> should be no other effect on performance.
> 
> If you have a partitioned table, or union of multiple tables or any 
> other plan where multiple seq scans are performed in arbitrary order, 
> this change won't change the order the partitions are scanned and won't 
> therefore ensure they will be synchronized.
> 
> 
> Now that we have both pieces of the puzzle in place, it's time to 
> consider what more we can do with them:
> 
> 
> 3A. To take advantage of the "cache trail" of a previous seq scan, scan 
> backwards from where the previous seq scan ended, until a you hit a 
> buffer that's not in cache.
> 
> This will allow taking advantage of the buffer cache even if the table 
> doesn't fit completely in RAM. That can make a big difference if the 
> table size is just slightly bigger than RAM, and can avoid the nasty 
> surprise when a table grows beyond RAM size and queries start taking 
> minutes instead of seconds.
> 
> This should be a non-controversial change on its own from performance 
> point of view. No query should get slower, and some will become faster. 
> But see step 3B:
> 
> 3B. Currently, sequential scans on a large table spoils the buffer cache 
> by evicting other pages from the cache. In CVS HEAD, as soon as the 
> table is larger than shared_buffers, the pages in the buffer won't be 
> used to speed up running the same query again, and there's no reason to 
> believe the pages read in would be more useful than any other page in 
> the database, and in particular the pages that were in the buffer cache 
> before the huge seq scan. If the table being scanned is > 5 * 
> shared_buffers, the scan will evict every other page from the cache if 
> there's no other activity in the database (max usage_count is 5).
> 
> If the table is much larger than shared_buffers, say 10 times as large, 
> even with the change 3B to read the pages that are in cache first, using 
> all shared_buffers to cache the table will only speed up the query by 
> 10%. We should not spoil the cache for such a small gain, and use the 
> local buffer ring strategy instead. It's better to make queries that are 
> slow anyway a little bit slower, than making queries that are normally 
> really fast, slow.
> 
> 
> As you may notice, 3A and 3B are at odds with each other. We can 
> implement both, but you can't use both strategies in the same scan.

Not sure I've seen any evidence of that.

Most scans will be solo and so should use the ring buffer, since there
is clear evidence of that. If there were evidence to suggest the two
patches conflict then we should turn off the ring buffer only when
concurrent scans are in progress (while remembering that concurrent
scans will not typically overlap as much as the synch scan tests show
and so for much of their execution they too will be solo).

> Therefore we need to have decision logic of some kind to figure out 
> which strategy is optimal.
> 
> A simple heuristic is to decide based on the table size:
> 
> < 0.1*shared_buffers    -> start from page 0, keep in cache (like we do now)
> < 5 * shared_buffers    -> start from last read page, keep in cache
>  > 5 * shared_buffers    -> start from last read page, use buffer ring
> 
> I'm not sure about the constants, we might need to make them GUC 
> variables as Simon argued, but that would be the general approach.

If you want to hardcode it, I'd say use the ring buffer on scans of 1000
blocks or more, or we have a GUC. Sizing things to shared_buffers isn't
appropriate because of the effects of partitioning, as I argued in my
last post, which I think is still valid.

> Thoughts? Everyone happy with the roadmap?

I think separating the patches is now the best way forward, though both
are good.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Seq scans roadmap

From
"CK Tan"
Date:
Hi,

In reference to the seq scans roadmap, I have just submitted a patch  
that addresses some of the concerns.

The patch does this:

1. for small relation (smaller than 60% of bufferpool), use the
current logic
2. for big relation:
   - use a ring buffer in heap scan
   - pin first 12 pages when scan starts
   - on consumption of every 4-page, read and pin the next 4-page
   - invalidate used pages in the scan so they do not force out
     other useful pages

4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h

If there is interest, I can submit another scan patch that returns 
N tuples at a time, instead of the current one-at-a-time interface. This 
improves code locality and further improves performance by another 
10-20%.

For TPCH 1G tables, we are seeing more than 20% improvement in scans  
on the same hardware.

-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2117.025 ms

-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2722.441 ms


Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.





Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> In reference to the seq scans roadmap, I have just submitted
> a patch that addresses some of the concerns.
>
> The patch does this:
>
> 1. for small relation (smaller than 60% of bufferpool), use
> the current logic 2. for big relation:
>     - use a ring buffer in heap scan
>     - pin first 12 pages when scan starts
>     - on consumption of every 4-page, read and pin the next 4-page
>     - invalidate used pages of in the scan so they do not
> force out other useful pages

A few comments regarding the effects:

I do not see how this speedup could be caused by readahead, so what are
the effects ?
(It should make no difference to do the CPU work for count(*) in between
reading each block when the pages are not dirtied.)
Is the improvement solely reduced CPU, because no search for a free
buffer is needed and/or because of L2 cache locality ?

What effect does the advance pinning have, avoid vacuum ?

A 16 x 8k page ring is too small to allow the needed IO blocksize of
256k.
The readahead is done 4 x one page at a time (=32k).
What is the reasoning behind 1/4 ring for readahead (why not 1/2), is
3/4 the trail for followers and bgwriter ?

I think in anticipation of doing a single IO call for more than one
page, the KillAndReadBuffer function should be split into two parts: one
that does the killing for n pages, and one that does the reading for n
pages. Killing n before reading n would also have the positive effect of
grouping perhaps-needed writes (not interleaving them with the reads).

I think the 60% Nbuffers is a very good starting point. I would only
introduce a GUC when we see evidence that it is needed (I agree with
Simon's partitioning comments, but I'd still wait and see).

Andreas


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Zeugswetter Andreas ADI SD wrote:
>> In reference to the seq scans roadmap, I have just submitted 
>> a patch that addresses some of the concerns.
>>
>> The patch does this:
>>
>> 1. for small relation (smaller than 60% of bufferpool), use 
>> the current logic 2. for big relation:
>>     - use a ring buffer in heap scan
>>     - pin first 12 pages when scan starts
>>     - on consumption of every 4-page, read and pin the next 4-page
>>     - invalidate used pages of in the scan so they do not 
>> force out other useful pages
> 
> A few comments regarding the effects:
> 
> I do not see how this speedup could be caused by readahead, so what are
> the effects ?

I was wondering that as well. We'd really need to test all the changes 
separately to see where the improvements are really coming from.

Also, that patch doesn't address the VACUUM issue at all. And using a 
small fixed size ring with scans that do updates can be devastating. I'm 
experimenting with different ring sizes for COPY at the moment. Too 
small a ring leads to a lot of WAL flushes; it's basically the same 
problem we have with VACUUM in CVS HEAD.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> Also, that patch doesn't address the VACUUM issue at all. And
> using a small fixed size ring with scans that do updates can
> be devastating. I'm experimenting with different ring sizes
> for COPY at the moment. Too small ring leads to a lot of WAL
> flushes, it's basically the same problem we have with VACUUM
> in CVS HEAD.

My first take on that would be to simply abandon any dirty (and actually
also any still pinned) buffer from the ring and replace the ring slot
with a buffer from the freelist.
If the freelist is empty and LSN allows writing the buffer, write it
(and maybe try to group these).
If the LSN does not allow the write, replace the slot with a buffer from
LRU.

Andreas


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Zeugswetter Andreas ADI SD wrote:
>> Also, that patch doesn't address the VACUUM issue at all. And 
>> using a small fixed size ring with scans that do updates can 
>> be devastating. I'm experimenting with different ring sizes 
>> for COPY at the moment. Too small ring leads to a lot of WAL 
>> flushes, it's basically the same problem we have with VACUUM 
>> in CVS HEAD.
> 
> My first take on that would be to simply abandon any dirty (and actually
> also any still pinned) buffer from the ring and replace the ring slot
> with a buffer from the freelist.
> If the freelist is empty and LSN allows writing the buffer, write it
> (and maybe try to group these).
> If the LSN does not allow the write, replace the slot with a buffer from
> LRU.

That would effectively disable the ring for COPY and the 2nd phase of 
VACUUM.

One problem with looking at the LSN is that you need the content lock to 
read it, and I wouldn't want to add any new locking. It could be done 
inside FlushBuffer when we hold the lock anyway, but I'm afraid the 
changes would be pretty invasive.

I'm struggling to get a grip of what the optimal ring size is under 
various circumstances. Some thoughts I have this far:
- a small ring gives better L2 cache behavior
- for read-only queries, and for queries that just set hint bits, 1 buffer 
is enough
- small ring with query that writes WAL (COPY, mass updates, 2nd phase 
of VACUUM) leads to a lot of WAL flushes, which can become bottleneck.

But all these assumptions need to be validated. I'm setting up tests 
with different ring sizes and queries to get a clear picture of this:
- VACUUM on a clean table
- VACUUM on a table with 1 dead tuple per page
- read-only scan, large table
- read-only scan, table fits in OS cache
- COPY

In addition, I'm going to run VACUUM in a DBT-2 test to see the effect 
on other queries running concurrently.

I think a ring that grows when WAL flushes occur covers all the use 
cases reasonably well, but I need to do the testing...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> But all these assumptions need to be validated. I'm setting up tests 
> with different ring sizes and queries to get a clear picture of this:
> - VACUUM on a clean table
> - VACUUM on a table with 1 dead tuple per page
> - read-only scan, large table
> - read-only scan, table fits in OS cache
> - COPY

Just to keep you guys informed, here are my results on a read-only scan on 
a table bigger than shared_buffers but smaller than RAM:

  select-1    | 00:00:10.853831
  select-1    | 00:00:10.380667
  select-1    | 00:00:11.530528
  select-2    | 00:00:08.634105
  select-2    | 00:00:02.674084
  select-4    | 00:00:02.65664
  select-8    | 00:00:02.662922
  select-16   | 00:00:02.682475
  select-32   | 00:00:02.693163
  select-64   | 00:00:02.722031
  select-128  | 00:00:02.873645
  select-256  | 00:00:03.185586
  select-512  | 00:00:03.534285
  select-1024 | 00:00:03.741867

lshw utility tells me that this server has 32KB of L1 cache and 4MB of 
L2 cache. The performance starts to drop between 64-128 buffers, which 
is 512 - 1024 KB, so I'm not sure how it's related to cache size but 
using a small number of buffers is clearly better than using a large number.

However, it caught me by total surprise that the performance with 1 
buffer is so horrible. Using 2 buffers is enough to avoid whatever the 
issue is with just 1 buffer. I have no idea what's causing that. There 
must be some interaction that I don't understand.

All the numbers are quite repeatable; I ran the same test script many 
times. The runtime of the first select-2 test, however, varied between 
3-10 seconds; somehow the bad karma from using just 1 buffer in the 
earlier test carries over to the next test.

I'm not sure what to think about this, but I'll set up more test 
scenarios with VACUUM and COPY.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> However, it caught me by total surprise that the performance with 1 
> buffer is so horrible. Using 2 buffers is enough to avoid whatever the 
> issue is with just 1 buffer. I have no idea what's causing that. There 
> must be some interaction that I don't understand.

Ok, I found the reason for that. I was using this query for the selects:
SELECT COUNT(*) FROM (SELECT 1 FROM stock_copytest LIMIT 10000000) AS a;

Stock_copytest is larger than RAM size, which is why I used the LIMIT to 
make the result set memory resident. That had the side effect that the 
Limit node apparently kept the single buffer pinned, which defeated 
the buffer ring completely. To avoid issues like that, we apparently want 
to use 2-4 buffers instead of just 1.
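
A sketch of the defensive behavior this suggests, with
buffer_pin_count() and alloc_shared_buffer() as hypothetical helpers:
if the ring slot's buffer is still pinned, take a fresh buffer instead
of recycling it:

extern int buffer_pin_count(int buf);  /* hypothetical */
extern int alloc_shared_buffer(void);  /* hypothetical */

typedef struct
{
    int *buffers;  /* shared-buffer ids owned by the ring */
    int  size;
    int  next;
} BufferRing;

/*
 * Never recycle a ring slot whose buffer is still pinned (e.g. by a
 * Limit node above the scan); swap a fresh buffer into the slot and
 * leave the pinned one to the normal replacement policy.
 */
int
ring_next_buffer(BufferRing *ring)
{
    int buf = ring->buffers[ring->next];

    if (buffer_pin_count(buf) > 0)
    {
        buf = alloc_shared_buffer();
        ring->buffers[ring->next] = buf;
    }
    ring->next = (ring->next + 1) % ring->size;
    return buf;
}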

I'll review my test methodology and keep testing...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"CK Tan"
Date:
The patch has no effect on scans that do updates. The  
KillAndReadBuffer routine does not force out a buffer if the dirty  
bit is set. So updated pages revert to the current performance  
characteristics.

-cktan
GreenPlum, Inc.





Re: Seq scans roadmap

From
"CK Tan"
Date:
Sorry, a 16 x 8K page ring is too small indeed. The reason we selected 16
is that Greenplum DB runs on a 32K page size, so we are indeed
reading 128K at a time. The #pages in the ring should be made
relative to the page size, so you achieve 128K per read.

Also agree that KillAndReadBuffer could be split into a
KillPinDontRead(), and ReadThesePinnedPages() functions. However, we
are thinking of AIO and would rather see a ReadNPagesAsync() function.

-cktan
Greenplum, Inc.





Re: Seq scans roadmap

From
Zeugswetter Andreas ADI SD
Date:
> Sorry, 16x8K page ring is too small indeed. The reason we
> selected 16 is because greenplum db runs on 32K page size, so
> we are indeed reading 128K at a time. The #pages in the ring
> should be made relative to the page size, so you achieve 128K
> per read.

Ah, ok. New disks here also have a peak at 128k with no other concurrent
IO.
Writes benefit from larger blocksizes though, 512k and more.
Reads with other concurrent IO might also benefit from larger
blocksizes.

Comment to all: to test optimal blocksizes make sure you have other
concurrent IO on the disk.

> Also agree that KillAndReadBuffer could be split into a
> KillPinDontRead(), and ReadThesePinnedPages() functions.
> However, we are thinking of AIO and would rather see a
> ReadNPagesAsync() function.

Yes, you could start the aio and return an already read buffer to allow
concurrent cpu work.
However, you would still want to do blocked aio_readv calls to make sure
the physical read uses the large blocksize.
So I'd say aio would benefit from the same split.

In another posting you wrote:
> The patch has no effect on scans that do updates.
> The KillAndReadBuffer routine does not force out a buffer if
> the dirty bit is set. So updated pages revert to the current
> performance characteristics.

Yes I see, the ring slot is replaced by a standard ReadBuffer in that
case, looks good.

I still think it would be better to write out the buffers and keep them
in the ring when possible, but that seems to need locks and some sort of
synchronization with the new walwriter, so looks like a nice project for
after 8.3.

Andreas


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
I wrote:
> I'll review my test methodology and keep testing...

I ran a set of tests on a 100 warehouse TPC-C stock table that is ~3.2
GB in size and the server has 4 GB of memory. IOW the table fits in OS
cache, but not in shared_buffers (set at 1 GB).

copy - COPY from a file
select - SELECT COUNT(*) FROM stock
vacuum - VACUUM on a clean table, effectively a read-only operation
vacuum_hintbits - VACUUM on a table with no dead tuples, but hint bits
need to be set on every page
vacuum_dirty - VACUUM with exactly 1 dead tuple per page

The number after the test name is the ring size used.

There were no indexes on the table, which means that the vacuum tests
only had to do one pass. The 1st vacuum phase of a real-world table is
like a mixture of vacuum- and vacuum_hintbits-tests, and 2nd phase is
like the vacuum_dirty test.

  copy-1            | 00:31:47.042365
  copy-2            | 00:17:57.630772
  copy-4            | 00:17:55.041794
  copy-8            | 00:08:31.014009
  copy-16           | 00:05:38.39848
  copy-32           | 00:05:52.295512
  copy-64           | 00:06:08.404646
  copy-128          | 00:05:05.032448
  copy-256          | 00:05:48.573146
  copy-512          | 00:04:56.098752
  copy-1024         | 00:05:27.05316
  select-4          | 00:00:04.344873
  select-4          | 00:00:02.2498
  select-1          | 00:00:08.754011
  select-1          | 00:00:10.521174
  select-1          | 00:00:10.819376
  select-1          | 00:00:14.818831
  select-1          | 00:00:14.893562
  select-1          | 00:00:16.973934
  select-2          | 00:00:15.722776
  select-2          | 00:00:02.291078
  select-2          | 00:00:02.230167
  select-4          | 00:00:02.232935
  select-8          | 00:00:02.238791
  select-16         | 00:00:02.245566
  select-32         | 00:00:02.267158
  select-64         | 00:00:02.311878
  select-128        | 00:00:02.487086
  select-256        | 00:00:02.764085
  select-512        | 00:00:03.161025
  select-1024       | 00:00:03.387246
  vacuum-1          | 00:00:01.843337
  vacuum-2          | 00:00:01.612738
  vacuum-4          | 00:00:01.6304
  vacuum-8          | 00:00:01.655126
  vacuum-16         | 00:00:01.641808
  vacuum-32         | 00:00:01.664108
  vacuum-64         | 00:00:01.729106
  vacuum-128        | 00:00:01.879023
  vacuum-256        | 00:00:02.218303
  vacuum-512        | 00:00:02.569571
  vacuum-1024       | 00:00:02.791995
  vacuum_dirty-1    | 00:24:15.424337
  vacuum_dirty-2    | 00:13:26.981835
  vacuum_dirty-4    | 00:08:07.260113
  vacuum_dirty-8    | 00:05:24.1476
  vacuum_dirty-16   | 00:03:52.690336
  vacuum_dirty-32   | 00:02:40.759203
  vacuum_dirty-64   | 00:02:45.14425
  vacuum_dirty-128  | 00:02:46.718922
  vacuum_dirty-256  | 00:02:43.797785
  vacuum_dirty-512  | 00:02:36.363763
  vacuum_dirty-1024 | 00:02:32.767481
  vacuum_hintbits-1    | 00:00:37.847935
  vacuum_hintbits-2    | 00:00:38.788662
  vacuum_hintbits-4    | 00:00:43.554029
  vacuum_hintbits-8    | 00:00:42.040379
  vacuum_hintbits-16   | 00:00:44.187508
  vacuum_hintbits-32   | 00:00:38.252052
  vacuum_hintbits-64   | 00:00:37.920379
  vacuum_hintbits-128  | 00:00:38.463007
  vacuum_hintbits-256  | 00:00:38.157724
  vacuum_hintbits-512  | 00:00:38.309285
  vacuum_hintbits-1024 | 00:00:39.178738

I ran some of the select tests multiple times because the behavior
changed when the test was repeated. I don't know what's going on in the
select-1 test, it looks like the same effect I had with the more complex
query involving a LIMIT-node, but this time I'm just doing a plain
SELECT COUNT(*). I ran the test script multiple times; the results shown
above are copy-pasted from one particular run but the numbers didn't
change much from run to run. In particular, the run times for the
select-1 test really do increase as you repeat the test many times. The
copy results seem to vary quite a bit, though.

For comparison, here's the test results with vanilla CVS HEAD:

  copy-head         | 00:06:21.533137
  copy-head         | 00:05:54.141285
  select-head       | 00:00:16.213693
  select-head       | 00:00:18.500792
  vacuum-head       | 00:00:12.843479
  vacuum-head       | 00:00:08.719845
  vacuum_dirty-head | 00:22:02.533553
  vacuum_dirty-head | 00:22:02.852786
  vacuum_hintbits-head | 00:00:38.278701
  vacuum_hintbits-head | 00:00:35.226191

Looking at the results, it seems that using a fixed-size ring of 32
pages hits the sweet spot on all tests. I wonder if that holds on other
hardware.

The test scripts I used are attached. I used a modified DBT-2 schema and
dump file, so you'll need to replace that with some other large table to
run it. I would appreciate it if others would repeat the tests on other
hardware to get a bigger sample.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
/*
drop table if exists stock100;
create table stock100
(
  s_i_id        integer
, s_w_id        smallint
, s_quantity    smallint
, s_order_cnt   smallint            -- not listed as a monetary value
, s_remote_cnt  smallint            -- not listed as a monetary value
, s_ytd         integer             -- not listed as a monetary value
, s_dist_01     char(24)
, s_dist_02     char(24)
, s_dist_03     char(24)
, s_dist_04     char(24)
, s_dist_05     char(24)
, s_dist_06     char(24)
, s_dist_07     char(24)
, s_dist_08     char(24)
, s_dist_09     char(24)
, s_dist_10     char(24)
, s_data        text                -- varchar(50)
);


drop table if exists testresult;
CREATE TABLE testresult (
  description text NOT NULL,
  begints timestamp DEFAULT (now()) NOT NULL,
  endts timestamp);
*/
---
/*
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('copy-1');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

---

TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 2;
INSERT INTO testresult (description) VALUES ('copy-2');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

---

TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 4;
INSERT INTO testresult (description) VALUES ('copy-4');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

---

TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 8;
INSERT INTO testresult (description) VALUES ('copy-8');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

---

TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 16;
INSERT INTO testresult (description) VALUES ('copy-16');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 32;
INSERT INTO testresult (description) VALUES ('copy-32');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 64;
INSERT INTO testresult (description) VALUES ('copy-64');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 128;
INSERT INTO testresult (description) VALUES ('copy-128');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 256;
INSERT INTO testresult (description) VALUES ('copy-256');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 512;
INSERT INTO testresult (description) VALUES ('copy-512');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;
----
TRUNCATE stock100; CHECKPOINT;
SET scan_recycle_buffers = 1024;
INSERT INTO testresult (description) VALUES ('copy-1024');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;
*/

----
/*
SELECT COUNT(*) FROM stock100; -- set hint bits

CHECKPOINT;
SET scan_recycle_buffers = 4;
INSERT INTO testresult (description) VALUES ('select-4');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 4;
INSERT INTO testresult (description) VALUES ('select-4');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;



CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('select-1');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 2;
INSERT INTO testresult (description) VALUES ('select-2');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
SET scan_recycle_buffers = 2;
INSERT INTO testresult (description) VALUES ('select-2');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;



CHECKPOINT;
SET scan_recycle_buffers = 2;
INSERT INTO testresult (description) VALUES ('select-2');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 4;
INSERT INTO testresult (description) VALUES ('select-4');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 8;
INSERT INTO testresult (description) VALUES ('select-8');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 16;
INSERT INTO testresult (description) VALUES ('select-16');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 32;
INSERT INTO testresult (description) VALUES ('select-32');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 64;
INSERT INTO testresult (description) VALUES ('select-64');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 128;
INSERT INTO testresult (description) VALUES ('select-128');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 256;
INSERT INTO testresult (description) VALUES ('select-256');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 512;
INSERT INTO testresult (description) VALUES ('select-512');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

CHECKPOINT;
SET scan_recycle_buffers = 1024;
INSERT INTO testresult (description) VALUES ('select-1024');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;
*/
----
/*
------- VACUUM tests -------

CHECKPOINT;
SET scan_recycle_buffers = 1;
INSERT INTO testresult (description) VALUES ('vacuum-1');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 2;
INSERT INTO testresult (description) VALUES ('vacuum-2');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 4;
INSERT INTO testresult (description) VALUES ('vacuum-4');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 8;
INSERT INTO testresult (description) VALUES ('vacuum-8');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 16;
INSERT INTO testresult (description) VALUES ('vacuum-16');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 32;
INSERT INTO testresult (description) VALUES ('vacuum-32');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 64;
INSERT INTO testresult (description) VALUES ('vacuum-64');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 128;
INSERT INTO testresult (description) VALUES ('vacuum-128');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 256;
INSERT INTO testresult (description) VALUES ('vacuum-256');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 512;
INSERT INTO testresult (description) VALUES ('vacuum-512');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----
CHECKPOINT;
SET scan_recycle_buffers = 1024;
INSERT INTO testresult (description) VALUES ('vacuum-1024');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 1;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-1');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 2;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-2');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 4;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-4');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 8;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-8');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 16;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-16');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 32;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-32');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 64;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-64');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 128;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-128');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 256;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-256');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 512;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-512');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
SET scan_recycle_buffers = 1024;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
SET scan_recycle_buffers = 1024;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-1024');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

*/

SET scan_recycle_buffers = 1024;
DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;


SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 1;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-1');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 2;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-2');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 4;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-4');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 8;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-8');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 16;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-16');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 32;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-32');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 64;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-64');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 128;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-128');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 256;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-256');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 512;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-512');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

SET scan_recycle_buffers = 1024;
BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
SET scan_recycle_buffers = 1024;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-1024');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


SELECT description, endts-begints FROM testresult;

/*
drop table if exists stock100;
create table stock100
(
  s_i_id        integer
, s_w_id        smallint
, s_quantity    smallint
, s_order_cnt   smallint            -- not listed as a monetary value
, s_remote_cnt  smallint            -- not listed as a monetary value
, s_ytd         integer             -- not listed as a monetary value
, s_dist_01     char(24)
, s_dist_02     char(24)
, s_dist_03     char(24)
, s_dist_04     char(24)
, s_dist_05     char(24)
, s_dist_06     char(24)
, s_dist_07     char(24)
, s_dist_08     char(24)
, s_dist_09     char(24)
, s_dist_10     char(24)
, s_data        text                -- varchar(50)
);

-- drop table if exists testresult;
CREATE TABLE testresult (
  description text NOT NULL,
  begints timestamp DEFAULT (now()) NOT NULL,
  endts timestamp);

---

TRUNCATE stock100; CHECKPOINT;
INSERT INTO testresult (description) VALUES ('copy-head');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;

---

TRUNCATE stock100; CHECKPOINT;
INSERT INTO testresult (description) VALUES ('copy-head');
COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data';
UPDATE testresult SET endts = now() WHERE endts IS NULL;


SELECT COUNT(*) FROM stock100; -- set hint bits

CHECKPOINT;
INSERT INTO testresult (description) VALUES ('select-head');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

CHECKPOINT;
INSERT INTO testresult (description) VALUES ('select-head');
SELECT COUNT(*) FROM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

------- VACUUM tests -------

CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum-head');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum-head');
VACUUM stock100;
UPDATE testresult SET endts = now() WHERE endts IS NULL;

----

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-head');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;
DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_dirty-head');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;
*/

DROP TABLE IF EXISTS stock100_copy;
SELECT * INTO stock100_copy FROM stock100;

BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-head');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK;
CHECKPOINT;
INSERT INTO testresult (description) VALUES ('vacuum_hintbits-head');
VACUUM VERBOSE stock100_copy;
UPDATE testresult SET endts = now() WHERE endts IS NULL;


SELECT description, endts-begints FROM testresult;

Re: Seq scans roadmap

From
"Simon Riggs"
Date:
On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
> For comparison, here's the test results with vanilla CVS HEAD:
> 
>   copy-head         | 00:06:21.533137
>   copy-head         | 00:05:54.141285 

I'm slightly worried that the results for COPY aren't anywhere near as
good as the SELECT and VACUUM results. It isn't clear from those numbers
that the benefit really is significant.

Are you thinking that having COPY avoid cache spoiling is a benefit just
by itself? Or do you see a pattern of benefit from your other runs?

(BTW what was wal_buffers set to? At least twice the ring buffer size,
hopefully).

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
Hi Simon,

On 5/12/07 12:35 AM, "Simon Riggs" <simon@enterprisedb.com> wrote:

> I'm slightly worried that the results for COPY aren't anywhere near as
> good as the SELECT and VACUUM results. It isn't clear from those numbers
> that the benefit really is significant.

COPY is bottlenecked on datum formation and format translation with very low
performance, so I don't think we should expect the ring buffer to make much
of a dent.

- Luke 




Re: Seq scans roadmap

From
"CK Tan"
Date:
Hi All,

COPY/INSERT are also bottlenecked on record-at-a-time insertion into
the heap, and on checking for pre-insert triggers, post-insert triggers
and constraints.

To speed things up, we really need to special-case insertions without
triggers and constraints [probably still allow for unique constraints],
and make these insertions go into the heap N tuples at a time. With
this change comes the benefit of optimizing the redo log to record
multiple inserts, or even a whole new heap page that gets filled, in a
single WAL record.

Insertions into tables with triggers and other constraints would still
have to go in one at a time because of the trigger/constraint semantics.
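
To illustrate what I mean, here is a rough sketch (invented names and
capacities, not actual PostgreSQL code): buffer incoming tuples and emit
one WAL record per filled heap page, instead of one record per tuple.

#define TUPLES_PER_PAGE 64      /* made-up page capacity */

typedef struct Tuple { char data[64]; } Tuple;

/* Stand-ins for heap/WAL internals (hypothetical): */
extern void write_heap_page(const Tuple *tuples, int n);
extern void wal_log_filled_page(const Tuple *tuples, int n); /* one record */

static Tuple pending[TUPLES_PER_PAGE];
static int   npending = 0;

/* Fast path for relations with no triggers or constraints: accumulate
 * tuples, then insert and WAL-log a whole page at a time. */
static void
multi_insert(const Tuple *tup)
{
    pending[npending++] = *tup;
    if (npending == TUPLES_PER_PAGE)
    {
        write_heap_page(pending, npending);
        wal_log_filled_page(pending, npending);
        npending = 0;
    }
}

/* Flush any partially filled page at the end of the COPY. */
static void
multi_insert_finish(void)
{
    if (npending > 0)
    {
        write_heap_page(pending, npending);
        wal_log_filled_page(pending, npending);
        npending = 0;
    }
}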

It seems to me that dirty pages should be written out by the bgwriter
instead of circumventing it with the ring buffer. If the bgwriter is too
slow, we should change it.

-cktan

On May 12, 2007, at 8:42 AM, Luke Lonergan wrote:

> Hi Simon,
>
> On 5/12/07 12:35 AM, "Simon Riggs" <simon@enterprisedb.com> wrote:
>
>> I'm slightly worried that the results for COPY aren't anywhere  
>> near as
>> good as the SELECT and VACUUM results. It isn't clear from those  
>> numbers
>> that the benefit really is significant.
>
> COPY is bottlenecked on datum formation and format translation with  
> very low
> performance, so I don't think we should expect the ring buffer to  
> make much
> of a dent.
>
> - Luke
>




Re: Seq scans roadmap

From
Tom Lane
Date:
"CK Tan" <cktan@greenplum.com> writes:
> COPY/INSERT are also bottlenecked on record at a time insertion into  
> heap, and in checking for pre-insert trigger, post-insert trigger and  
> constraints.

> To speed things up, we really need to special case insertions without  
> triggers and constraints, [probably allow for unique constraints],  

Do you have any profiling data to back up these assertions?  I haven't
noticed that firing zero triggers takes any visible percentage of COPY
time.
        regards, tom lane


Re: Seq scans roadmap

From
"CK Tan"
Date:
Sorry, I should have been clearer. I meant that because we need to check
for trigger firing pre/post insertion, and the trigger definitions
expect tuples to be inserted one by one, we cannot insert N
tuples at a time into the heap. Checking for triggers itself is not
taking up much CPU at all. If we could predetermine that there aren't
any triggers on a relation, inserts into that relation could then
follow a different path that inserts N tuples at a time.

Regards,
-cktan

On May 13, 2007, at 4:54 PM, Tom Lane wrote:

> "CK Tan" <cktan@greenplum.com> writes:
>> COPY/INSERT are also bottlenecked on record at a time insertion into
>> heap, and in checking for pre-insert trigger, post-insert trigger and
>> constraints.
>
>> To speed things up, we really need to special case insertions without
>> triggers and constraints, [probably allow for unique constraints],
>
> Do you have any profiling data to back up these assertions?  I haven't
> noticed that firing zero triggers takes any visible percentage of COPY
> time.
>
>             regards, tom lane
>




Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
>> For comparison, here's the test results with vanilla CVS HEAD:
>>
>>   copy-head         | 00:06:21.533137
>>   copy-head         | 00:05:54.141285 
> 
> I'm slightly worried that the results for COPY aren't anywhere near as
> good as the SELECT and VACUUM results. It isn't clear from those numbers
> that the benefit really is significant.

Agreed, the benefit isn't clear.

> Are you thinking that having COPY avoid cache spoiling is a benefit just
> by itself? Or do you see a pattern of benefit from your other runs?

I think it's worth having just to avoid cache spoiling. I wouldn't 
bother otherwise, but since we have the infrastructure for vacuum and 
large seqscans, we might as well use it for COPY as well.

> (BTW what was wal_buffers set to? At least twice the ring buffer size,
> hopefully).

Good question. [checks]. wal_buffers was set to 128KB. I tried raising 
it to 1MB, but it didn't make any difference.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Just to keep you guys informed, I've been busy testing and pondering 
over different buffer ring strategies for vacuum, seqscans and copy. 
Here's what I'm going to do:

Use a fixed-size ring: fixed meaning it doesn't change after the ring is
initialized; however, different kinds of scans use differently sized rings.

I said earlier that it'd be an invasive change to check whether a buffer
needs a WAL flush and choose another victim if so. I looked at it
again and found a pretty clean way of doing that, so I took that
approach for seq scans.

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to
give the L2 cache benefits and keep cache pollution low, but at the same
time it's large enough to keep the number of forced WAL flushes
reasonable (1/32 of what we do now).

2. For sequential scans, also use a ring of 32 buffers, but whenever a
buffer in the ring would need a WAL flush to recycle, we throw it out of
the buffer ring instead (sketched in code below). On read-only scans (and
scans that only update hint bits) this gives the L2 cache benefits and
doesn't pollute the buffer cache. On bulk updates, it's effectively the
current behavior. On scans that do some updates, it's something in
between. In all cases it should be no worse than what we have now. 32
buffers should be large enough to leave a "cache trail" for Jeff's
synchronized scans to work.

3. For COPY that doesn't write WAL, use the same strategy as for 
sequential scans. This keeps the cache pollution low and gives the L2 
cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We 
want to use a ring that can accommodate 1 WAL segment worth of data, to 
avoid having to do any extra WAL flushes, and the WAL segment size is 
2048 pages in the default configuration.
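
To illustrate cases 1 and 2, here is a rough sketch of the recycling
decision. This is not the actual patch code: the ring structure and the
three bufmgr helpers declared at the top are invented stand-ins.

#include <stdbool.h>

#define RING_SIZE 32

/* Stand-ins for buffer manager internals (hypothetical): */
extern bool buffer_is_dirty(int buf);
extern bool buffer_needs_wal_flush(int buf); /* page LSN > flushed WAL? */
extern void flush_wal_for_buffer(int buf);   /* XLogFlush-style call */

typedef struct BufferRing
{
    int  buffers[RING_SIZE];   /* buffer ids, -1 = empty slot */
    int  current;              /* slot to recycle next */
    bool evict_on_wal_flush;   /* true for seq scans, false for vacuum */
} BufferRing;

/*
 * Return a buffer to reuse for the next page, or -1, meaning: take a
 * buffer from the shared pool as usual and record it with ring_add().
 */
static int
ring_get_victim(BufferRing *ring)
{
    int buf = ring->buffers[ring->current];

    if (buf >= 0 && buffer_is_dirty(buf) && buffer_needs_wal_flush(buf))
    {
        if (ring->evict_on_wal_flush)
        {
            /* seq scan: drop the buffer from the ring instead of flushing */
            ring->buffers[ring->current] = -1;
            buf = -1;
        }
        else
        {
            /* vacuum: pay the WAL flush and reuse the buffer anyway */
            flush_wal_for_buffer(buf);
        }
    }

    if (buf >= 0)
        ring->current = (ring->current + 1) % RING_SIZE;
    return buf;
}

/* Remember a newly allocated shared buffer in the current ring slot. */
static void
ring_add(BufferRing *ring, int buf)
{
    ring->buffers[ring->current] = buf;
    ring->current = (ring->current + 1) % RING_SIZE;
}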

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them
in another fixed-size list (sketched after this list). When the list gets
full, do a WAL flush and put them back on the shared freelist or a
backend-private freelist. That would eliminate the cache pollution of
bulk DELETEs and bulk UPDATEs, and it could be used for vacuum as well.
I think this would be the optimal algorithm, but I don't feel like
inventing something that complicated at this stage anymore. Maybe for 8.4.

* Using a different-sized ring for the 1st and 2nd vacuum phases. Decided
that it's not worth the trouble; the above is already an order of
magnitude better than the current behavior.
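
For the record, the rejected side-list idea would look roughly like this
(invented names, illustrative only): dirtied buffers go to a fixed-size
list, and when it fills up, a single WAL flush up to the newest of the
accumulated LSNs makes all of them cheap to reuse.

#define DIRTY_LIST_SIZE 32

/* Stand-ins for internals (hypothetical): */
extern void flush_wal_once(void);       /* flush up to the newest LSN */
extern void push_to_freelist(int buf);  /* shared or backend-private */

typedef struct DirtyList
{
    int buffers[DIRTY_LIST_SIZE];
    int n;
} DirtyList;

static void
remember_dirty(DirtyList *list, int buf)
{
    list->buffers[list->n++] = buf;
    if (list->n == DIRTY_LIST_SIZE)
    {
        /* one flush covers every buffer accumulated so far */
        flush_wal_once();
        for (int i = 0; i < list->n; i++)
            push_to_freelist(list->buffers[i]);
        list->n = 0;
    }
}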


I'm going to rerun the performance tests I ran earlier with the new
patch, tidy it up a bit, and submit it in the next few days. This turned
out to be an even more laborious patch to review than I thought. While
the patch is short and in the end turned out to be very close to Simon's
original patch, there are many different usage scenarios that need to be
catered for and tested.

I still need to check the interaction with Jeff's patch. This is close 
enough to Simon's original patch that I believe the results of the tests 
Jeff ran earlier are still valid.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
Heikki,

32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
effect.

How about using 256/blocksize?

- Luke

> -----Original Message-----
> From: Heikki Linnakangas [mailto:hlinnaka@gmail.com] On
> Behalf Of Heikki Linnakangas
> Sent: Tuesday, May 15, 2007 2:32 AM
> To: PostgreSQL-development
> Cc: Simon Riggs; Zeugswetter Andreas ADI SD; CK.Tan; Luke
> Lonergan; Jeff Davis
> Subject: Re: [HACKERS] Seq scans roadmap
>
> Just to keep you guys informed, I've been busy testing and
> pondering over different buffer ring strategies for vacuum,
> seqscans and copy.
> Here's what I'm going to do:
>
> Use a fixed size ring. Fixed as in doesn't change after the
> ring is initialized, however different kinds of scans use
> differently sized rings.
>
> I said earlier that it'd be invasive change to see if a
> buffer needs a WAL flush and choose another victim if that's
> the case. I looked at it again and found a pretty clean way
> of doing that, so I took that approach for seq scans.
>
> 1. For VACUUM, use a ring of 32 buffers. 32 buffers is small
> enough to give the L2 cache benefits and keep cache pollution
> low, but at the same time it's large enough that it keeps the
> need to WAL flush reasonable
> (1/32 of what we do now).
>
> 2. For sequential scans, also use a ring of 32 buffers, but
> whenever a buffer in the ring would need a WAL flush to
> recycle, we throw it out of the buffer ring instead. On
> read-only scans (and scans that only update hint bit) this
> gives the L2 cache benefits and doesn't pollute the buffer
> cache. On bulk updates, it's effectively the current
> behavior. On scans that do some updates, it's something in
> between. In all cases it should be no worse than what we have
> now. 32 buffers should be large enough to leave a "cache
> trail" for Jeff's synchronized scans to work.
>
> 3. For COPY that doesn't write WAL, use the same strategy as
> for sequential scans. This keeps the cache pollution low and
> gives the L2 cache benefits.
>
> 4. For COPY that writes WAL, use a large ring of 2048-4096
> buffers. We want to use a ring that can accommodate 1 WAL
> segment worth of data, to avoid having to do any extra WAL
> flushes, and the WAL segment size is
> 2048 pages in the default configuration.
>
> Some alternatives I considered but rejected:
>
> * Instead of throwing away dirtied buffers in seq scans,
> accumulate them in another fixed sized list. When the list
> gets full, do a WAL flush and put them to the shared freelist
> or a backend-private freelist. That would eliminate the cache
> pollution of bulk DELETEs and bulk UPDATEs, and it could be
> used for vacuum as well. I think this would be the optimal
> algorithm but I don't feel like inventing something that
> complicated at this stage anymore. Maybe for 8.4.
>
> * Using a different sized ring for 1st and 2nd vacuum phase.
> Decided that it's not worth the trouble, the above is already
> an order of magnitude better than the current behavior.
>
>
> I'm going to rerun the performance tests I ran earlier with
> new patch, tidy it up a bit, and submit it in the next few
> days. This turned out to be even more laborious patch to
> review than I thought. While the patch is short and in the
> end turned out to be very close to Simon's original patch,
> there's many different usage scenarios that need to be
> catered for and tested.
>
> I still need to check the interaction with Jeff's patch. This
> is close enough to Simon's original patch that I believe the
> results of the tests Jeff ran earlier are still valid.
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
>
>



Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Luke Lonergan wrote:
> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> effect.
> 
> How about using 256/blocksize?

Sounds reasonable. We need to check the effect on the synchronized 
scans, though.
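
To spell out the arithmetic behind the suggestion: 256/blocksize keeps
the ring at a constant 256kB no matter what the block size is. A trivial
illustration (plain C, written just for this example):

#include <stdio.h>

int main(void)
{
    const int block_kb[] = {8, 16, 32};

    for (int i = 0; i < 3; i++)
    {
        int nbuffers = 256 / block_kb[i];
        printf("blocksize %2d kB -> ring of %2d buffers = %d kB\n",
               block_kb[i], nbuffers, nbuffers * block_kb[i]);
    }
    return 0;
}

So with 8kB blocks the ring stays at 32 buffers, but with 32kB blocks it
drops to 8 buffers, which is where the 8-buffer figure in the discussion
below comes from.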

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
Jeff Davis
Date:
On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
> Luke Lonergan wrote:
> > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> > effect.
> > 
> > How about using 256/blocksize?
> 
> Sounds reasonable. We need to check the effect on the synchronized 
> scans, though.
> 

I am a little worried that there will be greater differences in position
as the number of scans increases. If we have only 8 buffers and several
scans progressing, will they all be able to stay within a few buffers of
each other at any given time?

Also, with 8 buffers, that means each scan must report every 4 pages at
most (and maybe every page), which increases lock contention (the new
design Heikki and I discussed requires a lock every time a backend
reports its position).
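
Schematically, the reporting pattern looks like this (a sketch with
invented names; a pthread mutex stands in for the real shared-memory
lock, and the actual design keeps one entry per table):

#include <pthread.h>
#include <stdint.h>

#define REPORT_INTERVAL 4       /* pages between reports, e.g. half of
                                 * an 8-buffer ring */

typedef struct ScanPosition
{
    pthread_mutex_t lock;
    uint32_t        rel_oid;    /* table being scanned */
    uint32_t        blocknum;   /* last reported block */
} ScanPosition;

static ScanPosition scan_pos = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

/*
 * Called for every page a seq scan reads; it takes the lock and
 * publishes the position only every REPORT_INTERVAL pages, so a
 * smaller ring means more frequent reports and more lock traffic.
 */
static void
report_scan_position(uint32_t rel_oid, uint32_t blocknum)
{
    if (blocknum % REPORT_INTERVAL != 0)
        return;

    pthread_mutex_lock(&scan_pos.lock);
    scan_pos.rel_oid = rel_oid;
    scan_pos.blocknum = blocknum;
    pthread_mutex_unlock(&scan_pos.lock);
}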

Regards,
Jeff Davis



Re: Seq scans roadmap

From
"Jim C. Nasby"
Date:
On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
> > Luke Lonergan wrote:
> > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> > > effect.
> > > 
> > > How about using 256/blocksize?
> > 
> > Sounds reasonable. We need to check the effect on the synchronized 
> > scans, though.
> > 
> 
> I am a little worried that there will be greater differences in position
> as the number of scans increases. If we have only 8 buffers and several
> scans progressing, will they all be able to stay within a few buffers of
> each other at any given time?
> 
> Also, with 8 buffers, that means each scan must report every 4 pages at
> most (and maybe every page), which increases lock contention (the new
> design Heikki and I discussed requires a lock every time a backend
> reports its position).

Given that spoiling the L2 cache is a trivial cost compared to extra
physical IO, ISTM we should go with a largish ring for sync scans. What
do you think would be the ideal size? 32 buffers?
-- 
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Seq scans roadmap

From
Heikki Linnakangas
Date:
Jeff Davis wrote:
> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
>> Luke Lonergan wrote:
>>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
>>> effect.
>>>
>>> How about using 256/blocksize?
>> Sounds reasonable. We need to check the effect on the synchronized 
>> scans, though.
>>
> 
> I am a little worried that there will be greater differences in position
> as the number of scans increases. If we have only 8 buffers and several
> scans progressing, will they all be able to stay within a few buffers of
> each other at any given time?

I'm not sure. Needs testing. Assuming the scan leaves behind a cache 
trail in the OS cache as well, it might not be that bad if a scan 
joining the party starts a little bit behind.

> Also, with 8 buffers, that means each scan must report every 4 pages at
> most (and maybe every page), which increases lock contention (the new
> design Heikki and I discussed requires a lock every time a backend
> reports its position).

Keep in mind that processing a 32K page takes longer than processing an 
8K page.

But we'll see..

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Seq scans roadmap

From
"Zeugswetter Andreas ADI SD"
Date:
> > > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2
> > > > cache effect.

I'd say that in a scenario where 32k pages are indicated you will also
want larger-than-average L2 caches.

> > > >
> > > > How about using 256/blocksize?

The read-ahead uses 1/4 of the ring size. To the best of our knowledge,
this quarter needs to be 128kB for reading, which puts the ring at
512kB, so I'd say we need 512/blocksize.

Andreas


Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
I think the analysis on syncscan needs to take the external I/O cache into
account.  I believe it is not necessary or desirable to keep the scans in
lock step within the PG bufcache.

The main benefit of a sync scan will be the ability to start the scan where
other scans have already filled the I/O cache with useful blocks.  This will
require some knowledge of the size of the I/O cache by the syncscan logic,
but that's where the largest amount of I/O cache memory (by far) is located.

- Luke  


On 5/15/07 3:34 PM, "Jim C. Nasby" <decibel@decibel.org> wrote:

> On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
>> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
>>> Luke Lonergan wrote:
>>>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
>>>> effect.
>>>> 
>>>> How about using 256/blocksize?
>>> 
>>> Sounds reasonable. We need to check the effect on the synchronized
>>> scans, though.
>>> 
>> 
>> I am a little worried that there will be greater differences in position
>> as the number of scans increases. If we have only 8 buffers and several
>> scans progressing, will they all be able to stay within a few buffers of
>> each other at any given time?
>> 
>> Also, with 8 buffers, that means each scan must report every 4 pages at
>> most (and maybe every page), which increases lock contention (the new
>> design Heikki and I discussed requires a lock every time a backend
>> reports its position).
> 
> Given that spoiling the L2 cache is a trivial cost compared to extra
> physical IO, ISTM we should go with a largish ring for sync scans. What
> do you think would be the ideal size? 32 buffers?




Re: Seq scans roadmap

From
Jeff Davis
Date:
On Wed, 2007-05-16 at 10:31 -0700, Luke Lonergan wrote:
> I think the analysis on syncscan needs to take the external I/O cache into
> account.  I believe it is not necessary or desirable to keep the scans in
> lock step within the PG bufcache.

I partially agree. I don't think we need any huge amount of shared
buffers for the scans to use, and the scans should be able to catch up
by using the OS buffer cache (which is still faster than fetching from
disk). 

However, as I said before, I'm worried that, if the ring is too small,
it might not work to my expectations. I will try to test this to
eliminate my doubts.

> The main benefit of a sync scan will be the ability to start the scan where
> other scans have already filled the I/O cache with useful blocks.  This will
> require some knowledge of the size of the I/O cache by the syncscan logic,
> but that's where the largest amount of I/O cache memory (by far) is located.
> 

I don't think it's necessarily the largest "by far". However, it may be
the largest.

If you mean that the main benefit of sync scans is to make use of blocks
that happen to be in cache before the scan began, I disagree. I think
that's a possible benefit, but I was unable to show any huge benefit in
my tests (someone may be able to on different hardware with different
test cases).

The main benefits that I see are:
 (1) reduce total number of blocks read from disk by making use of
blocks as they are read by another concurrent seqscan.
 (2) eliminate the need for random I/O on concurrent sequential
scans.

I like the idea of using already-in-cache blocks. The code is there and
it works, but I just didn't see the benefits yet. After I get any issues
with this patch resolved and the reviewers are satisfied, I'd like to
work on it.

Regards,
Jeff Davis






Re: Seq scans roadmap

From
"Luke Lonergan"
Date:
Hi Jeff,

On 5/16/07 4:56 PM, "Jeff Davis" <pgsql@j-davis.com> wrote:

>> The main benefit of a sync scan will be the ability to start the scan where
>> other scans have already filled the I/O cache with useful blocks.  This will
>> require some knowledge of the size of the I/O cache by the syncscan logic,
>> but that's where the largest amount of I/O cache memory (by far) is located.

> I don't think it's necessarily the largest "by far". However, it may be
> the largest.

Compare the size of a 32-block ring buffer (between 256KB and 1024KB)
with the 16,000,000 KB of RAM on a common server, automatically used to
maximum extent by the OS's dynamic I/O caching.

> If you mean that the main benefit of sync scans is to make use of blocks
> that happen to be in cache before the scan began, I disagree.

That's not what I meant.

> I think
> that's a possible benefit, but I was unable to show any huge benefit in
> my tests (someone may be able to on different hardware with different
> test cases).

I agree, I don't think this is worth pursuing.

> The main benefits that I see are:
>  (1) reduce total number of blocks read from disk by making use of
> blocks as they are read by another concurrent seqscan.
>  (2) eliminate the need for random I/O on concurrent sequential scans.

Yes on (1), but with (2), again, the OS prefetch reduces the seeking to a
minimal level.

With (1), we just have to define the meaning of "concurrent" to be "within
the span of the OS I/O cache" and we're describing the same effect.

- Luke