Thread: old synchronized scan patch
Now that 8.3 is open, I was considering a revival of this old patch: http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php I could probably clean it up with a little help from someone on this list. Advantages of this patch: If multiple seq scans are going in parallel on the same table, it can drastically increase the cache hit rate by synchronizing the scans. In theory, it should also prevent the problem where sequential scans can turn into random access from the disk. Disadvantages: * sequential scans no longer return results in a deterministic order. As I understand it, this has effects on the regression tests and possibly a few other locations in the code. * While it should improve performance quite a lot in its use case, that use case is quite narrow: you need to be running sequential scans that are reading tuples from the same table at the same time. That means the table can't fit into memory, but must be small enough that it is sane to be executing multiple sequential scans on it in parallel. Is there some interest in this patch? How would I go about proving whether it's useful enough or not? Regards, Jeff Davis
Jeff, On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: > Now that 8.3 is open, I was considering a revival of this old patch: > > http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > I could probably clean it up with a little help from someone on this > list. > > Advantages of this patch: If multiple seq scans are going in parallel on > the same table, it can drastically increase cache hit rate by > synchronizing the scans. In theory, it should also prevent the problem > where sequential scans can turn into random access from the disk. > > Disadvantages: > * sequential scans no longer return results in a deterministic order. As > I understand it, this has effects on the regression tests and possibly a > few other locations in the code. > * While the in the use case, it should improve performance quite a lot. > However, the use case is quite narrow: you need to be running sequential > scans that are reading tuples from the same table at the same time. That > means the table can't fit into memory, but must be small enough that it > is sane to be executing multiple sequential scans on it in parallel. > > Is there some interest in this patch? Yes. For a scenario of interest let's assume: 1 - the Postgres cache is "scan resistant" such that a seq scan of a large table is not mapped into the block cache. 2 - the O/S cache is not "scan resistant" so that it will map all blocks read through the cache 3 - there is I/O prefetch in the OS that keeps the I/O subsystem efficient under multi-user sequential scanning of different tables/files With (2), if you issue 5 queries that all seq scan a table that is much larger than memory, they will complete in nearly the time of one query. I've tested this up to 1,000 simultaneous queries on Linux and get huge benefits. This scenario is only sometimes helpful in real production usage. Given (2), I think sync scan would only have a multi-user benefit when the scans occur in periods separated by more than the amount of time it takes to scan a cache size worth of data from disk. For example: If there is 16GB of RAM, the OS I/O cache might have room for 14GB of blocks. On a good (ahem) system it might take 14 seconds to scan that amount of data into cache from disk (yes, 1GB/s). If a new backend begins scanning after that 14 seconds, it won't benefit from the data in the cache because the new data from disk replaces the oldest data in cache and so on. Since we're talking O(seconds) separation between user queries, I think we could have a benefit. Where I think sync scan could have a big benefit is for multi-user business intelligence workloads where there are a few huge fact tables of interest to a wide audience. Example: 5 business analysts come to work at 9AM and start ad-hoc queries expected to run in about 15 minutes each. Each query sequential scans a 10 billion row fact table once, which takes 10 minutes of the query runtime. With sync scan the last one completes in 35 minutes. Without sync scan the last completes in 75 minutes. In this case sync scan significantly improves the experience of 5 people. > How would I go about proving whether it's useful enough or not? Can you run the above scenario on a table whose size is ten times the memory on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM BIGTABLE" should be sufficient, but the scans need to be separated by enough time to invalidate the OS I/O cache. - Luke
On Mon, 2006-12-04 at 15:03 -0800, Luke Lonergan wrote: > Jeff, > > Now that 8.3 is open, I was considering a revival of this old patch: > > > > http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > > > I could probably clean it up with a little help from someone on this > > list. > > > > > > Is there some interest in this patch? > > Yes. > <snip> > Where I think sync scan could have a big benefit is for multi-user business > intelligence workloads where there are a few huge fact tables of interest to > a wide audience. Example: 5 business analysts come to work at 9AM and start > ad-hoc queries expected to run in about 15 minutes each. Each query > sequential scans a 10 billion row fact table once, which takes 10 minutes of > the query runtime. With sync scan the last one completes in 35 minutes. > Without sync scan the last completes in 75 minutes. In this case sync scan > significantly improves the experience of 5 people. > Thank you for your input. > > How would I go about proving whether it's useful enough or not? > > Can you run the above scenario on a table whose size is ten times the memory > on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM > BIGTABLE" should be sufficient, but the scans need to be separated by enough > time to invalidate the OS I/O cache. > I'll try to run a test like that this week. I will be doing this on my home hardware (bad, consumer-grade stuff), so if I gave you a patch against HEAD could you test it against some more real hardware (and data)? To open up the implementation topic: My current patch starts a new sequential scan on a given relation at the page of an already-running scan. It makes no guarantees that the scans stay together, but in practice I don't think they deviate much. To try to enforce synchronization of scanning I fear would do more harm than good. Thoughts? Also, it's more of a "hint" system that uses a direct mapping of the relation's Oid to hold the position of the scan. That means that, in rare cases, the page offset could be wrong, in which case it will degenerate to the current performance characteristics with no cost. The benefit of doing it this way is that it's simple code, with essentially no performance penalty or additional locking. Also, I can use a fixed amount of shared memory (1 page is about right). Regards, Jeff Davis
"Luke Lonergan" <llonergan@greenplum.com> writes: > On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: >> Now that 8.3 is open, I was considering a revival of this old patch: >> http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > Where I think sync scan could have a big benefit is for multi-user business > intelligence workloads where there are a few huge fact tables of interest to > a wide audience. Example: 5 business analysts come to work at 9AM and start > ad-hoc queries expected to run in about 15 minutes each. The problem I've got with this is that the use-case seems a bit narrow (maybe not to Greenplum, but compared to the general Postgres audience) whereas the proposed patch is invasive in terms of changing well-established behavior, and what's worse, potentially *reducing* performance for average workloads. In particular I'm concerned about the shared-memory area where the scan status is stored becoming a contention bottleneck. It would presumably have access demand comparable to the bufmgr, ie one hit per page scanned, and we already know that it's extremely hard to keep bufmgr from being a major source of contention. One thing you might consider doing is to have the patch do nothing when scanning tables that are less than, perhaps, 10x shared_buffers long. This would at least avoid the regression-test-breakage problem (at the cost of not having the patch tested at all by said tests :-(). Anyway I think the major stumbling block is to be able to show that the patch has only negligible performance impact in cases where it's not able to provide a benefit. regards, tom lane
On Mon, 2006-12-04 at 18:45 -0500, Tom Lane wrote: > "Luke Lonergan" <llonergan@greenplum.com> writes: > > On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: > >> Now that 8.3 is open, I was considering a revival of this old patch: > >> http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > > Where I think sync scan could have a big benefit is for multi-user business > > intelligence workloads where there are a few huge fact tables of interest to > > a wide audience. Example: 5 business analysts come to work at 9AM and start > > ad-hoc queries expected to run in about 15 minutes each. > > Anyway I think the major stumbling block is to be able to show that the > patch has only negligible performance impact in cases where it's not > able to provide a benefit. My patch was specifically designed with this in mind, and unless I'm missing something, should have negligible performance impact. The initial implementation directly maps the Oid of the relation to a location in a shared-memory table (not table as in relation, but table as in area of memory). Then, each time a sequential scan fetches a page from that relation, it stores that page offset. When a scan starts, it checks that number to see if it's in the valid range for the relation's file, and if it is, that's where it starts the scan in the file. If it's not in the valid range, it starts at offset 0. Since I am not storing any pointers, and since the information is only really a hint, I don't need to do any locking on that page. In the case of a collision in the in-memory table, or if the offset in the relation overflows the data type used to store it, it should be no worse than current behavior. Regards, Jeff Davis
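To make the mechanism Jeff describes concrete, here is a minimal standalone sketch in C. The names (scan_hints, ss_report_location, ss_get_location, HINT_SLOTS) are invented for illustration and are not the actual patch; the point is only that both the report and the lookup are single unlocked 4-byte stores and loads into a fixed-size array indexed by a hash of the relation's Oid, so a stale or colliding entry can at worst send a new scan to an arbitrary (but still valid) starting block.

    /* Illustrative sketch only, not the actual patch. */
    #include <stdint.h>

    typedef uint32_t Oid;
    typedef uint32_t BlockNumber;

    #define HINT_SLOTS 1024              /* roughly one 8 KB page of entries */

    /* One shared-memory array, directly indexed by a hash of the table's Oid.
     * Reads and writes are plain 4-byte loads/stores with no locking, so a
     * slot may occasionally hold a stale or colliding value. */
    static BlockNumber scan_hints[HINT_SLOTS];   /* imagine this in shared memory */

    static unsigned
    hint_slot(Oid relid)
    {
        return relid % HINT_SLOTS;       /* Jeff: "right now I just mod by the table size" */
    }

    /* Called by a running seq scan after it reads a page. */
    static void
    ss_report_location(Oid relid, BlockNumber blkno)
    {
        scan_hints[hint_slot(relid)] = blkno;
    }

    /* Called when a new seq scan starts; returns the block to start at. */
    static BlockNumber
    ss_get_location(Oid relid, BlockNumber rel_nblocks)
    {
        BlockNumber hint = scan_hints[hint_slot(relid)];

        /* The value is only a hint: if it is out of range (collision, stale
         * entry, truncated relation), fall back to starting at block 0. */
        if (hint >= rel_nblocks)
            return 0;
        return hint;
    }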
Jeff, On 12/4/06 3:38 PM, "Jeff Davis" <pgsql@j-davis.com> wrote: > I'll try to run a test like that this week. I will be doing this on my > home hardware (bad, consumer-grade stuff), so if I gave you a patch > against HEAD could you test it against some more real hardware (and > data)? Sure. > To open up the implementation topic: > > My current patch starts a new sequential scan on a given relation at the > page of an already-running scan. It makes no guarantees that the scans > stay together, but in practice I don't think they deviate much. To try > to enforce synchronization of scanning I fear would do more harm than > good. Thoughts? I think this is good enough - a "background scanner" approach would be the logical alternative, but may not be necessary. I suspect you are correct about the scans being nearly synced. This may not be the case if and when we implement a priority based scheduler, but in that case we're already managing the throughput based on content anyway. > Also, it's more of a "hint" system that uses a direct mapping of the > relations Oid to hold the position of the scan. That means that, in rare > cases, the page offset could be wrong, in which case it will degenerate > to the current performance characteristics with no cost. The benefit of > doing it this way is that it's simple code, with essentially no > performance penalty or additional locking. Also, I can use a fixed > amount of shared memory (1 page is about right). If I understand correctly, Tom's concern is that this page is potentially accessed once for every page read and may consequently become very hot. How do you manage the scan position so this doesn't happen? Do we have any readahead in the I/O layer? There is certainly readahead in the OS I/O cache, but it's dynamic and we don't know the block position... - Luke
On Mon, 2006-12-04 at 16:47 -0800, Luke Lonergan wrote: > Jeff, > > My current patch starts a new sequential scan on a given relation at the > > page of an already-running scan. It makes no guarantees that the scans > > stay together, but in practice I don't think they deviate much. To try > > to enforce synchronization of scanning I fear would do more harm than > > good. Thoughts? > > I think this is good enough - a "background scanner" approach would be the > logical alternative, but may not be necessary. I suspect you are correct > about the scans being nearly synced. > A background scanner requires synchronizing the backends, which can cause all kinds of bad performance problems. Otherwise, how would you ensure that ALL the backends that need a page get it before the train moves on? I don't think it's necessary or desirable to have a background scanner. > This may not be the case if and when we implement a priority based > scheduler, but in that case we're already managing the throughput based on > content anyway. > Right. I don't see how you'd be able to get the data to a backend that needs it without running that backend. If it's a priority scheduler, you may not run that backend. > > Also, it's more of a "hint" system that uses a direct mapping of the > > relations Oid to hold the position of the scan. That means that, in rare > > cases, the page offset could be wrong, in which case it will degenerate > > to the current performance characteristics with no cost. The benefit of > > doing it this way is that it's simple code, with essentially no > > performance penalty or additional locking. Also, I can use a fixed > > amount of shared memory (1 page is about right). > > If I understand correctly, Tom's concern is that this page is potentially > accessed once for every page read and may consequently become very hot. How > do you manage the scan position so this doesn't happen? Do we have any > readahead in the I/O layer? There is certainly readahead in the OS I/O > cache, but it's dynamic and we don't know the block position... > My proposal is hint-based, and I consider any reads coming from that data page to be untrusted. I can just statically map the Oid of the relation onto a location in the page (which may or may not collide with another Oid). The advantage here is that I don't need to lock. There's no contention because every access is just reading a word or two from the shared page (or writing a word or two). Of course, it must be a static data structure because it can't contain pointers. But 8KB should be enough to hold information on plenty of interesting relations. I don't trust the data because collisions are possible. If the number is obviously wrong (higher than the number of pages in the relation file), a new scan starts at offset 0. If it's wrong but that can't be detected, it will start at that location anyway, which is OK because that arbitrary value is no worse than the arbitrary value of 0. The whole premise of my implementation is: (1) Get 99% of the benefit (2) Pay near-zero performance penalty, even in the worst case. (3) Simple code changes. If those 3 things aren't true, let me know. The way I see it, the only real cost is that it may break things that assume deterministic sequential scans. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Since I am not storing any pointers, and since the information is only > really a hint, I don't need to do any locking on that page. If you think that, you need not bother to submit the patch. (Hint: as soon as you consider more than one table at a time, it doesn't work, even ignoring the question of inconsistent reads.) regards, tom lane
On Mon, 2006-12-04 at 21:46, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Since I am not storing any pointers, and since the information is only > > really a hint, I don't need to do any locking on that page. > > If you think that, you need not bother to submit the patch. (Hint: > as soon as you consider more than one table at a time, it doesn't work, > even ignoring the question of inconsistent reads.) Why does it not work? Are you suggesting that another backend can somehow see only some bits of a page number being written? What problems do you see in the multiple-table case? -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
> To open up the implementation topic: > > My current patch starts a new sequential scan on a given relation at the > page of an already-running scan. I think it should start earlier to make use of the already cached part of the table. It is hard to determine, or guess how much is still in cache though. As a start you could maybe use some (small to be safe) fraction of effective_cache_size or shared_buffers. Andreas
Hannu Krosing wrote: > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: >> Jeff Davis <pgsql@j-davis.com> writes: >>> Since I am not storing any pointers, and since the information is only >>> really a hint, I don't need to do any locking on that page. >> If you think that, you need not bother to submit the patch. (Hint: >> as soon as you consider more than one table at a time, it doesn't work, >> even ignoring the question of inconsistent reads.) > > Why does it not work ? > > Are you suggesting, that another backend can somegow see only some bits > of page number being written ? > > What problems do you see in multiple table case ? You need to manage adding and removing relations from the shared memory structure. Which typically needs locking. Assuming that relations are added or removed relatively seldom, you might get away with a table of (Oid, BlockNumber) pairs, working around the fact that the table might get messed up every now and then, and when it does, you'll lose the benefits until it gets corrected. But it seems really messy to me. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2006-12-05 at 10:38, Heikki Linnakangas wrote: > Hannu Krosing wrote: > > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: > >> Jeff Davis <pgsql@j-davis.com> writes: > >>> Since I am not storing any pointers, and since the information is only > >>> really a hint, I don't need to do any locking on that page. > >> If you think that, you need not bother to submit the patch. (Hint: > >> as soon as you consider more than one table at a time, it doesn't work, > >> even ignoring the question of inconsistent reads.) > > > > Why does it not work ? > > > > Are you suggesting, that another backend can somegow see only some bits > > of page number being written ? > > > > What problems do you see in multiple table case ? > > You need to manage adding and removing relations from the shared memory > structure. Which typically needs locking. My impression was that Jeff planned to do simple hashing to one of N values and to write the current page number there, maybe not even every page number but each M-th page number. Assuming that writing a 4-byte page number is an atomic operation, there is no need for locking. > Assuming that relations are added or removed relatively seldom, you > might get away with a table of (Oid, BlockNumber) pairs, working around > the fact that the table might get messed up every now and then, and when > it does, you'll lose the benefits until it gets corrected. But it seems > really messy to me. Just storing the page number at an offset is cheap (and imho nice and clean too). The worst that can happen is a hash collision, in which case you lose the benefits of sync scans, but you won't degrade compared to non-sync scans. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Hannu Krosing wrote: > The worst that can happen, is a hash collision, in which case you lose > the benefits of sync scans, but you wont degrade compared to non-sync > scans But it could cause "mysterious" performance regressions, no? Imagine that your app includes two large tables, which are both scanned frequently. Suppose that synchronous scanning gives this use-case a noticeable performance boost. Now, you dump and reload your schema, and suddenly the hashes of the oids of those tables collide. You perceive a noticeable drop in performance that you can neither explain nor fix without a rather deep understanding of postgres internals. greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Hannu Krosing wrote: >> The worst that can happen, is a hash collision, in which case you lose >> the benefits of sync scans, but you wont degrade compared to non-sync >> scans > But it could cause "mysterious" performance regressions, no? There are other issues for the "no lock" approach that Jeff proposes. Suppose that we have three or four processes that are actually doing synchronized scans of the same table. They will have current block numbers that are similar but probably not identical. They will all be scribbling on the same hashtable location. So if another process comes along to join the "pack", it might get the highest active block number, or the lowest, or something in between. Even discounting the possibility that it gets random bits because it managed to read the value non-atomically, how well is that really going to work? Another thing that we have to consider is that the actual block read requests will likely be distributed among the "pack leaders", rather than all being issued by one process. AFAIK this will destroy the kernel's ability to do read-ahead, because it will fail to recognize that sequential reads are being issued --- any single process is *not* reading sequentially, and I think that read-ahead scheduling is generally driven off single-process behavior rather than the emergent behavior of the whole system. (Feel free to contradict me if you've actually read any kernel code that does this.) It might still be better than unsynchronized reads, but it'd be leaving a lot on the table. regards, tom lane
Tom Lane wrote: > "Florian G. Pflug" <fgp@phlo.org> writes: >> Hannu Krosing wrote: >>> The worst that can happen, is a hash collision, in which case you lose >>> the benefits of sync scans, but you wont degrade compared to non-sync >>> scans > >> But it could cause "mysterious" performance regressions, no? > > There are other issues for the "no lock" approach that Jeff proposes. > Suppose that we have three or four processes that are actually doing > synchronized scans of the same table. They will have current block > numbers that are similar but probably not identical. They will all be > scribbling on the same hashtable location. So if another process comes > along to join the "pack", it might get the highest active block number, > or the lowest, or something in between. Even discounting the possibility > that it gets random bits because it managed to read the value > non-atomically, how well is that really going to work? Could that be addressed by something along the lines of "Don't update the block number if the new value is between the current number and the current number - update count/2" (update count being the number of blocks a backend reads before updating the hashtable again)? At least this would prevent needless updates to nearly the same block number, which would help NUMA-style architectures like Opteron systems, I think. > Another thing that we have to consider is that the actual block read > requests will likely be distributed among the "pack leaders", rather > than all being issued by one process. AFAIK this will destroy the > kernel's ability to do read-ahead, because it will fail to recognize > that sequential reads are being issued --- any single process is *not* > reading sequentially, and I think that read-ahead scheduling is > generally driven off single-process behavior rather than the emergent > behavior of the whole system. (Feel free to contradict me if you've > actually read any kernel code that does this.) It might still be better > than unsynchronized reads, but it'd be leaving a lot on the table. I don't see why a single process wouldn't be reading sequentially - as far as I understood the original proposal, the current block number from the hashtable is only used as a starting point for sequential scans. After that, each backend reads sequentially until the end of the table, I believe, no? greetings, Florian Pflug
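A sketch of the damping rule Florian suggests, with invented names (UPDATE_COUNT stands for his "update count", the number of blocks a backend reads between hint updates). The write is skipped when the new position is no more than UPDATE_COUNT/2 blocks behind the value already stored, so near-identical pack members stop dirtying the same shared cache line:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define UPDATE_COUNT 16              /* illustrative value only */

    static void
    maybe_update_hint(volatile BlockNumber *slot, BlockNumber new_blkno)
    {
        BlockNumber cur = *slot;

        /* Skip the write if new_blkno lies in [cur - UPDATE_COUNT/2, cur]. */
        if (new_blkno <= cur && cur - new_blkno <= UPDATE_COUNT / 2)
            return;

        *slot = new_blkno;
    }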
Florian G. Pflug wrote: > Tom Lane wrote: >> There are other issues for the "no lock" approach that Jeff proposes. >> Suppose that we have three or four processes that are actually doing >> synchronized scans of the same table. They will have current block >> numbers that are similar but probably not identical. They will all be >> scribbling on the same hashtable location. So if another process comes >> along to join the "pack", it might get the highest active block number, >> or the lowest, or something in between. Even discounting the possibility >> that it gets random bits because it managed to read the value >> non-atomically, how well is that really going to work? It might in fact work quite well. Assuming that all the active blocks are in memory, the process that joins the pack will quickly catch up with the rest. I'd like to see some testing to be sure, though... >> Another thing that we have to consider is that the actual block read >> requests will likely be distributed among the "pack leaders", rather >> than all being issued by one process. AFAIK this will destroy the >> kernel's ability to do read-ahead, because it will fail to recognize >> that sequential reads are being issued --- any single process is *not* >> reading sequentially, and I think that read-ahead scheduling is >> generally driven off single-process behavior rather than the emergent >> behavior of the whole system. (Feel free to contradict me if you've >> actually read any kernel code that does this.) It might still be better >> than unsynchronized reads, but it'd be leaving a lot on the table. > I don't see why a single process wouldn't be reading sequentially - As far > as I understood the original proposal, the current blocknumber from the > hashtable is only used as a starting point for sequential scans. After > that, > each backend reads sequentiall until the end of the table I believe, no? When the read is satisfied from the shared memory cache, it won't make it to the kernel. From the kernel's point of view, the pattern looks something like this:
A 1--4--7-9
B -2---6---
C --3-5--8-
where letters denote processes, numbers are block numbers read, and time goes from left to right. When you look at one process at a time, the pattern looks random, though it's constantly moving forward. It's only when you look at all the processes together that you see that it's sequential. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Florian G. Pflug wrote: >> I don't see why a single process wouldn't be reading sequentially - As far >> as I understood the original proposal, the current blocknumber from the >> hashtable is only used as a starting point for sequential scans. After >> that, >> each backend reads sequentiall until the end of the table I believe, no? > When the read is satisfies from shared mem cache, it won't make it to > the kernel. Right, and the *whole point* of this proposal is that only one of the N processes doing a synchronized scan actually does a read of any particular block. The problem is that they're not synchronized well enough to ensure that it's always the same one. It strikes me that there's still another thing we'd have to deal with to make this work nicely. If you have N processes doing a synchronized scan, then each block that reaches shared memory is going to be hit N times in fairly short succession --- which is going to be enough to convince the bufmgr to keep it in memory for awhile. Thus a synchronized seqscan is likely to end up flushing buffer cache in a way that independent seqscans could not. This could probably be dealt with by changing the ReadBuffer API to allow the caller to say "don't increment the refcount on this page", or some such. But it's some more work to be done if we intend to take this idea seriously. regards, tom lane
On Mon, 2006-12-04 at 21:46 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Since I am not storing any pointers, and since the information is only > > really a hint, I don't need to do any locking on that page. > > If you think that, you need not bother to submit the patch. (Hint: > as soon as you consider more than one table at a time, it doesn't work, > even ignoring the question of inconsistent reads.) > I believe I accounted for that case correctly. In the unlikely event that it gets a wrong hint value from the table, it would either: (1) See that the value is larger than rs_nblocks and start from 0 like normal (2) Not know that the value is wrong because it is a valid block for that relation, and it would use it anyway. In the case of #2, everything should be fine, because an arbitrary value is no worse than the arbitrary value of 0 we use right now. I could always store the Oid in the table also, so that #2 wouldn't happen unless the table was truncated (by TRUNCATE, CLUSTER, or VACUUM FULL) after a scan. Regards,Jeff Davis
On Tue, 2006-12-05 at 10:38 +0000, Heikki Linnakangas wrote: > Hannu Krosing wrote: > > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: > >> Jeff Davis <pgsql@j-davis.com> writes: > >>> Since I am not storing any pointers, and since the information is only > >>> really a hint, I don't need to do any locking on that page. > >> If you think that, you need not bother to submit the patch. (Hint: > >> as soon as you consider more than one table at a time, it doesn't work, > >> even ignoring the question of inconsistent reads.) > > > > Why does it not work ? > > > > Are you suggesting, that another backend can somegow see only some bits > > of page number being written ? > > > > What problems do you see in multiple table case ? > > You need to manage adding and removing relations from the shared memory > structure. Which typically needs locking. > > Assuming that relations are added or removed relatively seldom, you > might get away with a table of (Oid, BlockNumber) pairs, working around > the fact that the table might get messed up every now and then, and when > it does, you'll lose the benefits until it gets corrected. But it seems > really messy to me. > I don't think it's messy if we think of it as a hint system. I think my problem is that I am calling it Synchronized Scanning, which is a misnomer because I am not enforcing the synchronization. I call it that because that's what Simon Riggs called it back when I first suggested it, but in reality it's just a hint system. I see that right now we start every scan at 0, which is a completely arbitrary value. I am just trying to add a little intelligence to it with the hint, even if sometimes (in very rare circumstances) the intelligence turns out to be just as dumb as picking the value 0. That being said, I can just lock the hint table (the shared memory hint table, not the relation) and only update the hint every K pages, as Neil Conway suggested when I first proposed it. If we find a K small enough so the feature is useful, but large enough that we're sure there won't be contention, this might be a good option. However, I don't know that we would eliminate the contention, because if K is a constant (rather than random), the backends would still all want to update that shared memory table at the same time. Regards, Jeff Davis
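A sketch of the "update only every K pages" variant Jeff mentions (K, the struct and the lock comment are placeholders). Each scan keeps a private counter so the shared slot is written at most once per K pages, which bounds contention if a lock were ever put around the write; randomizing K per backend, as Jeff hints, would keep the reports from clustering:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define K 64    /* pages between reports; illustrative only */

    typedef struct SeqScanPos
    {
        BlockNumber cur_block;
        int         pages_since_report;
    } SeqScanPos;

    static void
    advance_scan(SeqScanPos *scan, volatile BlockNumber *slot)
    {
        scan->cur_block++;
        if (++scan->pages_since_report >= K)
        {
            /* With locking, this store would be bracketed by acquire/release. */
            *slot = scan->cur_block;
            scan->pages_since_report = 0;
        }
    }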
On Tue, 2006-12-05 at 10:49 -0500, Tom Lane wrote: > "Florian G. Pflug" <fgp@phlo.org> writes: > > Hannu Krosing wrote: > >> The worst that can happen, is a hash collision, in which case you lose > >> the benefits of sync scans, but you wont degrade compared to non-sync > >> scans > > > But it could cause "mysterious" performance regressions, no? > > There are other issues for the "no lock" approach that Jeff proposes. > Suppose that we have three or four processes that are actually doing > synchronized scans of the same table. They will have current block > numbers that are similar but probably not identical. They will all be > scribbling on the same hashtable location. So if another process comes > along to join the "pack", it might get the highest active block number, > or the lowest, or something in between. Even discounting the possibility > that it gets random bits because it managed to read the value > non-atomically, how well is that really going to work? > That's an empirical question. I expect it will work, since any active scan will have a significant cache trail behind it. > Another thing that we have to consider is that the actual block read > requests will likely be distributed among the "pack leaders", rather > than all being issued by one process. AFAIK this will destroy the > kernel's ability to do read-ahead, because it will fail to recognize > that sequential reads are being issued --- any single process is *not* > reading sequentially, and I think that read-ahead scheduling is > generally driven off single-process behavior rather than the emergent > behavior of the whole system. (Feel free to contradict me if you've > actually read any kernel code that does this.) It might still be better > than unsynchronized reads, but it'd be leaving a lot on the table. > That's a very interesting point. I had assumed read-ahead was at the kernel level without really caring what processes issued the requests, but it may be system-dependent. I think that's what the elevator (or I/O scheduler, or whatever it's called) is supposed to do. I'll see if I can find some relevant source code in a Linux or FreeBSD box. The controller certainly wouldn't care about process IDs, however. Regards,Jeff Davis
On Tue, 2006-12-05 at 11:45 -0500, Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > > Florian G. Pflug wrote: > >> I don't see why a single process wouldn't be reading sequentially - As far > >> as I understood the original proposal, the current blocknumber from the > >> hashtable is only used as a starting point for sequential scans. After > >> that, > >> each backend reads sequentiall until the end of the table I believe, no? > > > When the read is satisfies from shared mem cache, it won't make it to > > the kernel. > > Right, and the *whole point* of this proposal is that only one of the N > processes doing a synchronized scan actually does a read of any > particular block. The problem is that they're not synchronized well > enough to ensure that it's always the same one. If readahead is per-process (rather than for the entire system), my implementation would probably fall short. I would like to try to find out for sure whether this is the case or not, or whether it's system-dependent. > It strikes me that there's still another thing we'd have to deal with > to make this work nicely. If you have N processes doing a synchronized > scan, then each block that reaches shared memory is going to be hit N > times in fairly short succession --- which is going to be enough to > convince the bufmgr to keep it in memory for awhile. Thus a > synchronized seqscan is likely to end up flushing buffer cache in a way > that independent seqscans could not. Interesting. That may be an important consideration. If a bunch of backends read the block though, I don't see it as a major loss if it hangs around to see if one more backend needs it. > This could probably be dealt with by changing the ReadBuffer API to > allow the caller to say "don't increment the refcount on this page", > or some such. But it's some more work to be done if we intend to > take this idea seriously. > Thank you for the input. Regards, Jeff Davis
On Tue, 2006-12-05 at 15:54 +0100, Florian G. Pflug wrote: > Hannu Krosing wrote: > > The worst that can happen, is a hash collision, in which case you lose > > the benefits of sync scans, but you wont degrade compared to non-sync > > scans > > But it could cause "mysterious" performance regressions, no? > Image that your app includes two large tables, which are both > scannen frequently. Suppose that synchronous scanning gives this > use-case a noticeable performance boost. Now, you dump and reload > your schema, and suddently the hashes of oids of those tables > collide. You percieve a noticeable drop in performance that you > can neither explain nor fix without a rather deep understanding > of postgres internals. > A good point. We can hopefully make this relatively rare with a decent hashing algorithm (right now I just mod by the table size), and a reasonable-sized table. For your problem to occur, you'd need two relations which are both scanned very frequently at the same time and also have a hash collision. We can mitigate the problem by not reporting to the table unless the table is a minimum size (perhaps related to effective_cache_size), so that tables that are in memory anyway don't write over another table's hint. Or, we could use a dynamic structure, use locking, and only write a hint every K pages, or something similar. Regards,Jeff Davis
On Tue, 2006-12-05 at 11:28 +0100, Zeugswetter Andreas ADI SD wrote: > > To open up the implementation topic: > > > > My current patch starts a new sequential scan on a given relation at > the > > page of an already-running scan. > > I think it should start earlier to make use of the already cached part > of the table. > It is hard to determine, or guess how much is still in cache though. > As a start you could maybe use some (small to be safe) fraction of > effective_cache_size > or shared_buffers. Interesting idea, however it's a little tricky. For instance, we don't want to start before the oldest cached page, or we have lost the benefit (just as if we got a bad hint). It also depends on a lot of other parts of the code, like how scan-resistant the shared buffers are, and how scan-resistant the OS buffer cache is. Thoughts? Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Tue, 2006-12-05 at 11:45 -0500, Tom Lane wrote: >> ... If you have N processes doing a synchronized >> scan, then each block that reaches shared memory is going to be hit N >> times in fairly short succession --- which is going to be enough to >> convince the bufmgr to keep it in memory for awhile. Thus a >> synchronized seqscan is likely to end up flushing buffer cache in a way >> that independent seqscans could not. > Interesting. That may be an important consideration. If a bunch of > backends read the block though, I don't see it as a major loss if it > hangs around to see if one more backend needs it. Sure, it should hang around for awhile, and will. The problem is that its lifetime will be artificially inflated, so that the seqscan ends up kicking out other blocks that are really of greater importance, rather than recycling its own old blocks as it should. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Sure, it should hang around for awhile, and will. The problem is that > its lifetime will be artificially inflated, so that the seqscan ends up > kicking out other blocks that are really of greater importance, rather > than recycling its own old blocks as it should. I thought you had switched this all to a clock sweep algorithm. The clock sweep method is supposed to be especially resistant to this since it doesn't really matter how many times something is accessed, only whether it has been accessed recently. As long as all the synchronized scans get their work done before the clock comes around the block will be recycled the next time the clock sweeps around and it finds nobody else is interested in that block. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> Sure, it should hang around for awhile, and will. The problem is that >> its lifetime will be artificially inflated, so that the seqscan ends up >> kicking out other blocks that are really of greater importance, rather >> than recycling its own old blocks as it should. > I thought you had switched this all to a clock sweep algorithm. Yeah ... it's a clock sweep with counter. A buffer's counter is incremented by each access and decremented when the sweep passes over it. So multiple accesses allow the buffer to survive longer. For a large seqscan you really would rather the counter stayed at zero, because you want the buffers to be recycled when the sweep comes back the first time. regards, tom lane
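For readers following along, here is a minimal standalone sketch of the clock sweep Tom describes (names invented; the real code lives in the buffer manager and also has to respect pins, which are omitted here). Each access bumps a capped per-buffer counter, the sweep hand decrements it, and a buffer is only recycled once its counter has fallen to zero:

    #include <stdint.h>

    #define NBUFFERS        1024
    #define MAX_USAGE_COUNT 5

    static uint8_t usage_count[NBUFFERS];
    static int     sweep_hand = 0;

    static void
    buffer_accessed(int buf)
    {
        if (usage_count[buf] < MAX_USAGE_COUNT)
            usage_count[buf]++;
    }

    static int
    clock_sweep_find_victim(void)
    {
        for (;;)
        {
            int buf = sweep_hand;

            sweep_hand = (sweep_hand + 1) % NBUFFERS;
            if (usage_count[buf] == 0)
                return buf;          /* recycle this one */
            usage_count[buf]--;      /* give it one more lap */
        }
    }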
On Tue, Dec 05, 2006 at 09:09:39AM -0800, Jeff Davis wrote: > That being said, I can just lock the hint table (the shared memory hint > table, not the relation) and only update the hint every K pages, as Niel > Conway suggested when I first proposed it. If we find a K small enough > so the feature is useful, but large enough that we're sure there won't > be contention, this might be a good option. However, I don't know that > we would eliminate the contention, because if K is a constant (rather > than random), the backends would still all want to update that shared > memory table at the same time. What about some algorithm where only one backend will update the hint entry (perhaps the first one, or the slowest one (ie: lowest page number))? ISTM that would eliminate a lot of contention, and if you get clever with the locking scheme you could probably allow other backends to do non-blocking reads except when the page number passes a 4-byte value (assuming 4-byte int updates are atomic). Granted, coming up with a way to put one backend in charge of this is non-trivial, but it's another option. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, 2006-12-05 at 13:25 -0500, Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: > > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> Sure, it should hang around for awhile, and will. The problem is that > >> its lifetime will be artificially inflated, so that the seqscan ends up > >> kicking out other blocks that are really of greater importance, rather > >> than recycling its own old blocks as it should. > > > I thought you had switched this all to a clock sweep algorithm. > > Yeah ... it's a clock sweep with counter. A buffer's counter is > incremented by each access and decremented when the sweep passes over > it. So multiple accesses allow the buffer to survive longer. For a > large seqscan you really would rather the counter stayed at zero, > because you want the buffers to be recycled when the sweep comes back > the first time. If you focus the backends together by synchronizing them you definitely also then need to solve the problem of false cache reinforcement. I envisaged that we would handle the problem by having a large SeqScan reuse its previous buffers so it would avoid polluting the cache. If we kept track of how many backends were in link-step together (a "Conga") we would be able to check that a block had not been used by anyone but the Conga members. So we'd need rules about - when to allow a Conga to form (if file is very big we check, if not we don't, no real need for exact synchronisation in all cases) - how to join a Conga - how to leave a Conga if you fall behind The cost of synchronisation (i.e. LWlocks) is much less than the cost of non-synchronisation (i.e. lots more I/O). -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-12-06 at 19:04 +0000, Simon Riggs wrote: > I envisaged that we would handle the problem by having a large SeqScan > reuse its previous buffers so it would avoid polluting the cache. If we > kept track of how many backends were in link-step together (a "Conga") > we would be able to check that a block had not been used by anyone but > the Conga members. > Interesting idea. > So we'd need rules about > - when to allow a Conga to form (if file is very big we check, if not we > don't, no real need for exact synchronisation in all cases) > - how to join a Conga > - how to leave a Conga if you fall behind > How would one leave a Conga? What if it's in a user-defined function, or if it's using the results for a nested loop join or something? So we can't wait for a backend to leave voluntarily; it would have to be a plan where, when the backend goes back to get another block, it realizes that the train has left the station and goes it alone (or makes its own Conga). If you make the join/leave operations such that there is no resistance at all (no timeout or anything), then it becomes the same as my non-synchronized proposal, right? > The cost of synchronisation (i.e. LWlocks) is much less than the cost of > non-synchronisation (i.e. lots more I/O). > If we can have a good way to synchronize it, that sounds good to me. Regards, Jeff Davis
On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: > If you make the join/leave operations such that there is no resistance > at all (no timeout or anything), then it becomes the same as my non- > synchronized proposal, right? Teamwork requires some synchronisation to be effective, but yeh there needs to be a way to leave the Conga if it's not working for you/them. I think we need the synchronisation to make concurrent scans effective, plus Brownian Scans doesn't have the same ring to it :-) I'm still willing to help if you're willing to take this further. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: >> If you make the join/leave operations such that there is no resistance >> at all (no timeout or anything), then it becomes the same as my non- >> synchronized proposal, right? > Teamwork requires some synchronisation to be effective, but yeh there > needs to be a way to leave the Conga if its not working for you/them. Seems like an unworkably complicated mess :-(. Error cases alone would be a nightmare. The good thing about Jeff's proposal is that it's very robust --- there isn't anything you have to clean up if you abort a scan. I think all we need as far as buffer management goes is what I suggested before, namely have seqscans on large tables tell bufmgr not to increment the usage counter for their accesses. If it stays zero then the buffers will be recycled as soon as the sweep gets back to them, which is exactly what we want. The window for additional backends to pick up on the sync scan is then however much of shared_buffers aren't occupied by blocks being accessed normally. If we have syncscan members that are spread out over any significant range of the table's blocks, then the problem of setting the hint properly becomes a lot more pressing. We'd like incoming joiners to start at a fairly low block number, ie not become one of the "pack leaders" but one of the "trailers" --- since the higher they start, the more blocks they'll need to re-read at the end of their cycles, yet those are blocks they could have had "for free" if they'd started as a trailer. I don't see any cheap way to bias the behavior in that direction, though. regards, tom lane
On Wed, 2006-12-06 at 12:48 -0600, Jim C. Nasby wrote: > On Tue, Dec 05, 2006 at 09:09:39AM -0800, Jeff Davis wrote: > > That being said, I can just lock the hint table (the shared memory hint > > table, not the relation) and only update the hint every K pages, as Niel > > Conway suggested when I first proposed it. If we find a K small enough > > so the feature is useful, but large enough that we're sure there won't > > be contention, this might be a good option. However, I don't know that > > we would eliminate the contention, because if K is a constant (rather > > than random), the backends would still all want to update that shared > > memory table at the same time. > > What about some algorithm where only one backend will update the hint > entry (perhaps the first one, or the slowest one (ie: lowest page > number))? ISTM that would eliminate a lot of contention, and if you get > clever with the locking scheme you could probably allow other backends > to do non-blocking reads except when the page number passes a 4-byte > value (assuming 4-byte int updates are atomic). > If we have one backend in charge, how does it pass the torch when it finishes the scan? I think you're headed back in the direction of an independent "scanning" process. That's not unreasonable, but there are a lot of other issues to deal with. One thought of mine goes something like this: A scanner process starts up and scans with a predefined length of a cache trail in the shared_buffers, perhaps a chunk of buffers used like a circular list (so it doesn't interfere with caching). When a new scan starts, it could request a block from this scanner process and begin the scan there. If the new scan keeps up with the scanner process, it will always be getting cached data. If it falls behind, the request turns into a new block request. In theory, the scan could actually catch back up to the scanner process after falling behind. We could use a meaningful event (like activity on a particular relation) to start/stop the scanner process. It's just another idea, but I'm still not all that sure that synchronization is necessary. Does anyone happen to have an answer on whether OS-level readahead is system-wide, or per-process? I expect that it's system wide, but Tom raised the issue and it may be a drawback if some OSs do per-process readahead. Regards,Jeff Davis
On Wed, 2006-12-06 at 19:55 +0000, Simon Riggs wrote: > On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: > > > If you make the join/leave operations such that there is no resistance > > at all (no timeout or anything), then it becomes the same as my non- > > synchronized proposal, right? > > Teamwork requires some synchronisation to be effective, but yeh there > needs to be a way to leave the Conga if its not working for you/them. > > I think we need the synchronisation to make concurrent scans effective, > plus Brownian Scans doesnt have the same ring to it :-) How about "Scan Hinting", or "Smart Scans"? Although, "smart" might be too strong a word ;) Regards,Jeff Davis
On Dec 7, 2006, at 8:01 , Jeff Davis wrote: > On Wed, 2006-12-06 at 19:55 +0000, Simon Riggs wrote: >> Teamwork requires some synchronisation to be effective, but yeh there >> needs to be a way to leave the Conga if its not working for you/them. >> >> I think we need the synchronisation to make concurrent scans >> effective, >> plus Brownian Scans doesnt have the same ring to it :-) > > How about "Scan Hinting", or "Smart Scans"? I must admit, I think "Conga Scan" has a certain ring to it :) Michael Glaesemann grzm seespotcode net
On Wed, 2006-12-06 at 15:12 -0500, Tom Lane wrote: > I think all we need as far as buffer management goes is what I suggested > before, namely have seqscans on large tables tell bufmgr not to > increment the usage counter for their accesses. If it stays zero then > the buffers will be recycled as soon as the sweep gets back to them, > which is exactly what we want. This is good, yet it addresses only the non-cache spoiling behaviour. BTW, we would need something special to spot an Append node with multiple SeqScans occurring on partitioned tables in sequence. Individual scans may not be that large but overall the set could be huge. There's not much point implementing behaviour such as "table must be bigger than 10x shared_buffers" because it works directly against the role of partitioning. > The window for additional backends to > pick up on the sync scan is then however much of shared_buffers aren't > occupied by blocks being accessed normally. Non-cache spoiling means window reduction, so you can't catch it by chance. > If we have syncscan members that are spread out over any significant > range of the table's blocks, then the problem of setting the hint > properly becomes a lot more pressing. We'd like incoming joiners to > start at a fairly low block number, ie not become one of the "pack > leaders" but one of the "trailers" --- since the higher they start, > the more blocks they'll need to re-read at the end of their cycles, > yet those are blocks they could have had "for free" if they'd started > as a trailer. I don't see any cheap way to bias the behavior in that > direction, though. Well, that's the problem. Agree about the pack leaders/trailers. What's the solution? You can synchronise every block or every N blocks, but otherwise: how will you know the optimal point to start the scan? That will require *some* synchronisation to be optimal. Most scans don't go at the same rate naturally. Different plans do different amounts of work between page requests. Allowing them to go their own individual ways would be very wasteful of I/O resources, so making some people wait for others is an essential aspect of efficiency, just like it is with the OS. Synchronisation costs very little in comparison with the I/O it saves. Perhaps there are ways of doing this without central control, so that backend error conditions don't need to be considered. Seems like a simple timeout would be sufficient to exclude backends from the Conga, which would be sufficient to handle error cases and special cases. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-12-06 at 15:12 -0500, Tom Lane wrote: > > Teamwork requires some synchronisation to be effective, but yeh there > > needs to be a way to leave the Conga if its not working for you/them. > > If we have syncscan members that are spread out over any significant > range of the table's blocks, then the problem of setting the hint > properly becomes a lot more pressing. We'd like incoming joiners to > start at a fairly low block number, ie not become one of the "pack > leaders" but one of the "trailers" --- since the higher they start, > the more blocks they'll need to re-read at the end of their cycles, > yet those are blocks they could have had "for free" if they'd started > as a trailer. I don't see any cheap way to bias the behavior in that > direction, though. > This is speculation, but I suspect that there already is somewhat of a bias favoring the lower block numbers. If you have a pack near the head, and a couple trailers 1000 blocks behind, the trailers are likely to be moving through those blocks very quickly -- and reporting their location much more often -- than the scans at the head that are awaiting a disk fetch. Even if there are 50 in the pack, and 2 trailing, at any point in time it's more likely that the last block number reported was reported by a trailing scan. The pack might all report at once when they finally get the block, but will be promptly overwritten by the continuous stream of reports from trailing scans. However, if my analysis was really true, one might wonder how those scans got that far behind in the first place. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Even if there are 50 in the pack, and 2 trailing, at any point in time > it's more likely that the last block number reported was reported by a > trailing scan. The pack might all report at once when they finally get > the block, but will be promptly overwritten by the continuous stream of > reports from trailing scans. > However, if my analysis was really true, one might wonder how those > scans got that far behind in the first place. Yah. Something I was idly wondering about: suppose we teach ReadBuffer to provide an indication whether it had to issue an actual read() or found the block in cache? Could it be useful to not report the block location to the hint area if we had to actually read()? That would eliminate the immediate "pack leader" from the equation. The problem is that it seems to break things for the case of the first follower joining a seqscan, because the original leader would never report. Anyone see the extra idea needed to make this work? regards, tom lane
On Thu, Dec 07, 2006 at 12:46:34AM -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > it's more likely that the last block number reported was reported by a > > trailing scan. The pack might all report at once when they finally get > > the block, but will be promptly overwritten by the continuous stream of > > reports from trailing scans. > > > However, if my analysis was really true, one might wonder how those > > scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? The first reader won't find an entry in shared memory, so it will know it's the first. It could then either always update, or it could check to see if anyone else has updated shared memory behind its back; at that point it could switch to only updating on an actual read. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
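One possible reading of Tom's suggestion combined with Jim's first-reader rule, sketched with invented names (found_in_cache stands for the hypothetical indication ReadBuffer would return): a scan that saw no usable hint when it started treats itself as the leader and keeps reporting unconditionally, while other scans report only blocks that were found in cache, so the hint tends to track the cached trail rather than the physical-read frontier.

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define INVALID_BLOCK ((BlockNumber) 0xFFFFFFFF)

    typedef struct SyncScanState
    {
        volatile BlockNumber *slot;  /* shared hint slot for this relation */
        int                   i_am_leader;
    } SyncScanState;

    static void
    scan_start(SyncScanState *s, volatile BlockNumber *slot, BlockNumber nblocks)
    {
        s->slot = slot;
        /* No usable hint yet: assume we are the first (leading) scan. */
        s->i_am_leader = (*slot == INVALID_BLOCK || *slot >= nblocks);
    }

    static void
    report_block(SyncScanState *s, BlockNumber blkno, int found_in_cache)
    {
        if (found_in_cache || s->i_am_leader)
            *s->slot = blkno;
    }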
> Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? 1) How about 2 location hints, one for the actual reads and one for the followers? The actual read() case will write to the one, the followers to the other. There could be some magic to make a follower which is getting close to the pack leader report less often, so that the follower hint still covers cached buffers far enough behind (tricky business to tune this) and doesn't skip them immediately. Then some magic to figure out which one to follow... or even try both (at the start of the scan only), and go with the one you had no actual read() on, or the smaller one if both had a read() or both didn't. 2) Make the pack leader (as in, it had to do an actual read()) report less often than the followers (say every 100 read()s, while followers report every 10 -- numbers made up, no idea what would work best here). This will make the follower hint more likely to be picked up by new scans, thus better using the existing cache... Cheers, Csaba.
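A rough sketch of Csaba's two-hint layout, again with invented names, and with the made-up 100/10 report intervals from option 2 folded in. How a newly starting scan picks between the two slots is left open, as in the mail. BlockNumber and the stdbool/stdint includes are as in the earlier sketches.

    /* Two hint slots per table, as in Csaba's option 1 (hypothetical layout). */
    typedef struct DualScanHint
    {
        uint32_t    rel_oid;
        BlockNumber leader_block;       /* written when a scan had to read() */
        BlockNumber follower_block;     /* written when the block was cached */
    } DualScanHint;

    /* Made-up intervals: the leader reports rarely, followers report often. */
    #define LEADER_REPORT_EVERY    100
    #define FOLLOWER_REPORT_EVERY   10

    static void
    report_dual(DualScanHint *hint, BlockNumber blockno,
                bool found_in_cache, unsigned long blocks_scanned)
    {
        if (found_in_cache)
        {
            if (blocks_scanned % FOLLOWER_REPORT_EVERY == 0)
                hint->follower_block = blockno;
        }
        else
        {
            if (blocks_scanned % LEADER_REPORT_EVERY == 0)
                hint->leader_block = blockno;
        }
    }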
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> Even if there are 50 in the pack, and 2 trailing, at any point in time >> it's more likely that the last block number reported was reported by a >> trailing scan. The pack might all report at once when they finally get >> the block, but will be promptly overwritten by the continuous stream of >> reports from trailing scans. > >> However, if my analysis was really true, one might wonder how those >> scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? What if there were two block numbers (last_disk_read_blocknr and last_cache_read_blocknr) stored per table, together with a timestamp telling when the last update occurred? Values older than, let's say, a second or so would be treated as "outdated". If last_cache_read_blocknr isn't outdated, it would be used as a starting point for seqscans; otherwise last_disk_read_blocknr would be used if that one isn't outdated. If both are outdated, the scan would start at the lower of the two block numbers. greetings, Florian Pflug
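Florian's two-timestamp scheme might look something like this. The struct is made up, the one-second cutoff is the figure from his mail, BlockNumber is the typedef from the earlier sketches, and locking and clock granularity are ignored.

    #include <stdbool.h>
    #include <time.h>

    typedef struct TimedScanHint
    {
        BlockNumber last_disk_read_blocknr;
        BlockNumber last_cache_read_blocknr;
        time_t      disk_stamp;
        time_t      cache_stamp;
    } TimedScanHint;

    #define HINT_TTL_SECONDS 1      /* "a second or so", per the mail */

    static BlockNumber
    choose_start_block(const TimedScanHint *h)
    {
        time_t now = time(NULL);
        bool cache_fresh = (now - h->cache_stamp) <= HINT_TTL_SECONDS;
        bool disk_fresh  = (now - h->disk_stamp)  <= HINT_TTL_SECONDS;

        if (cache_fresh)
            return h->last_cache_read_blocknr;
        if (disk_fresh)
            return h->last_disk_read_blocknr;

        /* Both outdated: start at the lower of the two block numbers. */
        return (h->last_disk_read_blocknr < h->last_cache_read_blocknr)
            ? h->last_disk_read_blocknr
            : h->last_cache_read_blocknr;
    }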
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 13:25 -0500, Tom Lane wrote: >> Gregory Stark <stark@enterprisedb.com> writes: >> > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> >> Sure, it should hang around for awhile, and will. The problem is that >> >> its lifetime will be artificially inflated, so that the seqscan ends up >> >> kicking out other blocks that are really of greater importance, rather >> >> than recycling its own old blocks as it should. >> >> > I thought you had switched this all to a clock sweep algorithm. >> >> Yeah ... it's a clock sweep with counter. A buffer's counter is >> incremented by each access and decremented when the sweep passes over >> it. So multiple accesses allow the buffer to survive longer. For a >> large seqscan you really would rather the counter stayed at zero, >> because you want the buffers to be recycled when the sweep comes back >> the first time. > > If you focus the backends together by synchronizing them you definitely > also then need to solve the problem of false cache reinforcement. Well the clock sweep algorithm is inherently resistant to this. The classical algorithm only had a single bit to indicate whether the buffer had been touched since hand came around last. If you have a counter you just need to limit the size of that counter. The larger the maximum value of the counter the slower the algorithm will be to learn new access patterns. The smaller the counter is the shorter its memory will be. I do vaguely recall discussing having a counter in school but IIRC it was just a two bit counter which the clock sweep shifted right. So a buffer hit by a synchronized would require two sweeps to get flushed instead of one but that would be true regardless of how many synchronized scans hit it. I think the no-counter (ie, single bit of state) is probably exactly what you want in this case. If there are synchronized scans running there may be more coming along behind you. You do want to keep those buffers around long enough to be used by them. If there aren't any synchronized scans running then you probably want to just keep reusing the same buffers which is what will happen when the hand comes around the next time if nobody else has looked at. It may be enough to just cap the use-counter at different values when you increment it depending on the type of access. In a sequential scan don't increment it beyond 1 (ie, just a single bit of state). In a random I/O increment it up to 2. It may also be useful to have different initial use values. Or maybe it makes more sense to look at initial values. If you load a buffer in a sequential scan perhaps its use count could start at 0 whereas random i/o could start at 1. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Csaba Nagy" <nagy@ecircle-ag.com> writes: > 1) How about 2 location hints, one for the actual reads, and one for the > followers ? Having the location of the pack leader may be useful for another reason too: It would give the trailers a chance to avoid competing with the pack leader. I would expect no matter how far back they start they'll quickly catch up to the pack leader since they're only doing cached reads and the pack leader is having to do actual i/o. When they catch up they're going to compete for the lead which may a) kill the kernel read-ahead and b) cause contention. If they know where the oldest trailer is and where the leader is then when they catch up they could sleep for a small amount of time to give the leader a chance to make progress. I would suggest they should only sleep once and if they catch up again they should try to take over and become the new leader. If the leader ever finds its reading a cached block and someone else has passed it it should switch to follower behaviour. That way the leadership does change hands periodically but not on every read. Just in case the first leader is, say, running a nested loop or something we don't want the follower to be unnecessarily limited to its speed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Matthew O'Connor wrote: > > Could we have a counter in shared memory that keeps a count on the > number seq scanners currently working on a table? While count = 1, we > report all blocks regardless if it's from disk or from buffer. Once > count > 1 we only report buffer reads. Would this have locking problems? I don't know, but it would require some extra care to make sure that the counter is decremented when, for example, a transaction is aborted or a backend is killed. > BTW, it seems to me that this is all based on the assumption that > followers will have no problem keeping up with the pack leader. Suppose > my process does a lot of other processing and can't keep up with the > pack despite the fact that it's getting all it's data from the buffer. > Now we have effectively have two different seq scans going on. Does my > process need to recognize that it's not keeping up and not report it's > blocks? That's what I was wondering about all these schemes as well. What we could do is, instead of doing a sequential scan in strict order, have each backend keep a bitmap of the pages it has processed during the scan and read the pages in the order they're available in cache. If a backend misses a page in the synchronized scan, for example, it could issue a random I/O after reading all the other pages to fetch it separately, instead of starting another seq scan at a different location and "derailing the train". I don't know what the exact algorithm used to make decisions on when and how to fetch each page would be, but the bitmaps would be in backend private memory. And presumably it could be used with bitmap heap scans as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
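A sketch of the per-backend bitmap Heikki mentions, with placeholder types (BlockNumber as before) and no error handling. The policy for when to fall back to individual fetches is exactly the part the mail leaves open; the sweep at the end is just one simple possibility.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Private, per-backend bitmap of pages already processed (sketch only). */
    typedef struct ScanBitmap
    {
        unsigned char *bits;
        BlockNumber    nblocks;
    } ScanBitmap;

    static void
    bitmap_init(ScanBitmap *bm, BlockNumber nblocks)
    {
        bm->nblocks = nblocks;
        bm->bits = calloc((nblocks + 7) / 8, 1);    /* NULL check omitted */
    }

    static void
    bitmap_mark(ScanBitmap *bm, BlockNumber blk)
    {
        bm->bits[blk / 8] |= (unsigned char) (1 << (blk % 8));
    }

    static bool
    bitmap_done(const ScanBitmap *bm, BlockNumber blk)
    {
        return (bm->bits[blk / 8] >> (blk % 8)) & 1;
    }

    /*
     * After following the pack, fetch whatever we missed with individual
     * reads instead of restarting a second sequential scan and "derailing
     * the train".
     */
    static void
    fetch_missed_pages(ScanBitmap *bm, void (*read_block)(BlockNumber))
    {
        for (BlockNumber blk = 0; blk < bm->nblocks; blk++)
            if (!bitmap_done(bm, blk))
                read_block(blk);
    }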
>>> On Thu, Dec 7, 2006 at 4:57 AM, in message <874ps8km81.fsf@enterprisedb.com>, Gregory Stark <stark@enterprisedb.com> wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: >> >> If you focus the backends together by synchronizing them you definitely >> also then need to solve the problem of false cache reinforcement. > > Well the clock sweep algorithm is inherently resistant to this. The > classical > algorithm only had a single bit to indicate whether the buffer had been > touched since hand came around last. > > If you have a counter you just need to limit the size of that counter. The > larger the maximum value of the counter the slower the algorithm will be to > learn new access patterns. The smaller the counter is the shorter its memory > will be. > > I do vaguely recall discussing having a counter in school but IIRC it was > just > a two bit counter which the clock sweep shifted right. So a buffer hit by a > synchronized would require two sweeps to get flushed instead of one but that > would be true regardless of how many synchronized scans hit it. > > I think the no-counter (ie, single bit of state) is probably exactly what you > want in this case. If there are synchronized scans running there may be more > coming along behind you. You do want to keep those buffers around long > enough > to be used by them. If there aren't any synchronized scans running then you > probably want to just keep reusing the same buffers which is what will > happen > when the hand comes around the next time if nobody else has looked at. > > It may be enough to just cap the use-counter at different values when you > increment it depending on the type of access. In a sequential scan don't > increment it beyond 1 (ie, just a single bit of state). In a random I/O > increment it up to 2. It may also be useful to have different initial use > values. Or maybe it makes more sense to look at initial values. If you load > a > buffer in a sequential scan perhaps its use count could start at 0 whereas > random i/o could start at 1. For what little it's worth, the algorithm which tested out best for me in a similar situation in a database I wrote, which was used in production environments for decades, was to use a 16-bit use counter which was effectively multiplied by 0.75 (with shifts and an add) in a sweep triggered by an I/O count. I know we tried different multiples of the buffer slot count for the trigger point, to find the sweet spot; unfortunately, I don't remember where that was. This seemed to provide pretty good protection against transient activity flushing the buffer. -Kevin
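The decay Kevin describes is easy to sketch: multiplying by 0.75 with shifts and an add is just count/2 + count/4. The sweep trigger below is a placeholder, since the "sweet spot" multiple of the buffer slot count is exactly the number he says he no longer remembers.

    #include <stdint.h>

    /* count * 0.75 == (count >> 1) + (count >> 2), done with shifts and an add. */
    static inline uint16_t
    decay_use_count(uint16_t count)
    {
        return (uint16_t) ((count >> 1) + (count >> 2));
    }

    /* Sweep all counters once every N physical reads; N is a placeholder
     * standing in for the tuned multiple of the buffer slot count. */
    #define SWEEP_EVERY_N_READS 4096

    static void
    count_io_and_maybe_decay(uint16_t *use_counts, int nbuffers,
                             unsigned long *io_count)
    {
        if (++(*io_count) % SWEEP_EVERY_N_READS != 0)
            return;

        for (int i = 0; i < nbuffers; i++)
            use_counts[i] = decay_use_count(use_counts[i]);
    }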
On Thu, 2006-12-07 at 00:46 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > it's more likely that the last block number reported was reported by a > > trailing scan. The pack might all report at once when they finally get > > the block, but will be promptly overwritten by the continuous stream of > > reports from trailing scans. > > > However, if my analysis was really true, one might wonder how those > > scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? > My initial thought is that eliminating the immediate pack leader won't do much good, because there's a good chance at least one scan is following very closely (we would hope). Also, there's a lot of focus here on not starting where the pack leader is. The assumption is that the follower will be closer to the end of a cache trail. Let's attack the problem more directly by taking the hint and subtracting a number before choosing it as the start location. Let's call the number M, which would be a fraction of the effective_cache_size. Each scan would issue no reports at all until the current page minus the starting page number for that scan was greater than M. If we choose M conservatively, there's a very high chance that there will exist a continuous stream of cached pages between where the new scan starts (the hint - M) and the head of the pack. This keeps the code very simple. I could modify my patch to do this in a few minutes. However, I do agree that it's always worth thinking about ways to use the information that we do have about what's in cache at higher layers; and also useful if higher layers can tell the caching layers what to cache or not. Regards,Jeff Davis
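A sketch of the offset rule in Jeff's mail: start M blocks behind the reported hint and stay silent until the scan has covered M blocks of its own. Deriving M from effective_cache_size is the idea from the mail; the wraparound handling and names here are only illustrative and may not match the patch.

    #include <stdbool.h>

    /* Start M blocks behind the hint, wrapping around the start of the table. */
    static BlockNumber
    choose_start(BlockNumber hint, BlockNumber offset_m, BlockNumber nblocks)
    {
        if (offset_m >= nblocks)
            return 0;                       /* table smaller than the offset */
        if (hint >= offset_m)
            return hint - offset_m;
        return nblocks - (offset_m - hint); /* wrap backwards past block 0 */
    }

    /* Issue no reports until we have moved more than M blocks past our start. */
    static bool
    allowed_to_report(BlockNumber current, BlockNumber start,
                      BlockNumber offset_m, BlockNumber nblocks)
    {
        BlockNumber progressed = (current >= start)
            ? current - start
            : nblocks - start + current;    /* scan has wrapped */

        return progressed > offset_m;
    }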
On Thu, 2006-12-07 at 00:32 -0600, Jim C. Nasby wrote: > On Thu, Dec 07, 2006 at 12:46:34AM -0500, Tom Lane wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > > it's more likely that the last block number reported was reported by a > > > trailing scan. The pack might all report at once when they finally get > > > the block, but will be promptly overwritten by the continuous stream of > > > reports from trailing scans. > > > > > However, if my analysis was really true, one might wonder how those > > > scans got that far behind in the first place. > > > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > > to provide an indication whether it had to issue an actual read() or > > found the block in cache? Could it be useful to not report the block > > location to the hint area if we had to actually read()? That would > > eliminate the immediate "pack leader" from the equation. The problem > > is that it seems to break things for the case of the first follower > > joining a seqscan, because the original leader would never report. > > Anyone see the extra idea needed to make this work? > > The first reader won't find an entry in shared memory, so it will know > it's the first. It could then either always update, or it could check to That requires that scans clear the hint when they finish, right? I don't think that would be difficult, but my current patch doesn't do that. > see if anyone else has updated shared memory behind it's back; at that > point it could switch to only updating on an actual read. Interesting idea. Regards,Jeff Davis
On Thu, Dec 07, 2006 at 04:14:54PM +0000, Heikki Linnakangas wrote: > >BTW, it seems to me that this is all based on the assumption that > >followers will have no problem keeping up with the pack leader. Suppose > >my process does a lot of other processing and can't keep up with the > >pack despite the fact that it's getting all it's data from the buffer. > >Now we have effectively have two different seq scans going on. Does my > >process need to recognize that it's not keeping up and not report it's > >blocks? > > That's what I was wondering about all these schemes as well. What we > could do, is that instead of doing a sequential scan, each backend keeps > a bitmap of pages it has processed during the scan, and read the pages > in the order they're available in cache. If a backend misses a page in > the synchronized scan, for example, it could issue a random I/O after > reading all the other pages to fetch it separately, instead of starting > another seq scan at a different location and "derailing the train". I > don't know what the exact algorithm used to make decisions on when and > how to fetch each page would be, but the bitmaps would be in backend > private memory. And presumably it could be used with bitmap heap scans > as well. First, I think that we're getting ahead of ourselves by worrying about how to deal with diverging scans right now. Having said that... There are 3 ranges of seqscan speeds we need to be concerned with [1], with 2 break-points between them:

1) Seqscan is I/O bound; it can completely keep up with incoming blocks
   -- break-point: maximum single-scan rate (the I/O system is saturated)
2) Seqscan is CPU-bound if there's nothing else competing for I/O
   -- break-point: maximum double-scan rate (the best that 2 scans in different places can achieve)
3) Seqscan is slower than 2 competing category 1 scans are today

[1] I'm assuming only 1 slow scan and any number of fast scans. If there are actually 2 slow scans, things will differ.

If every scan is running in the first category, then there is no issue. Since all scans are completely I/O bound, they'll all process blocks as soon as they're in from the drive. If there is a scan in the 3rd category, we don't want to try and hold faster scans down to the rate of the slow scan, because that's actually slower than if we let a synchronized set of fast scans compete with the slow scan, even though there's a lot of seeking involved. The problem is if we have a scan in the 2nd category. As soon as it falls far enough behind that the system is forced to issue physical reads to disk for it, the performance of all the scans will plummet. And as the slow scan falls further and further behind, seek times will get longer and longer. To put real numbers to this, on my machine, having 2 dd's running on one file far enough apart that caching isn't helping cuts the rate from ~35MB/s to ~31MB/s (incidentally, starting a second dd about 10 seconds after the first gives the second scan a rate of 40MB/s). So on my machine, if a slow scan is doing more than 22MB/s, it's better to restrict all scans to its speed rather than have 2 divergent scans. OTOH, if the slow scan is doing less than 22MB/s, it's better not to hold the fast scans back. Of course, having people run that check themselves might not be terribly practical, and if there's more than one slow scan involved, those measurements are likely to be meaningless. So I'm wondering if there's some technique we can use that will make this selection process more 'automatic'.
The only thought I've come up with is to apply a delay to every read from the fast scans if there is a slow scan that's in danger of falling past the cache size. My theory is that if the delay is set right, then a category 2 scan will eventually catch back up, at which point we could either just let it drive the scan, or we could un-throttle the fast scans until the slow scan is in danger of falling behind again. If instead the slow scan is category 3, then even with the delay it will still fall behind. If the database can detect that situation, it can then un-throttle the fast scans and just let things diverge. Ideally, if another category 3 scan came along, we would sync it to the existing cat 3 scan. Unfortunately, that still means needing to come up with what that delay setting should be. Perhaps if we're lucky, there's a pretty fixed relationship between the maximum speed of a scan and the speed of two competing scans. If that's the case, I think we could set the delay to be a fraction of the average length of a scan. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
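A very rough sketch of the throttle described above. Every number here (cache window, danger fraction, delay) is invented, and the category-3 detection is simplified to "the gap kept growing even while we were throttled"; a real version would need the tuning discussed in the mail.

    #include <stdbool.h>
    #include <unistd.h>

    /* All of these numbers are invented for illustration. */
    #define CACHE_WINDOW_BLOCKS   100000   /* roughly the OS cache, in blocks */
    #define DANGER_FRACTION       0.8      /* throttle at 80% of the window */
    #define FAST_SCAN_DELAY_USEC  200      /* pause added to each fast-scan read */

    static bool throttled = false;
    static bool gave_up   = false;  /* set once the slow scan is deemed category 3 */

    /*
     * Called by a fast scan once in a while with the current and previous
     * gap (in blocks) between itself and the slowest scan.
     */
    static void
    maybe_throttle_fast_scan(long gap_now, long gap_at_last_check)
    {
        bool in_danger = gap_now > (long) (CACHE_WINDOW_BLOCKS * DANGER_FRACTION);

        if (gave_up)
            return;                          /* scans have been allowed to diverge */

        if (in_danger && throttled && gap_now > gap_at_last_check)
        {
            /* Still losing ground despite the delay: treat the slow scan as
             * category 3, stop throttling, and let the scans diverge. */
            throttled = false;
            gave_up = true;
            return;
        }

        throttled = in_danger;
        if (throttled)
            usleep(FAST_SCAN_DELAY_USEC);    /* slow down so the laggard can catch up */
    }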