Thread: old synchronized scan patch
Now that 8.3 is open, I was considering a revival of this old patch: http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php I could probably clean it up with a little help from someone on this list. Advantages of this patch: If multiple seq scans are going in parallel on the same table, it can drastically increase the cache hit rate by synchronizing the scans. In theory, it should also prevent the problem where sequential scans can turn into random access from the disk. Disadvantages: * sequential scans no longer return results in a deterministic order. As I understand it, this has effects on the regression tests and possibly a few other locations in the code. * While it should improve performance quite a lot in its use case, that use case is quite narrow: you need to be running sequential scans that are reading tuples from the same table at the same time. That means the table can't fit into memory, but must be small enough that it is sane to be executing multiple sequential scans on it in parallel. Is there some interest in this patch? How would I go about proving whether it's useful enough or not? Regards, Jeff Davis
Jeff, On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: > Now that 8.3 is open, I was considering a revival of this old patch: > > http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > I could probably clean it up with a little help from someone on this > list. > > Advantages of this patch: If multiple seq scans are going in parallel on > the same table, it can drastically increase cache hit rate by > synchronizing the scans. In theory, it should also prevent the problem > where sequential scans can turn into random access from the disk. > > Disadvantages: > * sequential scans no longer return results in a deterministic order. As > I understand it, this has effects on the regression tests and possibly a > few other locations in the code. > * While the in the use case, it should improve performance quite a lot. > However, the use case is quite narrow: you need to be running sequential > scans that are reading tuples from the same table at the same time. That > means the table can't fit into memory, but must be small enough that it > is sane to be executing multiple sequential scans on it in parallel. > > Is there some interest in this patch? Yes. For a scenario of interest let's assume: 1 - the Postgres cache is "scan resistant" such that a seq scan of a large table is not mapped into the block cache. 2 - the O/S cache is not "scan resistant" so that it will map all blocks read through the cache 3 - there is I/O prefetch in the OS that keeps the I/O subsystem efficient under multi-user sequential scanning of different tables/files With (2), if you issue 5 queries that all seq scan a table that is much larger than memory, they will complete in nearly the time of one query. I've tested this up to 1,000 simultaneous queries on Linux and get huge benefits. This scenario is only sometimes helpful in real production usage. Given (2), I think sync scan would only have a multi-user benefit when the scans occur in periods separated by more than the amount of time it takes to scan a cache size worth of data from disk. For example: If there is 16GB of RAM, the OS I/O cache might have room for 14GB of blocks. On a good (ahem) system it might take 14 seconds to scan that amount of data into cache from disk (yes, 1GB/s). If a new backend begins scanning after that 14 seconds, it won't benefit from the data in the cache because the new data from disk replaces the oldest data in cache and so on. Since we're talking O(seconds) separation between user queries, I think we could have a benefit. Where I think sync scan could have a big benefit is for multi-user business intelligence workloads where there are a few huge fact tables of interest to a wide audience. Example: 5 business analysts come to work at 9AM and start ad-hoc queries expected to run in about 15 minutes each. Each query sequential scans a 10 billion row fact table once, which takes 10 minutes of the query runtime. With sync scan the last one completes in 35 minutes. Without sync scan the last completes in 75 minutes. In this case sync scan significantly improves the experience of 5 people. > How would I go about proving whether it's useful enough or not? Can you run the above scenario on a table whose size is ten times the memory on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM BIGTABLE" should be sufficient, but the scans need to be separated by enough time to invalidate the OS I/O cache. - Luke
On Mon, 2006-12-04 at 15:03 -0800, Luke Lonergan wrote: > Jeff, > > Now that 8.3 is open, I was considering a revival of this old patch: > > > > http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > > > I could probably clean it up with a little help from someone on this > > list. > > > > > > Is there some interest in this patch? > > Yes. > <snip> > Where I think sync scan could have a big benefit is for multi-user business > intelligence workloads where there are a few huge fact tables of interest to > a wide audience. Example: 5 business analysts come to work at 9AM and start > ad-hoc queries expected to run in about 15 minutes each. Each query > sequential scans a 10 billion row fact table once, which takes 10 minutes of > the query runtime. With sync scan the last one completes in 35 minutes. > Without sync scan the last completes in 75 minutes. In this case sync scan > significantly improves the experience of 5 people. > Thank you for your input. > > How would I go about proving whether it's useful enough or not? > > Can you run the above scenario on a table whose size is ten times the memory > on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM > BIGTABLE" should be sufficient, but the scans need to be separated by enough > time to invalidate the OS I/O cache. > I'll try to run a test like that this week. I will be doing this on my home hardware (bad, consumer-grade stuff), so if I gave you a patch against HEAD could you test it against some more real hardware (and data)? To open up the implementation topic: My current patch starts a new sequential scan on a given relation at the page of an already-running scan. It makes no guarantees that the scans stay together, but in practice I don't think they deviate much. To try to enforce synchronization of scanning I fear would do more harm than good. Thoughts? Also, it's more of a "hint" system that uses a direct mapping of the relation's Oid to hold the position of the scan. That means that, in rare cases, the page offset could be wrong, in which case it will degenerate to the current performance characteristics with no cost. The benefit of doing it this way is that it's simple code, with essentially no performance penalty or additional locking. Also, I can use a fixed amount of shared memory (1 page is about right). Regards, Jeff Davis
"Luke Lonergan" <llonergan@greenplum.com> writes: > On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: >> Now that 8.3 is open, I was considering a revival of this old patch: >> http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > Where I think sync scan could have a big benefit is for multi-user business > intelligence workloads where there are a few huge fact tables of interest to > a wide audience. Example: 5 business analysts come to work at 9AM and start > ad-hoc queries expected to run in about 15 minutes each. The problem I've got with this is that the use-case seems a bit narrow (maybe not to Greenplum, but compared to the general Postgres audience) whereas the proposed patch is invasive in terms of changing well-established behavior, and what's worse, potentially *reducing* performance for average workloads. In particular I'm concerned about the shared-memory area where the scan status is stored becoming a contention bottleneck. It would presumably have access demand comparable to the bufmgr, ie one hit per page scanned, and we already know that it's extremely hard to keep bufmgr from being a major source of contention. One thing you might consider doing is to have the patch do nothing when scanning tables that are less than, perhaps, 10x shared_buffers long. This would at least avoid the regression-test-breakage problem (at the cost of not having the patch tested at all by said tests :-(). Anyway I think the major stumbling block is to be able to show that the patch has only negligible performance impact in cases where it's not able to provide a benefit. regards, tom lane
On Mon, 2006-12-04 at 18:45 -0500, Tom Lane wrote: > "Luke Lonergan" <llonergan@greenplum.com> writes: > > On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote: > >> Now that 8.3 is open, I was considering a revival of this old patch: > >> http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php > > > Where I think sync scan could have a big benefit is for multi-user business > > intelligence workloads where there are a few huge fact tables of interest to > > a wide audience. Example: 5 business analysts come to work at 9AM and start > > ad-hoc queries expected to run in about 15 minutes each. > > Anyway I think the major stumbling block is to be able to show that the > patch has only negligible performance impact in cases where it's not > able to provide a benefit. My patch was specifically designed with this in mind, and unless I'm missing something, should have negligible performance impact. The initial implementation directly maps the Oid of the relation to a location in a shared-memory table (not table as in relation, but table as in area of memory). Then, each time a sequential scan fetches a page from that relation, it stores that page offset. When a scan starts, it checks that number to see if it's in the valid range for the relation's file, and if it is, that's where it starts the scan in the file. If it's not in the valid range, it starts at offset 0. Since I am not storing any pointers, and since the information is only really a hint, I don't need to do any locking on that page. In the case of a collision in the in-memory table, or if the offset in the relation overflows the data type used to store it, it should be no worse than current behavior. Regards, Jeff Davis
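To make the mechanism Jeff describes concrete, here is a minimal standalone sketch in C. The names (scan_hints, ss_report_location, ss_get_location, HINT_SLOTS) are invented for illustration and are not the actual patch; the point is only that both the report and the lookup are single unlocked 4-byte stores and loads into a fixed-size array indexed by a hash of the relation's Oid, so a stale or colliding entry can at worst send a new scan to an arbitrary (but still valid) starting block.

    /* Illustrative sketch only, not the actual patch. */
    #include <stdint.h>

    typedef uint32_t Oid;
    typedef uint32_t BlockNumber;

    #define HINT_SLOTS 1024              /* roughly one 8 KB page of entries */

    /* One shared-memory array, directly indexed by a hash of the table's Oid.
     * Reads and writes are plain 4-byte loads/stores with no locking, so a
     * slot may occasionally hold a stale or colliding value. */
    static BlockNumber scan_hints[HINT_SLOTS];   /* imagine this in shared memory */

    static unsigned
    hint_slot(Oid relid)
    {
        return relid % HINT_SLOTS;       /* Jeff: "right now I just mod by the table size" */
    }

    /* Called by a running seq scan after it reads a page. */
    static void
    ss_report_location(Oid relid, BlockNumber blkno)
    {
        scan_hints[hint_slot(relid)] = blkno;
    }

    /* Called when a new seq scan starts; returns the block to start at. */
    static BlockNumber
    ss_get_location(Oid relid, BlockNumber rel_nblocks)
    {
        BlockNumber hint = scan_hints[hint_slot(relid)];

        /* The value is only a hint: if it is out of range (collision, stale
         * entry, truncated relation), fall back to starting at block 0. */
        if (hint >= rel_nblocks)
            return 0;
        return hint;
    }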
Jeff, On 12/4/06 3:38 PM, "Jeff Davis" <pgsql@j-davis.com> wrote: > I'll try to run a test like that this week. I will be doing this on my > home hardware (bad, consumer-grade stuff), so if I gave you a patch > against HEAD could you test it against some more real hardware (and > data)? Sure. > To open up the implementation topic: > > My current patch starts a new sequential scan on a given relation at the > page of an already-running scan. It makes no guarantees that the scans > stay together, but in practice I don't think they deviate much. To try > to enforce synchronization of scanning I fear would do more harm than > good. Thoughts? I think this is good enough - a "background scanner" approach would be the logical alternative, but may not be necessary. I suspect you are correct about the scans being nearly synced. This may not be the case if and when we implement a priority based scheduler, but in that case we're already managing the throughput based on content anyway. > Also, it's more of a "hint" system that uses a direct mapping of the > relations Oid to hold the position of the scan. That means that, in rare > cases, the page offset could be wrong, in which case it will degenerate > to the current performance characteristics with no cost. The benefit of > doing it this way is that it's simple code, with essentially no > performance penalty or additional locking. Also, I can use a fixed > amount of shared memory (1 page is about right). If I understand correctly, Tom's concern is that this page is potentially accessed once for every page read and may consequently become very hot. How do you manage the scan position so this doesn't happen? Do we have any readahead in the I/O layer? There is certainly readahead in the OS I/O cache, but it's dynamic and we don't know the block position... - Luke
On Mon, 2006-12-04 at 16:47 -0800, Luke Lonergan wrote: > Jeff, > > My current patch starts a new sequential scan on a given relation at the > > page of an already-running scan. It makes no guarantees that the scans > > stay together, but in practice I don't think they deviate much. To try > > to enforce synchronization of scanning I fear would do more harm than > > good. Thoughts? > > I think this is good enough - a "background scanner" approach would be the > logical alternative, but may not be necessary. I suspect you are correct > about the scans being nearly synced. > A background scanner requires synchronizing the backends, which can cause all kinds of bad performance problems. Otherwise, how would you ensure that ALL the backends that need a page get it before the train moves on? I don't think it's necessary or desirable to have a background scanner. > This may not be the case if and when we implement a priority based > scheduler, but in that case we're already managing the throughput based on > content anyway. > Right. I don't see how you'd be able to get the data to a backend that needs it without running that backend. If it's a priority scheduler, you may not run that backend. > > Also, it's more of a "hint" system that uses a direct mapping of the > > relations Oid to hold the position of the scan. That means that, in rare > > cases, the page offset could be wrong, in which case it will degenerate > > to the current performance characteristics with no cost. The benefit of > > doing it this way is that it's simple code, with essentially no > > performance penalty or additional locking. Also, I can use a fixed > > amount of shared memory (1 page is about right). > > If I understand correctly, Tom's concern is that this page is potentially > accessed once for every page read and may consequently become very hot. How > do you manage the scan position so this doesn't happen? Do we have any > readahead in the I/O layer? There is certainly readahead in the OS I/O > cache, but it's dynamic and we don't know the block position... > My proposal is hint-based, and I consider any reads coming from that data page to be untrusted. I can just statically map the Oid of the relation onto a location in the page (which may or may not collide with another Oid). The advantage here is that I don't need to lock. There's no contention because every access is just reading a word or two from the shared page (or writing a word or two). Of course, it must be a static data structure because it can't contain pointers. But 8KB should be enough to hold information on plenty of interesting relations. I don't trust the data because collisions are possible. If the number is obviously wrong (higher than the number of pages in the relation file), a new scan starts at offset 0. If it's wrong but that can't be detected, it will start at that location anyway, which is OK because that arbitrary value is no worse than the arbitrary value of 0. The whole premise of my implementation is: (1) Get 99% of the benefit (2) Pay near-zero performance penalty, even in the worst case. (3) Simple code changes. If those 3 things aren't true, let me know. The way I see it, the only real cost is that it may break things that assume deterministic sequential scans. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Since I am not storing any pointers, and since the information is only > really a hint, I don't need to do any locking on that page. If you think that, you need not bother to submit the patch. (Hint: as soon as you consider more than one table at a time, it doesn't work, even ignoring the question of inconsistent reads.) regards, tom lane
On Mon, 2006-12-04 at 21:46, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Since I am not storing any pointers, and since the information is only > > really a hint, I don't need to do any locking on that page. > > If you think that, you need not bother to submit the patch. (Hint: > as soon as you consider more than one table at a time, it doesn't work, > even ignoring the question of inconsistent reads.) Why does it not work? Are you suggesting that another backend can somehow see only some bits of a page number being written? What problems do you see in the multiple-table case? -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
> To open up the implementation topic: > > My current patch starts a new sequential scan on a given relation at the > page of an already-running scan. I think it should start earlier to make use of the already cached part of the table. It is hard to determine, or guess how much is still in cache though. As a start you could maybe use some (small to be safe) fraction of effective_cache_size or shared_buffers. Andreas
Hannu Krosing wrote: > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: >> Jeff Davis <pgsql@j-davis.com> writes: >>> Since I am not storing any pointers, and since the information is only >>> really a hint, I don't need to do any locking on that page. >> If you think that, you need not bother to submit the patch. (Hint: >> as soon as you consider more than one table at a time, it doesn't work, >> even ignoring the question of inconsistent reads.) > > Why does it not work ? > > Are you suggesting, that another backend can somegow see only some bits > of page number being written ? > > What problems do you see in multiple table case ? You need to manage adding and removing relations from the shared memory structure. Which typically needs locking. Assuming that relations are added or removed relatively seldom, you might get away with a table of (Oid, BlockNumber) pairs, working around the fact that the table might get messed up every now and then, and when it does, you'll lose the benefits until it gets corrected. But it seems really messy to me. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2006-12-05 at 10:38, Heikki Linnakangas wrote: > Hannu Krosing wrote: > > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: > >> Jeff Davis <pgsql@j-davis.com> writes: > >>> Since I am not storing any pointers, and since the information is only > >>> really a hint, I don't need to do any locking on that page. > >> If you think that, you need not bother to submit the patch. (Hint: > >> as soon as you consider more than one table at a time, it doesn't work, > >> even ignoring the question of inconsistent reads.) > > > > Why does it not work ? > > > > Are you suggesting, that another backend can somegow see only some bits > > of page number being written ? > > > > What problems do you see in multiple table case ? > > You need to manage adding and removing relations from the shared memory > structure. Which typically needs locking. My impression was that Jeff planned to do simple hashing to one of N values and to write the current page number there, maybe not even every page number but each M-th page number. Assuming that writing a 4-byte page number is an atomic operation, there is no need for locking. > Assuming that relations are added or removed relatively seldom, you > might get away with a table of (Oid, BlockNumber) pairs, working around > the fact that the table might get messed up every now and then, and when > it does, you'll lose the benefits until it gets corrected. But it seems > really messy to me. Just storing the page number at an offset is cheap (and imho nice and clean too). The worst that can happen is a hash collision, in which case you lose the benefits of sync scans, but you won't degrade compared to non-sync scans. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Hannu Krosing wrote: > The worst that can happen, is a hash collision, in which case you lose > the benefits of sync scans, but you wont degrade compared to non-sync > scans But it could cause "mysterious" performance regressions, no? Imagine that your app includes two large tables, which are both scanned frequently. Suppose that synchronous scanning gives this use-case a noticeable performance boost. Now, you dump and reload your schema, and suddenly the hashes of the oids of those tables collide. You perceive a noticeable drop in performance that you can neither explain nor fix without a rather deep understanding of postgres internals. greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Hannu Krosing wrote: >> The worst that can happen, is a hash collision, in which case you lose >> the benefits of sync scans, but you wont degrade compared to non-sync >> scans > But it could cause "mysterious" performance regressions, no? There are other issues for the "no lock" approach that Jeff proposes. Suppose that we have three or four processes that are actually doing synchronized scans of the same table. They will have current block numbers that are similar but probably not identical. They will all be scribbling on the same hashtable location. So if another process comes along to join the "pack", it might get the highest active block number, or the lowest, or something in between. Even discounting the possibility that it gets random bits because it managed to read the value non-atomically, how well is that really going to work? Another thing that we have to consider is that the actual block read requests will likely be distributed among the "pack leaders", rather than all being issued by one process. AFAIK this will destroy the kernel's ability to do read-ahead, because it will fail to recognize that sequential reads are being issued --- any single process is *not* reading sequentially, and I think that read-ahead scheduling is generally driven off single-process behavior rather than the emergent behavior of the whole system. (Feel free to contradict me if you've actually read any kernel code that does this.) It might still be better than unsynchronized reads, but it'd be leaving a lot on the table. regards, tom lane
Tom Lane wrote: > "Florian G. Pflug" <fgp@phlo.org> writes: >> Hannu Krosing wrote: >>> The worst that can happen, is a hash collision, in which case you lose >>> the benefits of sync scans, but you wont degrade compared to non-sync >>> scans > >> But it could cause "mysterious" performance regressions, no? > > There are other issues for the "no lock" approach that Jeff proposes. > Suppose that we have three or four processes that are actually doing > synchronized scans of the same table. They will have current block > numbers that are similar but probably not identical. They will all be > scribbling on the same hashtable location. So if another process comes > along to join the "pack", it might get the highest active block number, > or the lowest, or something in between. Even discounting the possibility > that it gets random bits because it managed to read the value > non-atomically, how well is that really going to work? Could that be addressed by something along the lines of "Don't update the block number if the new value is between the current number and the current number - update count/2" (update count being the number of blocks a backend reads before updating the hashtable again)? At least this would prevent needless updates to nearly the same block number, which would help NUMA-style architectures like Opteron systems, I think. > Another thing that we have to consider is that the actual block read > requests will likely be distributed among the "pack leaders", rather > than all being issued by one process. AFAIK this will destroy the > kernel's ability to do read-ahead, because it will fail to recognize > that sequential reads are being issued --- any single process is *not* > reading sequentially, and I think that read-ahead scheduling is > generally driven off single-process behavior rather than the emergent > behavior of the whole system. (Feel free to contradict me if you've > actually read any kernel code that does this.) It might still be better > than unsynchronized reads, but it'd be leaving a lot on the table. I don't see why a single process wouldn't be reading sequentially - as far as I understood the original proposal, the current block number from the hashtable is only used as a starting point for sequential scans. After that, each backend reads sequentially until the end of the table, I believe, no? greetings, Florian Pflug
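A sketch of the damping rule Florian suggests, with invented names (UPDATE_COUNT stands for his "update count", the number of blocks a backend reads between hint updates). The write is skipped when the new position is no more than UPDATE_COUNT/2 blocks behind the value already stored, so near-identical pack members stop dirtying the same shared cache line:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define UPDATE_COUNT 16              /* illustrative value only */

    static void
    maybe_update_hint(volatile BlockNumber *slot, BlockNumber new_blkno)
    {
        BlockNumber cur = *slot;

        /* Skip the write if new_blkno lies in [cur - UPDATE_COUNT/2, cur]. */
        if (new_blkno <= cur && cur - new_blkno <= UPDATE_COUNT / 2)
            return;

        *slot = new_blkno;
    }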
Florian G. Pflug wrote: > Tom Lane wrote: >> There are other issues for the "no lock" approach that Jeff proposes. >> Suppose that we have three or four processes that are actually doing >> synchronized scans of the same table. They will have current block >> numbers that are similar but probably not identical. They will all be >> scribbling on the same hashtable location. So if another process comes >> along to join the "pack", it might get the highest active block number, >> or the lowest, or something in between. Even discounting the possibility >> that it gets random bits because it managed to read the value >> non-atomically, how well is that really going to work? It might in fact work quite well. Assuming that all the active blocks are in memory, the process that joins the pack will quickly catch up with the rest. I'd like to see some testing to be sure, though... >> Another thing that we have to consider is that the actual block read >> requests will likely be distributed among the "pack leaders", rather >> than all being issued by one process. AFAIK this will destroy the >> kernel's ability to do read-ahead, because it will fail to recognize >> that sequential reads are being issued --- any single process is *not* >> reading sequentially, and I think that read-ahead scheduling is >> generally driven off single-process behavior rather than the emergent >> behavior of the whole system. (Feel free to contradict me if you've >> actually read any kernel code that does this.) It might still be better >> than unsynchronized reads, but it'd be leaving a lot on the table. > I don't see why a single process wouldn't be reading sequentially - As far > as I understood the original proposal, the current blocknumber from the > hashtable is only used as a starting point for sequential scans. After > that, > each backend reads sequentiall until the end of the table I believe, no? When the read is satisfied from the shared memory cache, it won't make it to the kernel. From the kernel's point of view, the pattern looks something like this:
A 1--4--7-9
B -2---6---
C --3-5--8-
where letters denote processes, numbers are block numbers read, and time goes from left to right. When you look at one process at a time, the pattern looks random, though it's constantly moving forward. It's only when you look at all the processes together that you see that it's sequential. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Florian G. Pflug wrote: >> I don't see why a single process wouldn't be reading sequentially - As far >> as I understood the original proposal, the current blocknumber from the >> hashtable is only used as a starting point for sequential scans. After >> that, >> each backend reads sequentiall until the end of the table I believe, no? > When the read is satisfies from shared mem cache, it won't make it to > the kernel. Right, and the *whole point* of this proposal is that only one of the N processes doing a synchronized scan actually does a read of any particular block. The problem is that they're not synchronized well enough to ensure that it's always the same one. It strikes me that there's still another thing we'd have to deal with to make this work nicely. If you have N processes doing a synchronized scan, then each block that reaches shared memory is going to be hit N times in fairly short succession --- which is going to be enough to convince the bufmgr to keep it in memory for awhile. Thus a synchronized seqscan is likely to end up flushing buffer cache in a way that independent seqscans could not. This could probably be dealt with by changing the ReadBuffer API to allow the caller to say "don't increment the refcount on this page", or some such. But it's some more work to be done if we intend to take this idea seriously. regards, tom lane
On Mon, 2006-12-04 at 21:46 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Since I am not storing any pointers, and since the information is only > > really a hint, I don't need to do any locking on that page. > > If you think that, you need not bother to submit the patch. (Hint: > as soon as you consider more than one table at a time, it doesn't work, > even ignoring the question of inconsistent reads.) > I believe I accounted for that case correctly. In the unlikely event that it gets a wrong hint value from the table, it would either: (1) See that the value is larger than rs_nblocks and start from 0 like normal (2) Not know that the value is wrong because it is a valid block for that relation, and it would use it anyway. In the case of #2, everything should be fine, because an arbitrary value is no worse than the arbitrary value of 0 we use right now. I could always store the Oid in the table also, so that #2 wouldn't happen unless the table was truncated (by TRUNCATE, CLUSTER, or VACUUM FULL) after a scan. Regards,Jeff Davis
On Tue, 2006-12-05 at 10:38 +0000, Heikki Linnakangas wrote: > Hannu Krosing wrote: > > Ühel kenal päeval, E, 2006-12-04 kell 21:46, kirjutas Tom Lane: > >> Jeff Davis <pgsql@j-davis.com> writes: > >>> Since I am not storing any pointers, and since the information is only > >>> really a hint, I don't need to do any locking on that page. > >> If you think that, you need not bother to submit the patch. (Hint: > >> as soon as you consider more than one table at a time, it doesn't work, > >> even ignoring the question of inconsistent reads.) > > > > Why does it not work ? > > > > Are you suggesting, that another backend can somegow see only some bits > > of page number being written ? > > > > What problems do you see in multiple table case ? > > You need to manage adding and removing relations from the shared memory > structure. Which typically needs locking. > > Assuming that relations are added or removed relatively seldom, you > might get away with a table of (Oid, BlockNumber) pairs, working around > the fact that the table might get messed up every now and then, and when > it does, you'll lose the benefits until it gets corrected. But it seems > really messy to me. > I don't think it's messy if we think of it as a hint system. I think my problem is that I am calling it Synchronized Scanning, which is a misnomer because I am not enforcing the synchronization. I call it that because that's what Simon Riggs called it back when I first suggested it, but in reality it's just a hint system. I see that right now we start every scan at 0, which is a completely arbitrary value. I am just trying to add a little intelligence to it with the hint, even if sometimes (in very rare circumstances) the intelligence turns out to be just as dumb as picking the value 0. That being said, I can just lock the hint table (the shared memory hint table, not the relation) and only update the hint every K pages, as Neil Conway suggested when I first proposed it. If we find a K small enough so the feature is useful, but large enough that we're sure there won't be contention, this might be a good option. However, I don't know that we would eliminate the contention, because if K is a constant (rather than random), the backends would still all want to update that shared memory table at the same time. Regards, Jeff Davis
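A sketch of the "update only every K pages" variant Jeff mentions (K, the struct and the lock comment are placeholders). Each scan keeps a private counter so the shared slot is written at most once per K pages, which bounds contention if a lock were ever put around the write; randomizing K per backend, as Jeff hints, would keep the reports from clustering:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define K 64    /* pages between reports; illustrative only */

    typedef struct SeqScanPos
    {
        BlockNumber cur_block;
        int         pages_since_report;
    } SeqScanPos;

    static void
    advance_scan(SeqScanPos *scan, volatile BlockNumber *slot)
    {
        scan->cur_block++;
        if (++scan->pages_since_report >= K)
        {
            /* With locking, this store would be bracketed by acquire/release. */
            *slot = scan->cur_block;
            scan->pages_since_report = 0;
        }
    }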
On Tue, 2006-12-05 at 10:49 -0500, Tom Lane wrote: > "Florian G. Pflug" <fgp@phlo.org> writes: > > Hannu Krosing wrote: > >> The worst that can happen, is a hash collision, in which case you lose > >> the benefits of sync scans, but you wont degrade compared to non-sync > >> scans > > > But it could cause "mysterious" performance regressions, no? > > There are other issues for the "no lock" approach that Jeff proposes. > Suppose that we have three or four processes that are actually doing > synchronized scans of the same table. They will have current block > numbers that are similar but probably not identical. They will all be > scribbling on the same hashtable location. So if another process comes > along to join the "pack", it might get the highest active block number, > or the lowest, or something in between. Even discounting the possibility > that it gets random bits because it managed to read the value > non-atomically, how well is that really going to work? > That's an empirical question. I expect it will work, since any active scan will have a significant cache trail behind it. > Another thing that we have to consider is that the actual block read > requests will likely be distributed among the "pack leaders", rather > than all being issued by one process. AFAIK this will destroy the > kernel's ability to do read-ahead, because it will fail to recognize > that sequential reads are being issued --- any single process is *not* > reading sequentially, and I think that read-ahead scheduling is > generally driven off single-process behavior rather than the emergent > behavior of the whole system. (Feel free to contradict me if you've > actually read any kernel code that does this.) It might still be better > than unsynchronized reads, but it'd be leaving a lot on the table. > That's a very interesting point. I had assumed read-ahead was at the kernel level without really caring what processes issued the requests, but it may be system-dependent. I think that's what the elevator (or I/O scheduler, or whatever it's called) is supposed to do. I'll see if I can find some relevant source code in a Linux or FreeBSD box. The controller certainly wouldn't care about process IDs, however. Regards,Jeff Davis
On Tue, 2006-12-05 at 11:45 -0500, Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > > Florian G. Pflug wrote: > >> I don't see why a single process wouldn't be reading sequentially - As far > >> as I understood the original proposal, the current blocknumber from the > >> hashtable is only used as a starting point for sequential scans. After > >> that, > >> each backend reads sequentiall until the end of the table I believe, no? > > > When the read is satisfies from shared mem cache, it won't make it to > > the kernel. > > Right, and the *whole point* of this proposal is that only one of the N > processes doing a synchronized scan actually does a read of any > particular block. The problem is that they're not synchronized well > enough to ensure that it's always the same one. If readahead is per-process (rather than for the entire system), my implementation would probably fall short. I would like to try to find out for sure whether this is the case or not, or whether it's system-dependent. > It strikes me that there's still another thing we'd have to deal with > to make this work nicely. If you have N processes doing a synchronized > scan, then each block that reaches shared memory is going to be hit N > times in fairly short succession --- which is going to be enough to > convince the bufmgr to keep it in memory for awhile. Thus a > synchronized seqscan is likely to end up flushing buffer cache in a way > that independent seqscans could not. Interesting. That may be an important consideration. If a bunch of backends read the block though, I don't see it as a major loss if it hangs around to see if one more backend needs it. > This could probably be dealt with by changing the ReadBuffer API to > allow the caller to say "don't increment the refcount on this page", > or some such. But it's some more work to be done if we intend to > take this idea seriously. > Thank you for the input. Regards, Jeff Davis
On Tue, 2006-12-05 at 15:54 +0100, Florian G. Pflug wrote: > Hannu Krosing wrote: > > The worst that can happen, is a hash collision, in which case you lose > > the benefits of sync scans, but you wont degrade compared to non-sync > > scans > > But it could cause "mysterious" performance regressions, no? > Image that your app includes two large tables, which are both > scannen frequently. Suppose that synchronous scanning gives this > use-case a noticeable performance boost. Now, you dump and reload > your schema, and suddently the hashes of oids of those tables > collide. You percieve a noticeable drop in performance that you > can neither explain nor fix without a rather deep understanding > of postgres internals. > A good point. We can hopefully make this relatively rare with a decent hashing algorithm (right now I just mod by the table size), and a reasonable-sized table. For your problem to occur, you'd need two relations which are both scanned very frequently at the same time and also have a hash collision. We can mitigate the problem by not reporting to the table unless the table is a minimum size (perhaps related to effective_cache_size), so that tables that are in memory anyway don't write over another table's hint. Or, we could use a dynamic structure, use locking, and only write a hint every K pages, or something similar. Regards,Jeff Davis
On Tue, 2006-12-05 at 11:28 +0100, Zeugswetter Andreas ADI SD wrote: > > To open up the implementation topic: > > > > My current patch starts a new sequential scan on a given relation at > the > > page of an already-running scan. > > I think it should start earlier to make use of the already cached part > of the table. > It is hard to determine, or guess how much is still in cache though. > As a start you could maybe use some (small to be safe) fraction of > effective_cache_size > or shared_buffers. Interesting idea, however it's a little tricky. For instance, we don't want to start before the oldest cached page, or we have lost the benefit (just as if we got a bad hint). It also depends on a lot of other parts of the code, like how scan-resistant the shared buffers are, and how scan-resistant the OS buffer cache is. Thoughts? Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Tue, 2006-12-05 at 11:45 -0500, Tom Lane wrote: >> ... If you have N processes doing a synchronized >> scan, then each block that reaches shared memory is going to be hit N >> times in fairly short succession --- which is going to be enough to >> convince the bufmgr to keep it in memory for awhile. Thus a >> synchronized seqscan is likely to end up flushing buffer cache in a way >> that independent seqscans could not. > Interesting. That may be an important consideration. If a bunch of > backends read the block though, I don't see it as a major loss if it > hangs around to see if one more backend needs it. Sure, it should hang around for awhile, and will. The problem is that its lifetime will be artificially inflated, so that the seqscan ends up kicking out other blocks that are really of greater importance, rather than recycling its own old blocks as it should. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Sure, it should hang around for awhile, and will. The problem is that > its lifetime will be artificially inflated, so that the seqscan ends up > kicking out other blocks that are really of greater importance, rather > than recycling its own old blocks as it should. I thought you had switched this all to a clock sweep algorithm. The clock sweep method is supposed to be especially resistant to this since it doesn't really matter how many times something is accessed, only whether it has been accessed recently. As long as all the synchronized scans get their work done before the clock comes around the block will be recycled the next time the clock sweeps around and it finds nobody else is interested in that block. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> Sure, it should hang around for awhile, and will. The problem is that >> its lifetime will be artificially inflated, so that the seqscan ends up >> kicking out other blocks that are really of greater importance, rather >> than recycling its own old blocks as it should. > I thought you had switched this all to a clock sweep algorithm. Yeah ... it's a clock sweep with counter. A buffer's counter is incremented by each access and decremented when the sweep passes over it. So multiple accesses allow the buffer to survive longer. For a large seqscan you really would rather the counter stayed at zero, because you want the buffers to be recycled when the sweep comes back the first time. regards, tom lane
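For readers following along, here is a minimal standalone sketch of the clock sweep Tom describes (names invented; the real code lives in the buffer manager and also has to respect pins, which are omitted here). Each access bumps a capped per-buffer counter, the sweep hand decrements it, and a buffer is only recycled once its counter has fallen to zero:

    #include <stdint.h>

    #define NBUFFERS        1024
    #define MAX_USAGE_COUNT 5

    static uint8_t usage_count[NBUFFERS];
    static int     sweep_hand = 0;

    static void
    buffer_accessed(int buf)
    {
        if (usage_count[buf] < MAX_USAGE_COUNT)
            usage_count[buf]++;
    }

    static int
    clock_sweep_find_victim(void)
    {
        for (;;)
        {
            int buf = sweep_hand;

            sweep_hand = (sweep_hand + 1) % NBUFFERS;
            if (usage_count[buf] == 0)
                return buf;          /* recycle this one */
            usage_count[buf]--;      /* give it one more lap */
        }
    }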
On Tue, Dec 05, 2006 at 09:09:39AM -0800, Jeff Davis wrote: > That being said, I can just lock the hint table (the shared memory hint > table, not the relation) and only update the hint every K pages, as Niel > Conway suggested when I first proposed it. If we find a K small enough > so the feature is useful, but large enough that we're sure there won't > be contention, this might be a good option. However, I don't know that > we would eliminate the contention, because if K is a constant (rather > than random), the backends would still all want to update that shared > memory table at the same time. What about some algorithm where only one backend will update the hint entry (perhaps the first one, or the slowest one (ie: lowest page number))? ISTM that would eliminate a lot of contention, and if you get clever with the locking scheme you could probably allow other backends to do non-blocking reads except when the page number passes a 4-byte value (assuming 4-byte int updates are atomic). Granted, coming up with a way to put one backend in charge of this is non-trivial, but it's another option. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, 2006-12-05 at 13:25 -0500, Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: > > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> Sure, it should hang around for awhile, and will. The problem is that > >> its lifetime will be artificially inflated, so that the seqscan ends up > >> kicking out other blocks that are really of greater importance, rather > >> than recycling its own old blocks as it should. > > > I thought you had switched this all to a clock sweep algorithm. > > Yeah ... it's a clock sweep with counter. A buffer's counter is > incremented by each access and decremented when the sweep passes over > it. So multiple accesses allow the buffer to survive longer. For a > large seqscan you really would rather the counter stayed at zero, > because you want the buffers to be recycled when the sweep comes back > the first time. If you focus the backends together by synchronizing them you definitely also then need to solve the problem of false cache reinforcement. I envisaged that we would handle the problem by having a large SeqScan reuse its previous buffers so it would avoid polluting the cache. If we kept track of how many backends were in link-step together (a "Conga") we would be able to check that a block had not been used by anyone but the Conga members. So we'd need rules about - when to allow a Conga to form (if file is very big we check, if not we don't, no real need for exact synchronisation in all cases) - how to join a Conga - how to leave a Conga if you fall behind The cost of synchronisation (i.e. LWlocks) is much less than the cost of non-synchronisation (i.e. lots more I/O). -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-12-06 at 19:04 +0000, Simon Riggs wrote: > I envisaged that we would handle the problem by having a large SeqScan > reuse its previous buffers so it would avoid polluting the cache. If we > kept track of how many backends were in link-step together (a "Conga") > we would be able to check that a block had not been used by anyone but > the Conga members. > Interesting idea. > So we'd need rules about > - when to allow a Conga to form (if file is very big we check, if not we > don't, no real need for exact synchronisation in all cases) > - how to join a Conga > - how to leave a Conga if you fall behind > How would one leave a Conga? What if it's in a user-defined function, or if it's using the results for a nested loop join or something? So we can't wait for a backend to leave voluntarily; it would have to be a plan where, when the backend goes back to get another block, it realizes that the train has left the station and goes it alone (or makes its own Conga). If you make the join/leave operations such that there is no resistance at all (no timeout or anything), then it becomes the same as my non-synchronized proposal, right? > The cost of synchronisation (i.e. LWlocks) is much less than the cost of > non-synchronisation (i.e. lots more I/O). > If we can have a good way to synchronize it, that sounds good to me. Regards, Jeff Davis
On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: > If you make the join/leave operations such that there is no resistance > at all (no timeout or anything), then it becomes the same as my non- > synchronized proposal, right? Teamwork requires some synchronisation to be effective, but yeh there needs to be a way to leave the Conga if it's not working for you/them. I think we need the synchronisation to make concurrent scans effective, plus Brownian Scans doesn't have the same ring to it :-) I'm still willing to help if you're willing to take this further. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: >> If you make the join/leave operations such that there is no resistance >> at all (no timeout or anything), then it becomes the same as my non- >> synchronized proposal, right? > Teamwork requires some synchronisation to be effective, but yeh there > needs to be a way to leave the Conga if its not working for you/them. Seems like an unworkably complicated mess :-(. Error cases alone would be a nightmare. The good thing about Jeff's proposal is that it's very robust --- there isn't anything you have to clean up if you abort a scan. I think all we need as far as buffer management goes is what I suggested before, namely have seqscans on large tables tell bufmgr not to increment the usage counter for their accesses. If it stays zero then the buffers will be recycled as soon as the sweep gets back to them, which is exactly what we want. The window for additional backends to pick up on the sync scan is then however much of shared_buffers aren't occupied by blocks being accessed normally. If we have syncscan members that are spread out over any significant range of the table's blocks, then the problem of setting the hint properly becomes a lot more pressing. We'd like incoming joiners to start at a fairly low block number, ie not become one of the "pack leaders" but one of the "trailers" --- since the higher they start, the more blocks they'll need to re-read at the end of their cycles, yet those are blocks they could have had "for free" if they'd started as a trailer. I don't see any cheap way to bias the behavior in that direction, though. regards, tom lane
On Wed, 2006-12-06 at 12:48 -0600, Jim C. Nasby wrote: > On Tue, Dec 05, 2006 at 09:09:39AM -0800, Jeff Davis wrote: > > That being said, I can just lock the hint table (the shared memory hint > > table, not the relation) and only update the hint every K pages, as Niel > > Conway suggested when I first proposed it. If we find a K small enough > > so the feature is useful, but large enough that we're sure there won't > > be contention, this might be a good option. However, I don't know that > > we would eliminate the contention, because if K is a constant (rather > > than random), the backends would still all want to update that shared > > memory table at the same time. > > What about some algorithm where only one backend will update the hint > entry (perhaps the first one, or the slowest one (ie: lowest page > number))? ISTM that would eliminate a lot of contention, and if you get > clever with the locking scheme you could probably allow other backends > to do non-blocking reads except when the page number passes a 4-byte > value (assuming 4-byte int updates are atomic). > If we have one backend in charge, how does it pass the torch when it finishes the scan? I think you're headed back in the direction of an independent "scanning" process. That's not unreasonable, but there are a lot of other issues to deal with. One thought of mine goes something like this: A scanner process starts up and scans with a predefined length of a cache trail in the shared_buffers, perhaps a chunk of buffers used like a circular list (so it doesn't interfere with caching). When a new scan starts, it could request a block from this scanner process and begin the scan there. If the new scan keeps up with the scanner process, it will always be getting cached data. If it falls behind, the request turns into a new block request. In theory, the scan could actually catch back up to the scanner process after falling behind. We could use a meaningful event (like activity on a particular relation) to start/stop the scanner process. It's just another idea, but I'm still not all that sure that synchronization is necessary. Does anyone happen to have an answer on whether OS-level readahead is system-wide, or per-process? I expect that it's system wide, but Tom raised the issue and it may be a drawback if some OSs do per-process readahead. Regards,Jeff Davis
On Wed, 2006-12-06 at 19:55 +0000, Simon Riggs wrote: > On Wed, 2006-12-06 at 11:46 -0800, Jeff Davis wrote: > > > If you make the join/leave operations such that there is no resistance > > at all (no timeout or anything), then it becomes the same as my non- > > synchronized proposal, right? > > Teamwork requires some synchronisation to be effective, but yeh there > needs to be a way to leave the Conga if its not working for you/them. > > I think we need the synchronisation to make concurrent scans effective, > plus Brownian Scans doesnt have the same ring to it :-) How about "Scan Hinting", or "Smart Scans"? Although, "smart" might be too strong a word ;) Regards,Jeff Davis
On Dec 7, 2006, at 8:01 , Jeff Davis wrote: > On Wed, 2006-12-06 at 19:55 +0000, Simon Riggs wrote: >> Teamwork requires some synchronisation to be effective, but yeh there >> needs to be a way to leave the Conga if its not working for you/them. >> >> I think we need the synchronisation to make concurrent scans >> effective, >> plus Brownian Scans doesnt have the same ring to it :-) > > How about "Scan Hinting", or "Smart Scans"? I must admit, I think "Conga Scan" has a certain ring to it :) Michael Glaesemann grzm seespotcode net
On Wed, 2006-12-06 at 15:12 -0500, Tom Lane wrote: > I think all we need as far as buffer management goes is what I suggested > before, namely have seqscans on large tables tell bufmgr not to > increment the usage counter for their accesses. If it stays zero then > the buffers will be recycled as soon as the sweep gets back to them, > which is exactly what we want. This is good, yet it addresses only the non-cache spoiling behaviour. BTW, we would need something special to spot an Append node with multiple SeqScans occurring on partitioned tables in sequence. Individual scans may not be that large but overall the set could be huge. There's not much point implementing behaviour such as "table must be bigger than 10x shared_buffers" because it works directly against the role of partitioning. > The window for additional backends to > pick up on the sync scan is then however much of shared_buffers aren't > occupied by blocks being accessed normally. Non-cache spoiling means window reduction, so you can't catch it by chance. > If we have syncscan members that are spread out over any significant > range of the table's blocks, then the problem of setting the hint > properly becomes a lot more pressing. We'd like incoming joiners to > start at a fairly low block number, ie not become one of the "pack > leaders" but one of the "trailers" --- since the higher they start, > the more blocks they'll need to re-read at the end of their cycles, > yet those are blocks they could have had "for free" if they'd started > as a trailer. I don't see any cheap way to bias the behavior in that > direction, though. Well, that's the problem. Agree about the pack leaders/trailers. What's the solution? You can synchronise every block or every N blocks, but otherwise: how will you know the optimal point to start the scan? That will require *some* synchronisation to be optimal. Most scans don't go at the same rate naturally. Different plans do different amounts of work between page requests. Allowing them to go their own individual ways would be very wasteful of I/O resources, so making some people wait for others is an essential aspect of efficiency, just like it is with the OS. Synchronisation costs very little in comparison with the I/O it saves. Perhaps there are ways of doing this without central control, so that backend error conditions don't need to be considered. Seems like a simple timeout would be sufficient to exclude backends from the Conga, which would be sufficient to handle error cases and special cases. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-12-06 at 15:12 -0500, Tom Lane wrote: > > Teamwork requires some synchronisation to be effective, but yeh there > > needs to be a way to leave the Conga if its not working for you/them. > > If we have syncscan members that are spread out over any significant > range of the table's blocks, then the problem of setting the hint > properly becomes a lot more pressing. We'd like incoming joiners to > start at a fairly low block number, ie not become one of the "pack > leaders" but one of the "trailers" --- since the higher they start, > the more blocks they'll need to re-read at the end of their cycles, > yet those are blocks they could have had "for free" if they'd started > as a trailer. I don't see any cheap way to bias the behavior in that > direction, though. > This is speculation, but I suspect that there already is somewhat of a bias favoring the lower block numbers. If you have a pack near the head, and a couple trailers 1000 blocks behind, the trailers are likely to be moving through those blocks very quickly -- and reporting their location much more often -- than the scans at the head that are awaiting a disk fetch. Even if there are 50 in the pack, and 2 trailing, at any point in time it's more likely that the last block number reported was reported by a trailing scan. The pack might all report at once when they finally get the block, but will be promptly overwritten by the continuous stream of reports from trailing scans. However, if my analysis was really true, one might wonder how those scans got that far behind in the first place. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Even if there are 50 in the pack, and 2 trailing, at any point in time > it's more likely that the last block number reported was reported by a > trailing scan. The pack might all report at once when they finally get > the block, but will be promptly overwritten by the continuous stream of > reports from trailing scans. > However, if my analysis was really true, one might wonder how those > scans got that far behind in the first place. Yah. Something I was idly wondering about: suppose we teach ReadBuffer to provide an indication whether it had to issue an actual read() or found the block in cache? Could it be useful to not report the block location to the hint area if we had to actually read()? That would eliminate the immediate "pack leader" from the equation. The problem is that it seems to break things for the case of the first follower joining a seqscan, because the original leader would never report. Anyone see the extra idea needed to make this work? regards, tom lane
On Thu, Dec 07, 2006 at 12:46:34AM -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > it's more likely that the last block number reported was reported by a > > trailing scan. The pack might all report at once when they finally get > > the block, but will be promptly overwritten by the continuous stream of > > reports from trailing scans. > > > However, if my analysis was really true, one might wonder how those > > scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? The first reader won't find an entry in shared memory, so it will know it's the first. It could then either always update, or it could check to see if anyone else has updated shared memory behind its back; at that point it could switch to only updating on an actual read. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
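One possible reading of Tom's suggestion combined with Jim's first-reader rule, sketched with invented names (found_in_cache stands for the hypothetical indication ReadBuffer would return): a scan that saw no usable hint when it started treats itself as the leader and keeps reporting unconditionally, while other scans report only blocks that were found in cache, so the hint tends to track the cached trail rather than the physical-read frontier.

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define INVALID_BLOCK ((BlockNumber) 0xFFFFFFFF)

    typedef struct SyncScanState
    {
        volatile BlockNumber *slot;  /* shared hint slot for this relation */
        int                   i_am_leader;
    } SyncScanState;

    static void
    scan_start(SyncScanState *s, volatile BlockNumber *slot, BlockNumber nblocks)
    {
        s->slot = slot;
        /* No usable hint yet: assume we are the first (leading) scan. */
        s->i_am_leader = (*slot == INVALID_BLOCK || *slot >= nblocks);
    }

    static void
    report_block(SyncScanState *s, BlockNumber blkno, int found_in_cache)
    {
        if (found_in_cache || s->i_am_leader)
            *s->slot = blkno;
    }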
> Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? 1) How about 2 location hints, one for the actual reads and one for the followers? The actual read() case will write to the one, the followers to the other. There could be some magic to make a follower which is getting close to the pack leader report less often, so that the follower hint still covers cached buffers far enough behind (tricky business to tune this) and doesn't skip them immediately. Then some magic to figure out which one to follow... or even try both (at the start of the scan only), and go with the one you had no actual read() on, or the smaller one if both had a read() or both didn't. 2) Make the pack leader (as in, it had to do an actual read()) report less often than the followers (say every 100 read()s, while followers report every 10 -- numbers made up, no idea what would work best here). This will make the follower hint more likely to be picked up by new scans, thus better using the existing cache... Cheers, Csaba.
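A rough sketch of Csaba's two-hint layout, again with invented names, and with the made-up 100/10 report intervals from option 2 folded in. How a newly starting scan picks between the two slots is left open, as in the mail. BlockNumber and the stdbool/stdint includes are as in the earlier sketches.

    /* Two hint slots per table, as in Csaba's option 1 (hypothetical layout). */
    typedef struct DualScanHint
    {
        uint32_t    rel_oid;
        BlockNumber leader_block;       /* written when a scan had to read() */
        BlockNumber follower_block;     /* written when the block was cached */
    } DualScanHint;

    /* Made-up intervals: the leader reports rarely, followers report often. */
    #define LEADER_REPORT_EVERY    100
    #define FOLLOWER_REPORT_EVERY   10

    static void
    report_dual(DualScanHint *hint, BlockNumber blockno,
                bool found_in_cache, unsigned long blocks_scanned)
    {
        if (found_in_cache)
        {
            if (blocks_scanned % FOLLOWER_REPORT_EVERY == 0)
                hint->follower_block = blockno;
        }
        else
        {
            if (blocks_scanned % LEADER_REPORT_EVERY == 0)
                hint->leader_block = blockno;
        }
    }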
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> Even if there are 50 in the pack, and 2 trailing, at any point in time >> it's more likely that the last block number reported was reported by a >> trailing scan. The pack might all report at once when they finally get >> the block, but will be promptly overwritten by the continuous stream of >> reports from trailing scans. > >> However, if my analysis was really true, one might wonder how those >> scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? What if there were two block numbers (last_disk_read_blocknr and last_cache_read_blocknr) stored per table, together with a timestamp telling when the last update occurred? Values older than, let's say, a second or so would be treated as "outdated". If last_cache_read_blocknr isn't outdated, it would be used as a starting point for seqscans; otherwise last_disk_read_blocknr would be used if that one isn't outdated. If both are outdated, the scan would start at the lower of the two block numbers. greetings, Florian Pflug
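Florian's two-timestamp scheme might look something like this. The struct is made up, the one-second cutoff is the figure from his mail, BlockNumber is the typedef from the earlier sketches, and locking and clock granularity are ignored.

    #include <stdbool.h>
    #include <time.h>

    typedef struct TimedScanHint
    {
        BlockNumber last_disk_read_blocknr;
        BlockNumber last_cache_read_blocknr;
        time_t      disk_stamp;
        time_t      cache_stamp;
    } TimedScanHint;

    #define HINT_TTL_SECONDS 1      /* "a second or so", per the mail */

    static BlockNumber
    choose_start_block(const TimedScanHint *h)
    {
        time_t now = time(NULL);
        bool cache_fresh = (now - h->cache_stamp) <= HINT_TTL_SECONDS;
        bool disk_fresh  = (now - h->disk_stamp)  <= HINT_TTL_SECONDS;

        if (cache_fresh)
            return h->last_cache_read_blocknr;
        if (disk_fresh)
            return h->last_disk_read_blocknr;

        /* Both outdated: start at the lower of the two block numbers. */
        return (h->last_disk_read_blocknr < h->last_cache_read_blocknr)
            ? h->last_disk_read_blocknr
            : h->last_cache_read_blocknr;
    }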
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 13:25 -0500, Tom Lane wrote: >> Gregory Stark <stark@enterprisedb.com> writes: >> > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> >> Sure, it should hang around for awhile, and will. The problem is that >> >> its lifetime will be artificially inflated, so that the seqscan ends up >> >> kicking out other blocks that are really of greater importance, rather >> >> than recycling its own old blocks as it should. >> >> > I thought you had switched this all to a clock sweep algorithm. >> >> Yeah ... it's a clock sweep with counter. A buffer's counter is >> incremented by each access and decremented when the sweep passes over >> it. So multiple accesses allow the buffer to survive longer. For a >> large seqscan you really would rather the counter stayed at zero, >> because you want the buffers to be recycled when the sweep comes back >> the first time. > > If you focus the backends together by synchronizing them you definitely > also then need to solve the problem of false cache reinforcement. Well the clock sweep algorithm is inherently resistant to this. The classical algorithm only had a single bit to indicate whether the buffer had been touched since hand came around last. If you have a counter you just need to limit the size of that counter. The larger the maximum value of the counter the slower the algorithm will be to learn new access patterns. The smaller the counter is the shorter its memory will be. I do vaguely recall discussing having a counter in school but IIRC it was just a two bit counter which the clock sweep shifted right. So a buffer hit by a synchronized would require two sweeps to get flushed instead of one but that would be true regardless of how many synchronized scans hit it. I think the no-counter (ie, single bit of state) is probably exactly what you want in this case. If there are synchronized scans running there may be more coming along behind you. You do want to keep those buffers around long enough to be used by them. If there aren't any synchronized scans running then you probably want to just keep reusing the same buffers which is what will happen when the hand comes around the next time if nobody else has looked at. It may be enough to just cap the use-counter at different values when you increment it depending on the type of access. In a sequential scan don't increment it beyond 1 (ie, just a single bit of state). In a random I/O increment it up to 2. It may also be useful to have different initial use values. Or maybe it makes more sense to look at initial values. If you load a buffer in a sequential scan perhaps its use count could start at 0 whereas random i/o could start at 1. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Csaba Nagy" <nagy@ecircle-ag.com> writes: > 1) How about 2 location hints, one for the actual reads, and one for the > followers ? Having the location of the pack leader may be useful for another reason too: It would give the trailers a chance to avoid competing with the pack leader. I would expect no matter how far back they start they'll quickly catch up to the pack leader since they're only doing cached reads and the pack leader is having to do actual i/o. When they catch up they're going to compete for the lead which may a) kill the kernel read-ahead and b) cause contention. If they know where the oldest trailer is and where the leader is then when they catch up they could sleep for a small amount of time to give the leader a chance to make progress. I would suggest they should only sleep once and if they catch up again they should try to take over and become the new leader. If the leader ever finds its reading a cached block and someone else has passed it it should switch to follower behaviour. That way the leadership does change hands periodically but not on every read. Just in case the first leader is, say, running a nested loop or something we don't want the follower to be unnecessarily limited to its speed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Matthew O'Connor wrote: > > Could we have a counter in shared memory that keeps a count on the > number seq scanners currently working on a table? While count = 1, we > report all blocks regardless if it's from disk or from buffer. Once > count > 1 we only report buffer reads. Would this have locking problems? I don't know, but it would require some extra care to make sure that the counter is decremented when, for example, a transaction is aborted or a backend is killed. > BTW, it seems to me that this is all based on the assumption that > followers will have no problem keeping up with the pack leader. Suppose > my process does a lot of other processing and can't keep up with the > pack despite the fact that it's getting all it's data from the buffer. > Now we have effectively have two different seq scans going on. Does my > process need to recognize that it's not keeping up and not report it's > blocks? That's what I was wondering about all these schemes as well. What we could do is, instead of doing a sequential scan in strict order, have each backend keep a bitmap of the pages it has processed during the scan and read the pages in the order they're available in cache. If a backend misses a page in the synchronized scan, for example, it could issue a random I/O after reading all the other pages to fetch it separately, instead of starting another seq scan at a different location and "derailing the train". I don't know what the exact algorithm used to make decisions on when and how to fetch each page would be, but the bitmaps would be in backend private memory. And presumably it could be used with bitmap heap scans as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
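A sketch of the per-backend bitmap Heikki mentions, with placeholder types (BlockNumber as before) and no error handling. The policy for when to fall back to individual fetches is exactly the part the mail leaves open; the sweep at the end is just one simple possibility.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Private, per-backend bitmap of pages already processed (sketch only). */
    typedef struct ScanBitmap
    {
        unsigned char *bits;
        BlockNumber    nblocks;
    } ScanBitmap;

    static void
    bitmap_init(ScanBitmap *bm, BlockNumber nblocks)
    {
        bm->nblocks = nblocks;
        bm->bits = calloc((nblocks + 7) / 8, 1);    /* NULL check omitted */
    }

    static void
    bitmap_mark(ScanBitmap *bm, BlockNumber blk)
    {
        bm->bits[blk / 8] |= (unsigned char) (1 << (blk % 8));
    }

    static bool
    bitmap_done(const ScanBitmap *bm, BlockNumber blk)
    {
        return (bm->bits[blk / 8] >> (blk % 8)) & 1;
    }

    /*
     * After following the pack, fetch whatever we missed with individual
     * reads instead of restarting a second sequential scan and "derailing
     * the train".
     */
    static void
    fetch_missed_pages(ScanBitmap *bm, void (*read_block)(BlockNumber))
    {
        for (BlockNumber blk = 0; blk < bm->nblocks; blk++)
            if (!bitmap_done(bm, blk))
                read_block(blk);
    }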
>>> On Thu, Dec 7, 2006 at 4:57 AM, in message <874ps8km81.fsf@enterprisedb.com>, Gregory Stark <stark@enterprisedb.com> wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: >> >> If you focus the backends together by synchronizing them you definitely >> also then need to solve the problem of false cache reinforcement. > > Well the clock sweep algorithm is inherently resistant to this. The > classical > algorithm only had a single bit to indicate whether the buffer had been > touched since hand came around last. > > If you have a counter you just need to limit the size of that counter. The > larger the maximum value of the counter the slower the algorithm will be to > learn new access patterns. The smaller the counter is the shorter its memory > will be. > > I do vaguely recall discussing having a counter in school but IIRC it was > just > a two bit counter which the clock sweep shifted right. So a buffer hit by a > synchronized would require two sweeps to get flushed instead of one but that > would be true regardless of how many synchronized scans hit it. > > I think the no-counter (ie, single bit of state) is probably exactly what you > want in this case. If there are synchronized scans running there may be more > coming along behind you. You do want to keep those buffers around long > enough > to be used by them. If there aren't any synchronized scans running then you > probably want to just keep reusing the same buffers which is what will > happen > when the hand comes around the next time if nobody else has looked at. > > It may be enough to just cap the use-counter at different values when you > increment it depending on the type of access. In a sequential scan don't > increment it beyond 1 (ie, just a single bit of state). In a random I/O > increment it up to 2. It may also be useful to have different initial use > values. Or maybe it makes more sense to look at initial values. If you load > a > buffer in a sequential scan perhaps its use count could start at 0 whereas > random i/o could start at 1. For what little it's worth, the algorithm which tested out best for me in a similar situation in a database I wrote, which was used in production environments for decades, was to use a 16-bit use counter which was effectively multiplied by 0.75 (with shifts and an add) in a sweep triggered by an I/O count. I know we tried different multiples of the buffer slot count for the trigger point, to find the sweet spot; unfortunately, I don't remember where that was. This seemed to provide pretty good protection against transient activity flushing the buffer. -Kevin
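The decay Kevin describes is easy to sketch: multiplying by 0.75 with shifts and an add is just count/2 + count/4. The sweep trigger below is a placeholder, since the "sweet spot" multiple of the buffer slot count is exactly the number he says he no longer remembers.

    #include <stdint.h>

    /* count * 0.75 == (count >> 1) + (count >> 2), done with shifts and an add. */
    static inline uint16_t
    decay_use_count(uint16_t count)
    {
        return (uint16_t) ((count >> 1) + (count >> 2));
    }

    /* Sweep all counters once every N physical reads; N is a placeholder
     * standing in for the tuned multiple of the buffer slot count. */
    #define SWEEP_EVERY_N_READS 4096

    static void
    count_io_and_maybe_decay(uint16_t *use_counts, int nbuffers,
                             unsigned long *io_count)
    {
        if (++(*io_count) % SWEEP_EVERY_N_READS != 0)
            return;

        for (int i = 0; i < nbuffers; i++)
            use_counts[i] = decay_use_count(use_counts[i]);
    }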
On Thu, 2006-12-07 at 00:46 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > it's more likely that the last block number reported was reported by a > > trailing scan. The pack might all report at once when they finally get > > the block, but will be promptly overwritten by the continuous stream of > > reports from trailing scans. > > > However, if my analysis was really true, one might wonder how those > > scans got that far behind in the first place. > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > to provide an indication whether it had to issue an actual read() or > found the block in cache? Could it be useful to not report the block > location to the hint area if we had to actually read()? That would > eliminate the immediate "pack leader" from the equation. The problem > is that it seems to break things for the case of the first follower > joining a seqscan, because the original leader would never report. > Anyone see the extra idea needed to make this work? > My initial thought is that eliminating the immediate pack leader won't do much good, because there's a good chance at least one scan is following very closely (we would hope). Also, there's a lot of focus here on not starting where the pack leader is. The assumption is that the follower will be closer to the end of a cache trail. Let's attack the problem more directly by taking the hint and subtracting a number before choosing it as the start location. Let's call the number M, which would be a fraction of the effective_cache_size. Each scan would issue no reports at all until the current page minus the starting page number for that scan was greater than M. If we choose M conservatively, there's a very high chance that there will exist a continuous stream of cached pages between where the new scan starts (the hint - M) and the head of the pack. This keeps the code very simple. I could modify my patch to do this in a few minutes. However, I do agree that it's always worth thinking about ways to use the information that we do have about what's in cache at higher layers; and also useful if higher layers can tell the caching layers what to cache or not. Regards,Jeff Davis
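A sketch of the offset rule in Jeff's mail: start M blocks behind the reported hint and stay silent until the scan has covered M blocks of its own. Deriving M from effective_cache_size is the idea from the mail; the wraparound handling and names here are only illustrative and may not match the patch.

    #include <stdbool.h>

    /* Start M blocks behind the hint, wrapping around the start of the table. */
    static BlockNumber
    choose_start(BlockNumber hint, BlockNumber offset_m, BlockNumber nblocks)
    {
        if (offset_m >= nblocks)
            return 0;                       /* table smaller than the offset */
        if (hint >= offset_m)
            return hint - offset_m;
        return nblocks - (offset_m - hint); /* wrap backwards past block 0 */
    }

    /* Issue no reports until we have moved more than M blocks past our start. */
    static bool
    allowed_to_report(BlockNumber current, BlockNumber start,
                      BlockNumber offset_m, BlockNumber nblocks)
    {
        BlockNumber progressed = (current >= start)
            ? current - start
            : nblocks - start + current;    /* scan has wrapped */

        return progressed > offset_m;
    }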
On Thu, 2006-12-07 at 00:32 -0600, Jim C. Nasby wrote: > On Thu, Dec 07, 2006 at 12:46:34AM -0500, Tom Lane wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > Even if there are 50 in the pack, and 2 trailing, at any point in time > > > it's more likely that the last block number reported was reported by a > > > trailing scan. The pack might all report at once when they finally get > > > the block, but will be promptly overwritten by the continuous stream of > > > reports from trailing scans. > > > > > However, if my analysis was really true, one might wonder how those > > > scans got that far behind in the first place. > > > > Yah. Something I was idly wondering about: suppose we teach ReadBuffer > > to provide an indication whether it had to issue an actual read() or > > found the block in cache? Could it be useful to not report the block > > location to the hint area if we had to actually read()? That would > > eliminate the immediate "pack leader" from the equation. The problem > > is that it seems to break things for the case of the first follower > > joining a seqscan, because the original leader would never report. > > Anyone see the extra idea needed to make this work? > > The first reader won't find an entry in shared memory, so it will know > it's the first. It could then either always update, or it could check to That requires that scans clear the hint when they finish, right? I don't think that would be difficult, but my current patch doesn't do that. > see if anyone else has updated shared memory behind it's back; at that > point it could switch to only updating on an actual read. Interesting idea. Regards,Jeff Davis
On Thu, Dec 07, 2006 at 04:14:54PM +0000, Heikki Linnakangas wrote: > >BTW, it seems to me that this is all based on the assumption that > >followers will have no problem keeping up with the pack leader. Suppose > >my process does a lot of other processing and can't keep up with the > >pack despite the fact that it's getting all it's data from the buffer. > >Now we have effectively have two different seq scans going on. Does my > >process need to recognize that it's not keeping up and not report it's > >blocks? > > That's what I was wondering about all these schemes as well. What we > could do, is that instead of doing a sequential scan, each backend keeps > a bitmap of pages it has processed during the scan, and read the pages > in the order they're available in cache. If a backend misses a page in > the synchronized scan, for example, it could issue a random I/O after > reading all the other pages to fetch it separately, instead of starting > another seq scan at a different location and "derailing the train". I > don't know what the exact algorithm used to make decisions on when and > how to fetch each page would be, but the bitmaps would be in backend > private memory. And presumably it could be used with bitmap heap scans > as well. First, I think that we're getting ahead of ourselves by worrying about how to deal with diverging scans right now. Having said that... There are 3 ranges of seqscan speeds we need to be concerned with [1], with 2 break-points between them:

1) Seqscan is I/O bound; it can completely keep up with incoming blocks
   -- break-point: maximum single-scan rate (the I/O system is saturated)
2) Seqscan is CPU-bound if there's nothing else competing for I/O
   -- break-point: maximum double-scan rate (the best that 2 scans in different places can achieve)
3) Seqscan is slower than 2 competing category 1 scans are today

[1] I'm assuming only 1 slow scan and any number of fast scans. If there are actually 2 slow scans, things will differ.

If every scan is running in the first category, then there is no issue. Since all scans are completely I/O bound, they'll all process blocks as soon as they're in from the drive. If there is a scan in the 3rd category, we don't want to try and hold faster scans down to the rate of the slow scan, because that's actually slower than if we let a synchronized set of fast scans compete with the slow scan, even though there's a lot of seeking involved. The problem is if we have a scan in the 2nd category. As soon as it falls far enough behind that the system is forced to issue physical reads to disk for it, the performance of all the scans will plummet. And as the slow scan falls further and further behind, seek times will get longer and longer. To put real numbers to this, on my machine, having 2 dd's running on one file far enough apart that caching isn't helping cuts the rate from ~35MB/s to ~31MB/s (incidentally, starting a second dd about 10 seconds after the first gives the second scan a rate of 40MB/s). So on my machine, if a slow scan is doing more than 22MB/s, it's better to restrict all scans to its speed rather than have 2 divergent scans. OTOH, if the slow scan is doing less than 22MB/s, it's better not to hold the fast scans back. Of course, having people run that check themselves might not be terribly practical, and if there's more than one slow scan involved, those measurements are likely to be meaningless. So I'm wondering if there's some technique we can use that will make this selection process more 'automatic'.
The only thought I've come up with is to apply a delay to every read from the fast scans if there is a slow scan that's in danger of falling past the cache size. My theory is that if the delay is set right, then a category 2 scan will eventually catch back up, at which point we could either just let it drive the scan, or we could un-throttle the fast scans until the slow scan is in danger of falling behind again. If instead the slow scan is category 3, then even with the delay it will still fall behind. If the database can detect that situation, it can then un-throttle the fast scans and just let things diverge. Ideally, if another category 3 scan came along, we would sync it to the existing cat 3 scan. Unfortunately, that still means needing to come up with what that delay setting should be. Perhaps if we're lucky, there's a pretty fixed relationship between the maximum speed of a scan and the speed of two competing scans. If that's the case, I think we could set the delay to be a fraction of the average length of a scan. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
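A very rough sketch of the throttle described above. Every number here (cache window, danger fraction, delay) is invented, and the category-3 detection is simplified to "the gap kept growing even while we were throttled"; a real version would need the tuning discussed in the mail.

    #include <stdbool.h>
    #include <unistd.h>

    /* All of these numbers are invented for illustration. */
    #define CACHE_WINDOW_BLOCKS   100000   /* roughly the OS cache, in blocks */
    #define DANGER_FRACTION       0.8      /* throttle at 80% of the window */
    #define FAST_SCAN_DELAY_USEC  200      /* pause added to each fast-scan read */

    static bool throttled = false;
    static bool gave_up   = false;  /* set once the slow scan is deemed category 3 */

    /*
     * Called by a fast scan once in a while with the current and previous
     * gap (in blocks) between itself and the slowest scan.
     */
    static void
    maybe_throttle_fast_scan(long gap_now, long gap_at_last_check)
    {
        bool in_danger = gap_now > (long) (CACHE_WINDOW_BLOCKS * DANGER_FRACTION);

        if (gave_up)
            return;                          /* scans have been allowed to diverge */

        if (in_danger && throttled && gap_now > gap_at_last_check)
        {
            /* Still losing ground despite the delay: treat the slow scan as
             * category 3, stop throttling, and let the scans diverge. */
            throttled = false;
            gave_up = true;
            return;
        }

        throttled = in_danger;
        if (throttled)
            usleep(FAST_SCAN_DELAY_USEC);    /* slow down so the laggard can catch up */
    }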