Thread: Seq scans roadmap
Here's my roadmap for the "scan-resistant buffer cache" and "synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the freelist, forcing a lot of WAL flushes. Instead, use a backend-private ring of shared buffers that are recycled. This is what Simon's "scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to be accessed in the near future, therefore it should be recycled. If vacuum doesn't dirty the page, it's best to reuse the buffer immediately for the next page. However, if the buffer is dirty (and not just because we set hint bits), we ought to delay writing it to disk until the corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I think the ring should be dynamically sized. Start with a small ring, and whenever you need to do a WAL flush to write a dirty buffer, increase the ring size. On every full iteration through the ring, decrease its size to trim down an unnecessarily large ring.

This only alters the behavior of vacuums, and it's pretty safe to say it won't get worse than what we have now. In the future, we can use the buffer ring for seqscans as well; more on that on step 3.
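A minimal standalone sketch of the dynamic ring sizing described above (grow when recycling a dirty buffer would force a WAL flush, shrink a little on every full pass). The constants, growth factor, and function names are hypothetical; this is not Simon's or Heikki's actual code:

/* Illustrative simulation only -- not PostgreSQL internals. */
#include <stdbool.h>
#include <stdio.h>

#define MIN_RING  4
#define MAX_RING  1024

static int ring_size = MIN_RING;
static int ring_pos  = 0;

/* Pretend check: would writing this buffer require flushing WAL first? */
static bool needs_wal_flush(bool dirty) { return dirty; }

static void recycle_next_buffer(bool dirty)
{
    if (dirty && needs_wal_flush(dirty) && ring_size < MAX_RING)
        ring_size *= 2;             /* grow: give the WAL writer more slack */

    if (++ring_pos >= ring_size)    /* completed one full pass through the ring */
    {
        ring_pos = 0;
        if (ring_size > MIN_RING)
            ring_size--;            /* trim an unnecessarily large ring */
    }
}

int main(void)
{
    for (int page = 0; page < 100; page++)
        recycle_next_buffer(page % 10 == 0);   /* pretend every 10th page is dirtied */
    printf("final ring size: %d\n", ring_size);
    return 0;
}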
2. Implement the list/table of last/ongoing seq scan positions. This is Jeff's "synchronized scans" patch. When a seq scan starts on a table larger than some threshold, it starts from where the previous seq scan is currently, or where it ended. This will synchronize the scans so that for two concurrent scans the total I/O is halved in the best case. There should be no other effect on performance.

If you have a partitioned table, or a union of multiple tables, or any other plan where multiple seq scans are performed in arbitrary order, this change won't change the order the partitions are scanned in, and therefore won't ensure they will be synchronized.

Now that we have both pieces of the puzzle in place, it's time to consider what more we can do with them:

3A. To take advantage of the "cache trail" of a previous seq scan, scan backwards from where the previous seq scan ended, until you hit a buffer that's not in cache.

This will allow taking advantage of the buffer cache even if the table doesn't fit completely in RAM. That can make a big difference if the table size is just slightly bigger than RAM, and can avoid the nasty surprise when a table grows beyond RAM size and queries start taking minutes instead of seconds.

This should be a non-controversial change on its own from a performance point of view. No query should get slower, and some will become faster. But see step 3B:

3B. Currently, a sequential scan on a large table spoils the buffer cache by evicting other pages from the cache. In CVS HEAD, as soon as the table is larger than shared_buffers, the pages in the buffer cache won't be used to speed up running the same query again, and there's no reason to believe the pages read in would be more useful than any other page in the database, and in particular the pages that were in the buffer cache before the huge seq scan. If the table being scanned is > 5 * shared_buffers, the scan will evict every other page from the cache if there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large, even with change 3A to read the pages that are in cache first, using all of shared_buffers to cache the table will only speed up the query by 10%. We should not spoil the cache for such a small gain, and should use the local buffer ring strategy instead. It's better to make queries that are slow anyway a little bit slower than to make queries that are normally really fast slow.

As you may notice, 3A and 3B are at odds with each other. We can implement both, but you can't use both strategies in the same scan. Therefore we need decision logic of some kind to figure out which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1 * shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers   -> start from last read page, keep in cache
> 5 * shared_buffers   -> start from last read page, use buffer ring

I'm not sure about the constants; we might need to make them GUC variables as Simon argued, but that would be the general approach.

In the future, I'm envisioning a smarter algorithm to size the local buffer ring. Take all buffers with usage_count=0, plus a few with usage_count=1 from the clock sweep. That way, if there are a lot of buffers in the buffer cache that are seldom used, we'll use more buffers to cache the large scan, and vice versa. And no matter how large the scan, it wouldn't blow all buffers out of the cache. But if you execute the same query again, the buffers left in the cache from the last scan were apparently useful, so we use a bigger ring this time.

I'm going to do this incrementally, and we'll see how far we get for 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up Simon's patch (step 1), run some performance tests with vacuum, and submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
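For concreteness, the size-based heuristic sketched in the message above might look roughly like the following. The thresholds, names, and the page-based interface are hypothetical (they could well end up as GUCs instead of constants, as discussed):

/* Illustrative sketch only -- not part of any posted patch. */
#include <stdio.h>

typedef enum {
    SCAN_NORMAL,        /* start at page 0, keep pages in cache */
    SCAN_SYNC,          /* start from last reported position, keep in cache */
    SCAN_SYNC_RING      /* start from last position, use a small buffer ring */
} ScanStrategy;

static ScanStrategy choose_scan_strategy(long table_pages, long shared_buffers)
{
    if (table_pages < shared_buffers / 10)
        return SCAN_NORMAL;
    if (table_pages < 5 * shared_buffers)
        return SCAN_SYNC;
    return SCAN_SYNC_RING;
}

int main(void)
{
    long shared_buffers = 131072;               /* e.g. 1 GB of 8 kB pages */
    long sizes[] = { 1000, 300000, 2000000 };
    for (int i = 0; i < 3; i++)
        printf("%ld pages -> strategy %d\n", sizes[i],
               choose_scan_strategy(sizes[i], shared_buffers));
    return 0;
}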
Heikki,

On 3A: In practice, the popular modern OSes (BSD/Linux/Solaris/etc.) implement dynamic I/O caching. Experiments have shown that the benefit of re-using the PG buffer cache on large sequential scans is vanishingly small when the buffer cache size is small compared to the system memory. Since this is a normal and recommended situation (the OS I/O cache is auto-tuning and easy to administer, etc.), IMO the effort to optimize buffer cache reuse for seq scans > 1 x buffer cache size is not worthwhile.

On 3B: The scenario described is "multiple readers seq scanning a large table and sharing bufcache", but in practice this is not a common situation. The common situation is "multiple queries joining several small tables to one or more large tables that are >> 1 x bufcache". In the common scenario, the dominant factor is the ability to keep the small tables in bufcache (or I/O cache for that matter) while running the I/O-bound large table scans as fast as possible.

To that point - an important factor in achieving max I/O rate for large tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache. This is commonly in the range of 512KB -> 2MB, which is only important when considering a bound on the size of the ring buffer. The effect has been demonstrated to be significant - in the 20%+ range. Another thing to consider is the use of readahead inside the heapscan, in which case sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very significant improvements in I/O and query speeds on DSS workloads. I would expect benefits on OLTP as well.

The modifications you suggest here may not have the following properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache, use a ring buffer for I/O that is ~32KB to minimize L2 cache pollution

- Luke
Luke Lonergan wrote: > On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc) > implement dynamic I/O caching. The experiments have shown that benefit > of re-using PG buffer cache on large sequential scans is vanishingly > small when the buffer cache size is small compared to the system memory. > Since this is a normal and recommended situation (OS I/O cache is > auto-tuning and easy to administer, etc), IMO the effort to optimize > buffer cache reuse for seq scans > 1 x buffer cache size is not > worthwhile. That's interesting. Care to share the results of the experiments you ran? I was thinking of running tests of my own with varying table sizes. The main motivation here is to avoid the sudden drop in performance when a table grows big enough not to fit in RAM. See attached diagram for what I mean. Maybe you're right and the effect isn't that bad in practice. I'm thinking of attacking 3B first anyway, because it seems much simpler to implement. > On 3B: The scenario described is "multiple readers seq scanning large > table and sharing bufcache", but in practice this is not a common > situation. The common situation is "multiple queries joining several > small tables to one or more large tables that are >> 1 x bufcache". In > the common scenario, the dominant factor is the ability to keep the > small tables in bufcache (or I/O cache for that matter) while running > the I/O bound large table scans as fast as possible. How is that different from what I described? > To that point - an important factor in achieving max I/O rate for large > tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache. > This is commonly in the range of 512KB -> 2MB, which is only important > when considering a bound on the size of the ring buffer. The effect has > been demonstrated to be significant - in the 20%+ range. Another thing > to consider is the use of readahead inside the heapscan, in which case > sizes >= 32KB are very effective. Yeah I remember the discussion on the L2 cache a while back. What do you mean with using readahead inside the heapscan? Starting an async read request? > The modifications you suggest here may not have the following > properties: > - don't pollute bufcache for seqscan of tables > 1 x bufcache > - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to > minimize L2 cache pollution So the difference is that you don't want 3A (the take advantage of pages already in buffer cache) strategy at all, and want the buffer ring strategy to kick in earlier instead. Am I reading you correctly? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki, > That's interesting. Care to share the results of the > experiments you ran? I was thinking of running tests of my > own with varying table sizes. Yah - it may take a while - you might get there faster. There are some interesting effects to look at between I/O cache performance and PG bufcache, and at those speeds the only tool I've found that actually measures scan rate in PG is VACUUM. "SELECT COUNT(*)" measures CPU consumption in the aggregation node, not scan rate. Note that the copy from I/O cache to PG bufcache is where the L2 effect is seen. > The main motivation here is to avoid the sudden drop in > performance when a table grows big enough not to fit in RAM. > See attached diagram for what I mean. Maybe you're right and > the effect isn't that bad in practice. There are going to be two performance drops, first when the table doesn't fit into PG bufcache, the second when it doesn't fit in bufcache + I/O cache. The second is severe, the first is almost insignificant (for common queries). > How is that different from what I described? My impression of your descriptions is that they overvalue the case where there are multiple scanners of a large (> 1x bufcache) table such that they can share the "first load" of the bufcache, e.g. your 10% benefit for table = 10x bufcache argument. I think this is a non-common workload, rather there are normally many small tables and several large tables such that sharing the PG bufcache is irrelevant to the query speed. > Yeah I remember the discussion on the L2 cache a while back. > > What do you mean with using readahead inside the heapscan? > Starting an async read request? Nope - just reading N buffers ahead for seqscans. Subsequent calls use previously read pages. The objective is to issue contiguous reads to the OS in sizes greater than the PG page size (which is much smaller than what is needed for fast sequential I/O). > > The modifications you suggest here may not have the following > > properties: > > - don't pollute bufcache for seqscan of tables > 1 x bufcache > > - for tables > 1 x bufcache use a ringbuffer for I/O that > is ~ 32KB to > > minimize L2 cache pollution > > So the difference is that you don't want 3A (the take > advantage of pages already in buffer cache) strategy at all, > and want the buffer ring strategy to kick in earlier instead. > Am I reading you correctly? Yes, I think the ring buffer strategy should be used when the table size is > 1 x bufcache and the ring buffer should be of a fixed size smaller than L2 cache (32KB - 128KB seems to work well). - Luke
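A standalone sketch of the readahead idea Luke describes: issue one large contiguous read per N pages into a staging buffer, and serve page-sized slices from it on subsequent calls. This is only an illustration under assumed sizes (8 kB pages, 256 kB reads) and a made-up file name; it is not the Greenplum implementation:

/* Illustrative demo only -- not PostgreSQL or Greenplum code. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ      8192
#define CHUNK_PAGES  32            /* 32 * 8 kB = 256 kB per physical read */

static char    chunk[CHUNK_PAGES * PAGE_SZ];
static ssize_t chunk_len = -1;     /* bytes valid in chunk, -1 = empty */
static off_t   chunk_off;          /* file offset the chunk was read from */

/* Return a pointer to page 'pageno', refilling the staging buffer as needed. */
static const char *read_page(int fd, long pageno)
{
    off_t want = (off_t) pageno * PAGE_SZ;

    if (chunk_len < 0 || want < chunk_off ||
        want + PAGE_SZ > chunk_off + chunk_len)
    {
        chunk_off = want - (want % (off_t) (CHUNK_PAGES * PAGE_SZ));
        chunk_len = pread(fd, chunk, sizeof(chunk), chunk_off);
        if (chunk_len < want - chunk_off + PAGE_SZ)
            return NULL;                     /* EOF or error */
    }
    return chunk + (want - chunk_off);
}

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "/tmp/bigtable", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    const char *page;
    long n = 0;
    while ((page = read_page(fd, n)) != NULL)
        n++;                                 /* per-page processing would go here */
    printf("scanned %ld pages using %d kB staging reads\n",
           n, CHUNK_PAGES * PAGE_SZ / 1024);
    close(fd);
    return 0;
}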
Luke Lonergan wrote: >> What do you mean with using readahead inside the heapscan? >> Starting an async read request? > > Nope - just reading N buffers ahead for seqscans. Subsequent calls use > previously read pages. The objective is to issue contiguous reads to > the OS in sizes greater than the PG page size (which is much smaller > than what is needed for fast sequential I/O). Are you filling multiple buffers in the buffer cache with a single read-call? The OS should be doing readahead for us anyway, so I don't see how just issuing multiple ReadBuffers one after each other helps. > Yes, I think the ring buffer strategy should be used when the table size > is > 1 x bufcache and the ring buffer should be of a fixed size smaller > than L2 cache (32KB - 128KB seems to work well). I think we want to let the ring grow larger than that for updating transactions and vacuums, though, to avoid the WAL flush problem. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Nope - just reading N buffers ahead for seqscans. Subsequent
> calls use previously read pages. The objective is to issue
> contiguous reads to the OS in sizes greater than the PG page
> size (which is much smaller than what is needed for fast
> sequential I/O).

Problem here is that either you issue the large read into throwaway private memory and hope that when you later read 8k you get the page from the OS buffercache, or you need ScatterGather IO and a way to grab 32 buffers at once.

> Yes, I think the ring buffer strategy should be used when the
> table size is > 1 x bufcache and the ring buffer should be of
> a fixed size smaller than L2 cache (32KB - 128KB seems to work well).

How do you merge those two objectives? It seems the ring buffer needs to be at least as large as the contiguous read size. Thus you would need at least a 256k ring buffer. Better yet, have twice the IO size as the ring buffer size, so two sessions can alternately take the lead while the other session still blocks on a prev page. Modern L2 cache is 8 MB, so 512k seems no problem?

Andreas
> >> What do you mean with using readahead inside the heapscan? > >> Starting an async read request? > > > > Nope - just reading N buffers ahead for seqscans. Subsequent calls > > use previously read pages. The objective is to issue > contiguous reads > > to the OS in sizes greater than the PG page size (which is much > > smaller than what is needed for fast sequential I/O). > > Are you filling multiple buffers in the buffer cache with a > single read-call? yes, needs vector or ScatterGather IO. > The OS should be doing readahead for us > anyway, so I don't see how just issuing multiple ReadBuffers > one after each other helps. Last time I looked OS readahead was only comparable to 32k blocked reads. 256k blocked reads still perform way better. Also when the OS is confronted with an IO storm the 256k reads perform way better than OS readahead. Andreas
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes: >> Are you filling multiple buffers in the buffer cache with a >> single read-call? > > yes, needs vector or ScatterGather IO. I would expect that to get only moderate improvement. To get the full benefit I would think you would want to either fire off a separate thread to do the read-ahead, use libaio, or funnel the read-ahead requests to a separate thread like our bgwriter only it would be a bgreader or something like that. >> The OS should be doing readahead for us >> anyway, so I don't see how just issuing multiple ReadBuffers >> one after each other helps. > > Last time I looked OS readahead was only comparable to 32k blocked reads. > 256k blocked reads still perform way better. Also when the OS is confronted > with an IO storm the 256k reads perform way better than OS readahead. Well that's going to depend on the OS. Last I checked Linux's readahead logic is pretty straightforward and doesn't try to do any better than 32k readahead and is easily fooled. However I wouldn't be surprised if that's changed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> I'm going to do this incrementally, and we'll see how far we get for
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
> Simon's patch (step 1), run some performance tests with vacuum, and
> submit a patch for that. Then I'll move to Jeff's patch (step 2).

I think it's best to postpone 3A (one aspect of my patch). There are a couple of reasons:

1) The results didn't show enough benefit yet.
2) Complex interactions with Simon's patch.

I think it's an area of research for the future, but for now I just want to concentrate on the most important aspect of my patch: the synchronized scanning (#2 in your list).

Regards,
Jeff Davis
On Tue, 2007-05-08 at 07:47 -0400, Luke Lonergan wrote: > Heikki, > > On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc) > implement dynamic I/O caching. The experiments have shown that benefit > of re-using PG buffer cache on large sequential scans is vanishingly > small when the buffer cache size is small compared to the system memory. > Since this is a normal and recommended situation (OS I/O cache is > auto-tuning and easy to administer, etc), IMO the effort to optimize > buffer cache reuse for seq scans > 1 x buffer cache size is not > worthwhile. > I think 3A is still an interesting idea, but I agree that it needs some results to back it up. Let's save 3A for the next release so that we have more time to see. > To that point - an important factor in achieving max I/O rate for large > tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache. > This is commonly in the range of 512KB -> 2MB, which is only important > when considering a bound on the size of the ring buffer. The effect has > been demonstrated to be significant - in the 20%+ range. Another thing > to consider is the use of readahead inside the heapscan, in which case > sizes >= 32KB are very effective. One thing I'd like us to keep in mind is to have a reasonable number of buffers active for a sequential scan. If the number is too small, my sync scans might not even work. Right now my patch only reports every 16 pages, so 32KB (=4 pages) is not going to work for sync scans (I suppose only testing will tell). Regards,Jeff Davis
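A toy sketch of the synchronized-scan bookkeeping Jeff describes: a small shared table mapping a relation to the block its most recent scan last reported, updated only every 16 pages. The slot count, the trivial hashing, and the absence of locking are all assumptions made for the example; this is not the actual patch:

/* Illustrative sketch only. */
#include <stdio.h>

#define SCAN_SLOTS    16
#define REPORT_EVERY  16        /* report position every 16 pages, as in the patch */

typedef struct { unsigned rel; long block; } ScanPos;
static ScanPos scan_positions[SCAN_SLOTS];

/* Remember where a scan of 'rel' currently is (overwrite on collision). */
static void ss_report(unsigned rel, long block)
{
    if (block % REPORT_EVERY != 0)
        return;
    ScanPos *p = &scan_positions[rel % SCAN_SLOTS];
    p->rel = rel;
    p->block = block;
}

/* Where should a new scan of 'rel' start?  Page 0 if nobody scanned it recently. */
static long ss_get_start(unsigned rel)
{
    ScanPos *p = &scan_positions[rel % SCAN_SLOTS];
    return (p->rel == rel) ? p->block : 0;
}

int main(void)
{
    unsigned rel = 16384;                /* pretend table OID */
    for (long b = 0; b < 1000; b++)      /* first scan walks 1000 blocks */
        ss_report(rel, b);
    printf("second scan would start at block %ld\n", ss_get_start(rel));
    return 0;
}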
> >> Are you filling multiple buffers in the buffer cache with a single > >> read-call? > > > > yes, needs vector or ScatterGather IO. > > I would expect that to get only moderate improvement. The vast improvement comes from 256k blocksize. > To get > the full benefit I would think you would want to either fire > off a separate thread to do the read-ahead, use libaio, or > funnel the read-ahead requests to a separate thread like our > bgwriter only it would be a bgreader or something like that. I like bgreader :-) But that looks even more difficult than grabbing 32 [scattered or contiguous] buffers at once. Especially in a situation where there is no concurrent load it would be nice to do CPU work while waiting for the next read ahead IO. If there is enough parallel CPU load it is actually not so important. So I opt, that on a high load server you get nearly all benefit without any sort of aio. > >> The OS should be doing readahead for us anyway, so I don't see how > >> just issuing multiple ReadBuffers one after each other helps. > > > > Last time I looked OS readahead was only comparable to 32k blocked reads. > > 256k blocked reads still perform way better. Also when the OS is > > confronted with an IO storm the 256k reads perform way better than OS readahead. > > Well that's going to depend on the OS. Last I checked Linux's > readahead logic is pretty straightforward and doesn't try to > do any better than 32k readahead and is easily fooled. > However I wouldn't be surprised if that's changed. My test was on AIX, 32 or 64k seem quite common, at least as default setting. Also on some OS's (like HPUX) OS readahead and writebehind strategy changes with large IO blocksizes, imho beneficially. Andreas
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote: > Here's my roadmap for the "scan-resistant buffer cache" and > "synchronized scans" patches. > > 1. Fix the current vacuum behavior of throwing dirty buffers to the > freelist, forcing a lot of WAL flushes. Instead, use a backend-private > ring of shared buffers that are recycled. This is what Simon's > "scan-resistant buffer manager" did. > > The theory here is that if a page is read in by vacuum, it's unlikely to > be accessed in the near future, therefore it should be recycled. If > vacuum doesn't dirty the page, it's best to reuse the buffer immediately > for the next page. However, if the buffer is dirty (and not just because > we set hint bits), we ought to delay writing it to disk until the > corresponding WAL record has been flushed to disk. > > Simon's patch used a fixed size ring of buffers that are recycled, but I > think the ring should be dynamically sized. Start with a small ring, and > whenever you need to do a WAL flush to write a dirty buffer, increase > the ring size. On every full iteration through the ring, decrease its > size to trim down an unnecessarily large ring. > > This only alters the behavior of vacuums, and it's pretty safe to say it > won't get worse than what we have now. I think thats too much code, why not just leave it as it is. Would a dynamic buffer be substantially better? If not, why bother? > In the future, we can use the > buffer ring for seqscans as well; more on that on step 3. There was clear benefit for that. You sound like you are suggesting to remove the behaviour for Seq Scans, which wouldn't make much sense?? > 2. Implement the list/table of last/ongoing seq scan positions. This is > Jeff's "synchronized scans" patch. When a seq scan starts on a table > larger than some threshold, it starts from where the previous seq scan > is currently, or where it ended. This will synchronize the scans so that > for two concurrent scans the total I/O is halved in the best case. There > should be no other effect on performance. > > If you have a partitioned table, or union of multiple tables or any > other plan where multiple seq scans are performed in arbitrary order, > this change won't change the order the partitions are scanned and won't > therefore ensure they will be synchronized. > > > Now that we have both pieces of the puzzle in place, it's time to > consider what more we can do with them: > > > 3A. To take advantage of the "cache trail" of a previous seq scan, scan > backwards from where the previous seq scan ended, until a you hit a > buffer that's not in cache. > > This will allow taking advantage of the buffer cache even if the table > doesn't fit completely in RAM. That can make a big difference if the > table size is just slightly bigger than RAM, and can avoid the nasty > surprise when a table grows beyond RAM size and queries start taking > minutes instead of seconds. > > This should be a non-controversial change on its own from performance > point of view. No query should get slower, and some will become faster. > But see step 3B: > > 3B. Currently, sequential scans on a large table spoils the buffer cache > by evicting other pages from the cache. 
In CVS HEAD, as soon as the > table is larger than shared_buffers, the pages in the buffer won't be > used to speed up running the same query again, and there's no reason to > believe the pages read in would be more useful than any other page in > the database, and in particular the pages that were in the buffer cache > before the huge seq scan. If the table being scanned is > 5 * > shared_buffers, the scan will evict every other page from the cache if > there's no other activity in the database (max usage_count is 5). > > If the table is much larger than shared_buffers, say 10 times as large, > even with the change 3B to read the pages that are in cache first, using > all shared_buffers to cache the table will only speed up the query by > 10%. We should not spoil the cache for such a small gain, and use the > local buffer ring strategy instead. It's better to make queries that are > slow anyway a little bit slower, than making queries that are normally > really fast, slow. > > > As you may notice, 3A and 3B are at odds with each other. We can > implement both, but you can't use both strategies in the same scan. Not sure I've seen any evidence of that. Most scans will be solo and so should use the ring buffer, since there is clear evidence of that. If there were evidence to suggest the two patches conflict then we should turn off the ring buffer only when concurrent scans are in progress (while remembering that concurrent scans will not typically overlap as much as the synch scan tests show and so for much of their execution they too will be solo). > Therefore we need to have decision logic of some kind to figure out > which strategy is optimal. > > A simple heuristic is to decide based on the table size: > > < 0.1*shared_buffers -> start from page 0, keep in cache (like we do now) > < 5 * shared_buffers -> start from last read page, keep in cache > > 5 * shared_buffers -> start from last read page, use buffer ring > > I'm not sure about the constants, we might need to make them GUC > variables as Simon argued, but that would be the general approach. If you want to hardcode it, I'd say use the ring buffer on scans of 1000 blocks or more, or we have a GUC. Sizing things to shared_buffers isn't appropriate because of the effects of partitioning, as I argued in my last post, which I think is still valid. > Thoughts? Everyone happy with the roadmap? I think separating the patches is now the best way forward, though both are good. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Hi,

In reference to the seq scans roadmap, I have just submitted a patch that addresses some of the concerns.

The patch does this:

1. for small relations (smaller than 60% of the buffer pool), use the current logic
2. for big relations:
   - use a ring buffer in heap scan
   - pin the first 12 pages when the scan starts
   - on consumption of every 4 pages, read and pin the next 4 pages
   - invalidate used pages of the scan so they do not force out other useful pages

4 files changed: bufmgr.c, bufmgr.h, heapam.c, relscan.h

If there is interest, I can submit another scan patch that returns N tuples at a time, instead of the current one-at-a-time interface. This improves code locality and further improves performance by another 10-20%.

For TPCH 1G tables, we are seeing more than 20% improvement in scans on the same hardware.

----- PATCHED VERSION -----

gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2117.025 ms

----- ORIGINAL CVS HEAD VERSION -----

gptest=# select count(*) from lineitem;
  count
---------
 6001215
(1 row)

Time: 2722.441 ms

Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.
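A rough simulation of the ring behavior described in the patch above: prefetch and pin 12 pages up front, then refill 4 more each time 4 have been consumed. The page counts follow the description; everything else (names, the print-based "I/O") is invented for illustration:

/* Illustrative simulation only -- not the posted patch. */
#include <stdio.h>

#define PREFETCH_INIT  12
#define REFILL_BATCH    4

static long pinned_high = -1;        /* highest page currently read+pinned */

static void pin_range(long from, long to, long last)
{
    if (from > to || from > last) return;
    if (to > last) to = last;
    printf("  read+pin pages %ld..%ld\n", from, to);
    pinned_high = to;
}

static void scan(long nblocks)
{
    pin_range(0, PREFETCH_INIT - 1, nblocks - 1);
    for (long page = 0; page < nblocks; page++)
    {
        /* ... return tuples from 'page', then drop its pin ... */
        if ((page + 1) % REFILL_BATCH == 0)
            pin_range(pinned_high + 1, pinned_high + REFILL_BATCH, nblocks - 1);
    }
}

int main(void)
{
    scan(24);                            /* pretend 24-block table */
    return 0;
}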
> In reference to the seq scans roadmap, I have just submitted
> a patch that addresses some of the concerns.
>
> The patch does this:
>
> 1. for small relations (smaller than 60% of the buffer pool), use the current logic
> 2. for big relations:
>    - use a ring buffer in heap scan
>    - pin the first 12 pages when the scan starts
>    - on consumption of every 4 pages, read and pin the next 4 pages
>    - invalidate used pages of the scan so they do not force out other useful pages

A few comments regarding the effects:

I do not see how this speedup could be caused by readahead, so what are the effects? (It should make no difference to do the CPU work for count(*) in between reading each block when the pages are not dirtied.) Is the improvement solely reduced CPU because no search for a free buffer is needed, and/or L2 cache locality?

What effect does the advance pinning have, avoid vacuum?

A 16 x 8k page ring is too small to allow the needed IO blocksize of 256k. The readahead is done 4 x one page at a time (=32k). What is the reasoning behind 1/4 of the ring for readahead (why not 1/2)? Is 3/4 the trail for followers and bgwriter?

I think in anticipation of doing a single IO call for more than one page, the KillAndReadBuffer function should be split into two parts: one that does the killing for n pages, and one that does the reading for n pages. Killing n before reading n would also have the positive effect of grouping perhaps-needed writes (not interleaving them with the reads).

I think the 60% NBuffers is a very good starting point. I would only introduce a GUC when we see evidence that it is needed (I agree with Simon's partitioning comments, but I'd still wait and see).

Andreas
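A sketch of the split Andreas suggests: evict ("kill") a batch of ring slots first, so that any writes are grouped together, and only then issue one large read for the next batch of pages. The function names are invented for the example and are not the names in the posted patch:

/* Illustrative sketch only. */
#include <stdbool.h>
#include <stdio.h>

#define BATCH 4

typedef struct { long pageno; bool dirty; } Slot;

static void write_page(Slot *s)          { printf("  write page %ld\n", s->pageno); s->dirty = false; }
static void read_pages(long from, int n) { printf("  read pages %ld..%ld in one call\n", from, from + n - 1); }

/* Phase 1: flush any dirty slots in the batch, so writes are contiguous. */
static void kill_batch(Slot *slots, int n)
{
    for (int i = 0; i < n; i++)
        if (slots[i].dirty)
            write_page(&slots[i]);
}

/* Phase 2: refill the whole batch with one large read. */
static void refill_batch(Slot *slots, int n, long next_page)
{
    read_pages(next_page, n);
    for (int i = 0; i < n; i++)
        slots[i] = (Slot){ next_page + i, false };
}

int main(void)
{
    Slot ring[BATCH] = { {0, true}, {1, false}, {2, true}, {3, false} };
    kill_batch(ring, BATCH);        /* grouped writes ... */
    refill_batch(ring, BATCH, 4);   /* ... then one grouped read */
    return 0;
}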
Zeugswetter Andreas ADI SD wrote: >> In reference to the seq scans roadmap, I have just submitted >> a patch that addresses some of the concerns. >> >> The patch does this: >> >> 1. for small relation (smaller than 60% of bufferpool), use >> the current logic 2. for big relation: >> - use a ring buffer in heap scan >> - pin first 12 pages when scan starts >> - on consumption of every 4-page, read and pin the next 4-page >> - invalidate used pages of in the scan so they do not >> force out other useful pages > > A few comments regarding the effects: > > I do not see how this speedup could be caused by readahead, so what are > the effects ? I was wondering that as well. We'd really need to test all the changes separately to see where the improvements are really coming from. Also, that patch doesn't address the VACUUM issue at all. And using a small fixed size ring with scans that do updates can be devastating. I'm experimenting with different ring sizes for COPY at the moment. Too small ring leads to a lot of WAL flushes, it's basically the same problem we have with VACUUM in CVS HEAD. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Also, that patch doesn't address the VACUUM issue at all. And > using a small fixed size ring with scans that do updates can > be devastating. I'm experimenting with different ring sizes > for COPY at the moment. Too small ring leads to a lot of WAL > flushes, it's basically the same problem we have with VACUUM > in CVS HEAD. My first take on that would be to simply abandon any dirty (and actually also any still pinned) buffer from the ring and replace the ring slot with a buffer from the freelist. If the freelist is empty and LSN allows writing the buffer, write it (and maybe try to group these). If the LSN does not allow the write, replace the slot with a buffer from LRU. Andreas
Zeugswetter Andreas ADI SD wrote:
>> Also, that patch doesn't address the VACUUM issue at all. And
>> using a small fixed size ring with scans that do updates can
>> be devastating. I'm experimenting with different ring sizes
>> for COPY at the moment. Too small ring leads to a lot of WAL
>> flushes, it's basically the same problem we have with VACUUM
>> in CVS HEAD.
>
> My first take on that would be to simply abandon any dirty (and actually
> also any still pinned) buffer from the ring and replace the ring slot
> with a buffer from the freelist.
> If the freelist is empty and LSN allows writing the buffer, write it
> (and maybe try to group these).
> If the LSN does not allow the write, replace the slot with a buffer from
> LRU.

That would effectively disable the ring for COPY and the 2nd phase of VACUUM.

One problem with looking at the LSN is that you need the content lock to read it, and I wouldn't want to add any new locking. It could be done inside FlushBuffer when we hold the lock anyway, but I'm afraid the changes would be pretty invasive.

I'm struggling to get a grip on what the optimal ring size is under various circumstances. Some thoughts I have this far:
- a small ring gives better L2 cache behavior
- for read-only queries, and for queries that just set hint bits, 1 buffer is enough
- a small ring with a query that writes WAL (COPY, mass updates, 2nd phase of VACUUM) leads to a lot of WAL flushes, which can become the bottleneck

But all these assumptions need to be validated. I'm setting up tests with different ring sizes and queries to get a clear picture of this:
- VACUUM on a clean table
- VACUUM on a table with 1 dead tuple per page
- read-only scan, large table
- read-only scan, table fits in OS cache
- COPY

In addition, I'm going to run VACUUM in a DBT-2 test to see the effect on other queries running concurrently.

I think a ring that grows when WAL flushes occur covers all the use cases reasonably well, but I need to do the testing...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> But all these assumptions need to be validated. I'm setting up tests
> with different ring sizes and queries to get a clear picture of this:
> - VACUUM on a clean table
> - VACUUM on a table with 1 dead tuple per page
> - read-only scan, large table
> - read-only scan, table fits in OS cache
> - COPY

Just to keep you guys informed, here's my results on a read-only scan on a table bigger than shared_buffers but smaller than RAM:

select-1    | 00:00:10.853831
select-1    | 00:00:10.380667
select-1    | 00:00:11.530528
select-2    | 00:00:08.634105
select-2    | 00:00:02.674084
select-4    | 00:00:02.65664
select-8    | 00:00:02.662922
select-16   | 00:00:02.682475
select-32   | 00:00:02.693163
select-64   | 00:00:02.722031
select-128  | 00:00:02.873645
select-256  | 00:00:03.185586
select-512  | 00:00:03.534285
select-1024 | 00:00:03.741867

The lshw utility tells me that this server has 32KB of L1 cache and 4MB of L2 cache. The performance starts to drop between 64-128 buffers, which is 512 - 1024 KB, so I'm not sure how it's related to cache size, but using a small number of buffers is clearly better than using a large number.

However, it caught me by total surprise that the performance with 1 buffer is so horrible. Using 2 buffers is enough to avoid whatever the issue is with just 1 buffer. I have no idea what's causing that. There must be some interaction that I don't understand. All the numbers are quite repeatable, I ran the same test script many times. The runtime of the first select-2 test however varied between 3-10 seconds; somehow the bad karma from using just 1 buffer in the earlier test carries over to the next test.

I'm not sure what to think about this, but I'll set up more test scenarios with VACUUM and COPY.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > However, it caught me by total surprise that the performance with 1 > buffer is so horrible. Using 2 buffers is enough to avoid whatever the > issue is with just 1 buffer. I have no idea what's causing that. There > must be some interaction that I don't understand. Ok, I found the reason for that. I was using this query for the selects: SELECT COUNT(*) FROM (SELECT 1 FROM stock_copytest LIMIT 10000000) AS a; Stock_copytest is larger than RAM size, that's why I used the LIMIT to make the result set memory resident. That had the side effect that apparently the limit-node kept the single buffer pinned which defeated the buffer ring completely. To avoid issues like that we apparently want to use 2-4 buffers instead of just 1. I'll review my test methodology and keep testing... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
The patch has no effect on scans that do updates. The KillAndReadBuffer routine does not force out a buffer if the dirty bit is set. So updated pages revert to the current performance characteristics.

-cktan
GreenPlum, Inc.
Sorry, 16x8K page ring is too small indeed. The reason we selected 16 is because greenplum db runs on 32K page size, so we are indeed reading 128K at a time. The #pages in the ring should be made relative to the page size, so you achieve 128K per read.

Also agree that KillAndReadBuffer could be split into a KillPinDontRead(), and ReadThesePinnedPages() functions. However, we are thinking of AIO and would rather see a ReadNPagesAsync() function.

-cktan
Greenplum, Inc.
> Sorry, 16x8K page ring is too small indeed. The reason we > selected 16 is because greenplum db runs on 32K page size, so > we are indeed reading 128K at a time. The #pages in the ring > should be made relative to the page size, so you achieve 128K > per read. Ah, ok. New disks here also have a peak at 128k with no other concurrent IO. Writes benefit from larger blocksizes though, 512k and more. Reads with other concurrent IO might also benefit from larger blocksizes. Comment to all: to test optimal blocksizes make sure you have other concurrent IO on the disk. > Also agree that KillAndReadBuffer could be split into a > KillPinDontRead(), and ReadThesePinnedPages() functions. > However, we are thinking of AIO and would rather see a > ReadNPagesAsync() function. Yes, you could start the aio and return an already read buffer to allow concurrent cpu work. However, you would still want to do blocked aio_readv calls to make sure the physical read uses the large blocksize. So I'd say aio would benefit from the same split. In another posting you wrote: > The patch has no effect on scans that do updates. > The KillAndReadBuffer routine does not force out a buffer if > the dirty bit is set. So updated pages revert to the current > performance characteristics. Yes I see, the ring slot is replaced by a standard ReadBuffer in that case, looks good. I still think it would be better to write out the buffers and keep them in the ring when possible, but that seems to need locks and some sort of synchronization with the new walwriter, so looks like a nice project for after 8.3. Andreas
I wrote:
> I'll review my test methodology and keep testing...

I ran a set of tests on a 100 warehouse TPC-C stock table that is ~3.2 GB in size; the server has 4 GB of memory. IOW the table fits in the OS cache, but not in shared_buffers (set at 1 GB).

copy            - COPY from a file
select          - SELECT COUNT(*) FROM stock
vacuum          - VACUUM on a clean table, effectively a read-only operation
vacuum_hintbits - VACUUM on a table with no dead tuples, but hint bits need to be set on every page
vacuum_dirty    - VACUUM with exactly 1 dead tuple per page

The number after the test name is the ring size used. There were no indexes on the table, which means that the vacuum tests only had to do one pass. The 1st vacuum phase of a real-world table is like a mixture of the vacuum- and vacuum_hintbits-tests, and the 2nd phase is like the vacuum_dirty test.

copy-1    | 00:31:47.042365
copy-2    | 00:17:57.630772
copy-4    | 00:17:55.041794
copy-8    | 00:08:31.014009
copy-16   | 00:05:38.39848
copy-32   | 00:05:52.295512
copy-64   | 00:06:08.404646
copy-128  | 00:05:05.032448
copy-256  | 00:05:48.573146
copy-512  | 00:04:56.098752
copy-1024 | 00:05:27.05316

select-4    | 00:00:04.344873
select-4    | 00:00:02.2498
select-1    | 00:00:08.754011
select-1    | 00:00:10.521174
select-1    | 00:00:10.819376
select-1    | 00:00:14.818831
select-1    | 00:00:14.893562
select-1    | 00:00:16.973934
select-2    | 00:00:15.722776
select-2    | 00:00:02.291078
select-2    | 00:00:02.230167
select-4    | 00:00:02.232935
select-8    | 00:00:02.238791
select-16   | 00:00:02.245566
select-32   | 00:00:02.267158
select-64   | 00:00:02.311878
select-128  | 00:00:02.487086
select-256  | 00:00:02.764085
select-512  | 00:00:03.161025
select-1024 | 00:00:03.387246

vacuum-1    | 00:00:01.843337
vacuum-2    | 00:00:01.612738
vacuum-4    | 00:00:01.6304
vacuum-8    | 00:00:01.655126
vacuum-16   | 00:00:01.641808
vacuum-32   | 00:00:01.664108
vacuum-64   | 00:00:01.729106
vacuum-128  | 00:00:01.879023
vacuum-256  | 00:00:02.218303
vacuum-512  | 00:00:02.569571
vacuum-1024 | 00:00:02.791995

vacuum_dirty-1    | 00:24:15.424337
vacuum_dirty-2    | 00:13:26.981835
vacuum_dirty-4    | 00:08:07.260113
vacuum_dirty-8    | 00:05:24.1476
vacuum_dirty-16   | 00:03:52.690336
vacuum_dirty-32   | 00:02:40.759203
vacuum_dirty-64   | 00:02:45.14425
vacuum_dirty-128  | 00:02:46.718922
vacuum_dirty-256  | 00:02:43.797785
vacuum_dirty-512  | 00:02:36.363763
vacuum_dirty-1024 | 00:02:32.767481

vacuum_hintbits-1    | 00:00:37.847935
vacuum_hintbits-2    | 00:00:38.788662
vacuum_hintbits-4    | 00:00:43.554029
vacuum_hintbits-8    | 00:00:42.040379
vacuum_hintbits-16   | 00:00:44.187508
vacuum_hintbits-32   | 00:00:38.252052
vacuum_hintbits-64   | 00:00:37.920379
vacuum_hintbits-128  | 00:00:38.463007
vacuum_hintbits-256  | 00:00:38.157724
vacuum_hintbits-512  | 00:00:38.309285
vacuum_hintbits-1024 | 00:00:39.178738

I ran some of the select tests multiple times because the behavior changed when the test was repeated. I don't know what's going on in the select-1 test; it looks like the same effect I had with the more complex query involving a LIMIT node, but this time I'm just doing a plain SELECT COUNT(*). I ran the test script multiple times; the results shown above are copy-pasted from one particular run but the numbers didn't change much from run to run. In particular, the run times for the select-1 test really do increase as you repeat the test many times. The copy results seem to vary quite a bit, though.
For comparison, here's the test results with vanilla CVS HEAD: copy-head | 00:06:21.533137 copy-head | 00:05:54.141285 select-head | 00:00:16.213693 select-head | 00:00:18.500792 vacuum-head | 00:00:12.843479 vacuum-head | 00:00:08.719845 vacuum_dirty-head | 00:22:02.533553 vacuum_dirty-head | 00:22:02.852786 vacuum_hintbits-head | 00:00:38.278701 vacuum_hintbits-head | 00:00:35.226191 Looking at the results, it seems that using a fixed sized ring of 32 pages hits the sweet spot on all tests. I wonder if that holds on other hardware. The test scripts I used are attached. I used a modified DBT-2 schema and dump file, so you'll need to replace that with some other large table to run it. I would appreciate it if others would repeat the tests on other hardware to get a bigger sample. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com /* drop table if exists stock100; create table stock100 ( s_i_id integer , s_w_id smallint , s_quantity smallint , s_order_cnt smallint -- not listed as a monetary value , s_remote_cnt smallint -- not listed as a monetary value , s_ytd integer -- not listed as a monetary value , s_dist_01 char(24) , s_dist_02 char(24) , s_dist_03 char(24) , s_dist_04 char(24) , s_dist_05 char(24) , s_dist_06 char(24) , s_dist_07 char(24) , s_dist_08 char(24) , s_dist_09 char(24) , s_dist_10 char(24) , s_data text -- varchar(50) ); drop table if exists testresult; CREATE TABLE testresult ( description text NOT NULL, begints timestamp DEFAULT (now()) NOT NULL, endts timestamp); */ --- /* TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('copy-1'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; --- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 2; INSERT INTO testresult (description) VALUES ('copy-2'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; --- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 4; INSERT INTO testresult (description) VALUES ('copy-4'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; --- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 8; INSERT INTO testresult (description) VALUES ('copy-8'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; --- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 16; INSERT INTO testresult (description) VALUES ('copy-16'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 32; INSERT INTO testresult (description) VALUES ('copy-32'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 64; INSERT INTO testresult (description) VALUES ('copy-64'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 128; INSERT INTO testresult (description) VALUES ('copy-128'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 256; 
INSERT INTO testresult (description) VALUES ('copy-256'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 512; INSERT INTO testresult (description) VALUES ('copy-512'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- TRUNCATE stock100; CHECKPOINT; SET scan_recycle_buffers = 1024; INSERT INTO testresult (description) VALUES ('copy-1024'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; */ ---- /* SELECT COUNT(*) FROM stock100; -- set hint bits CHECKPOINT; SET scan_recycle_buffers = 4; INSERT INTO testresult (description) VALUES ('select-4'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 4; INSERT INTO testresult (description) VALUES ('select-4'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('select-1'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 2; INSERT INTO testresult (description) VALUES ('select-2'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 2; INSERT INTO testresult (description) VALUES ('select-2'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; SET scan_recycle_buffers = 2; INSERT INTO testresult (description) VALUES ('select-2'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 4; INSERT INTO testresult (description) VALUES ('select-4'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 8; INSERT INTO testresult (description) VALUES ('select-8'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 16; INSERT INTO testresult (description) VALUES ('select-16'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 32; INSERT INTO testresult (description) VALUES ('select-32'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET 
endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 64; INSERT INTO testresult (description) VALUES ('select-64'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 128; INSERT INTO testresult (description) VALUES ('select-128'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 256; INSERT INTO testresult (description) VALUES ('select-256'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 512; INSERT INTO testresult (description) VALUES ('select-512'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 1024; INSERT INTO testresult (description) VALUES ('select-1024'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; */ ---- /* ------- VACUUM tests ------- CHECKPOINT; SET scan_recycle_buffers = 1; INSERT INTO testresult (description) VALUES ('vacuum-1'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 2; INSERT INTO testresult (description) VALUES ('vacuum-2'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 4; INSERT INTO testresult (description) VALUES ('vacuum-4'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 8; INSERT INTO testresult (description) VALUES ('vacuum-8'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 16; INSERT INTO testresult (description) VALUES ('vacuum-16'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 32; INSERT INTO testresult (description) VALUES ('vacuum-32'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 64; INSERT INTO testresult (description) VALUES ('vacuum-64'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 128; INSERT INTO testresult (description) VALUES ('vacuum-128'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 256; INSERT INTO testresult (description) VALUES ('vacuum-256'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 512; INSERT INTO testresult (description) VALUES ('vacuum-512'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- CHECKPOINT; SET scan_recycle_buffers = 1024; INSERT INTO testresult (description) VALUES ('vacuum-1024'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 1; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-1'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM 
stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 2; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-2'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 4; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-4'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 8; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-8'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 16; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-16'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 32; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-32'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 64; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-64'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 128; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-128'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 256; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-256'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 512; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-512'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; SET scan_recycle_buffers = 1024; CHECKPOINT; INSERT INTO testresult (description) VALUES 
('vacuum_dirty-1024'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; */ SET scan_recycle_buffers = 1024; DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 1; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-1'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 2; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-2'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 4; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-4'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 8; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-8'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 16; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-16'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 32; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-32'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 64; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-64'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 128; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-128'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 256; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-256'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers = 512; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-512'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- SET scan_recycle_buffers = 1024; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; SET scan_recycle_buffers 
= 1024; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-1024'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; SELECT description, endts-begints FROM testresult;/* drop table if exists stock100; create table stock100 ( s_i_id integer , s_w_id smallint , s_quantity smallint , s_order_cnt smallint -- not listed as a monetary value , s_remote_cnt smallint -- not listed as a monetary value , s_ytd integer -- not listed as a monetary value , s_dist_01 char(24) , s_dist_02 char(24) , s_dist_03 char(24) , s_dist_04 char(24) , s_dist_05 char(24) , s_dist_06 char(24) , s_dist_07 char(24) , s_dist_08 char(24) , s_dist_09 char(24) , s_dist_10 char(24) , s_data text -- varchar(50) ); -- drop table if exists testresult; CREATE TABLE testresult ( description text NOT NULL, begints timestamp DEFAULT (now()) NOT NULL, endts timestamp); --- TRUNCATE stock100; CHECKPOINT; INSERT INTO testresult (description) VALUES ('copy-head'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; --- TRUNCATE stock100; CHECKPOINT; INSERT INTO testresult (description) VALUES ('copy-head'); COPY stock100 FROM '/mnt/data/dbt2/test-w100/dbdata/stock.data'; UPDATE testresult SET endts = now() WHERE endts IS NULL; SELECT COUNT(*) FROM stock100; -- set hint bits CHECKPOINT; INSERT INTO testresult (description) VALUES ('select-head'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; INSERT INTO testresult (description) VALUES ('select-head'); SELECT COUNT(*) FROM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ------- VACUUM tests ------- CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum-head'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum-head'); VACUUM stock100; UPDATE testresult SET endts = now() WHERE endts IS NULL; ---- DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-head'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)' ; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_dirty-head'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; */ DROP TABLE IF EXISTS stock100_copy; SELECT * INTO stock100_copy FROM stock100; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-head'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; BEGIN; DELETE FROM stock100_copy WHERE textin(tidout(ctid)) LIKE '%,1)'; ROLLBACK; CHECKPOINT; INSERT INTO testresult (description) VALUES ('vacuum_hintbits-head'); VACUUM VERBOSE stock100_copy; UPDATE testresult SET endts = now() WHERE endts IS NULL; SELECT description, endts-begints FROM testresult;
On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote: > For comparison, here's the test results with vanilla CVS HEAD: > > copy-head | 00:06:21.533137 > copy-head | 00:05:54.141285 I'm slightly worried that the results for COPY aren't anywhere near as good as the SELECT and VACUUM results. It isn't clear from those numbers that the benefit really is significant. Are you thinking that having COPY avoid cache spoiling is a benefit just of itself? Or do you see a pattern of benefit from your other runs? (BTW what was wal_buffers set to? At least twice the ring buffer size, hopefully). -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Hi Simon, On 5/12/07 12:35 AM, "Simon Riggs" <simon@enterprisedb.com> wrote: > I'm slightly worried that the results for COPY aren't anywhere near as > good as the SELECT and VACUUM results. It isn't clear from those numbers > that the benefit really is significant. COPY is bottlenecked on datum formation and format translation with very low performance, so I don't think we should expect the ring buffer to make much of a dent. - Luke
Hi All, COPY/INSERT are also bottlenecked on record-at-a-time insertion into the heap, and on checking for pre-insert triggers, post-insert triggers and constraints. To speed things up, we really need to special-case insertions without triggers and constraints [probably still allowing unique constraints], and make these insertions go into the heap N tuples at a time. With this change comes the benefit of optimizing the REDO log to log multiple inserts, or even to log a whole new heap page that gets filled in a single WAL record. Relations with triggers and other constraints would still have to take rows one at a time because of the trigger/constraint semantics. It seems to me that dirty pages should be written out by the bgwriter instead of circumventing it with a ring buffer. If the bgwriter is too slow, we should fix the bgwriter. -cktan On May 12, 2007, at 8:42 AM, Luke Lonergan wrote: > Hi Simon, > > On 5/12/07 12:35 AM, "Simon Riggs" <simon@enterprisedb.com> wrote: > >> I'm slightly worried that the results for COPY aren't anywhere near as >> good as the SELECT and VACUUM results. It isn't clear from those numbers >> that the benefit really is significant. > > COPY is bottlenecked on datum formation and format translation with very low > performance, so I don't think we should expect the ring buffer to make much of a dent. > > - Luke >
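To make the proposal above concrete, here is an illustrative pseudo-C sketch of the kind of special case being suggested. heap_batch_insert() and has_row_constraints() are made-up names for the hypothetical fast path, and the fallback loop is the only part that uses existing functions; this is not code from any patch in this thread.

    #include "postgres.h"
    #include "access/heapam.h"
    #include "utils/rel.h"

    static void
    copy_flush_batch(Relation rel, HeapTuple *tuples, int ntuples, CommandId cid)
    {
        int     i;

        if (rel->trigdesc == NULL && !has_row_constraints(rel))
        {
            /* Hypothetical fast path: place as many tuples as fit onto each
             * heap page in one call, emitting one WAL record per group (or
             * per newly filled page) instead of one record per tuple. */
            heap_batch_insert(rel, tuples, ntuples, cid);
        }
        else
        {
            /* Existing behavior: one tuple and one WAL record at a time, so
             * that per-row triggers and constraints fire in order. */
            for (i = 0; i < ntuples; i++)
                heap_insert(rel, tuples[i], cid, true, true);
        }
    }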
"CK Tan" <cktan@greenplum.com> writes: > COPY/INSERT are also bottlenecked on record at a time insertion into > heap, and in checking for pre-insert trigger, post-insert trigger and > constraints. > To speed things up, we really need to special case insertions without > triggers and constraints, [probably allow for unique constraints], Do you have any profiling data to back up these assertions? I haven't noticed that firing zero tuples takes any visible percentage of COPY time. regards, tom lane
Sorry, I should have been clearer. I meant that because triggers may fire before and after each insertion, and the trigger definitions expect tuples to be inserted one by one, we cannot insert N tuples at a time into the heap. Checking for triggers itself does not take up much CPU at all. If we could determine up front that there are no triggers on a relation, inserts into that relation could then follow a different path that inserts N tuples at a time. Regards, -cktan On May 13, 2007, at 4:54 PM, Tom Lane wrote: > "CK Tan" <cktan@greenplum.com> writes: >> COPY/INSERT are also bottlenecked on record-at-a-time insertion into >> the heap, and on checking for pre-insert triggers, post-insert triggers and >> constraints. > >> To speed things up, we really need to special-case insertions without >> triggers and constraints [probably still allowing unique constraints], > > Do you have any profiling data to back up these assertions? I haven't > noticed that firing zero triggers takes any visible percentage of COPY > time. > > regards, tom lane >
Simon Riggs wrote: > On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote: >> For comparison, here's the test results with vanilla CVS HEAD: >> >> copy-head | 00:06:21.533137 >> copy-head | 00:05:54.141285 > > I'm slightly worried that the results for COPY aren't anywhere near as > good as the SELECT and VACUUM results. It isn't clear from those numbers > that the benefit really is significant. Agreed, the benefit isn't clear. > Are you thinking that having COPY avoid cache spoiling is a benefit just > of itself? Or do you see a pattern of benefit from your other runs? I think it's worth having just to avoid cache spoiling. I wouldn't bother otherwise, but since we have the infrastructure for vacuum and large seqscans, we might as well use it for COPY as well. > (BTW what was wal_buffers set to? At least twice the ring buffer size, > hopefully). Good question. [checks]. wal_buffers was set to 128KB. I tried raising it to 1MB, but it didn't make any difference. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
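(For scale: a ring of 32 buffers of 8 kB each is 256 kB, so the 128 kB wal_buffers setting was half the ring's footprint, and the 1 MB setting four times it.)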
Just to keep you guys informed, I've been busy testing and pondering over different buffer ring strategies for vacuum, seqscans and copy. Here's what I'm going to do:

Use a fixed-size ring. Fixed as in it doesn't change after the ring is initialized; however, different kinds of scans use differently sized rings. I said earlier that it'd be an invasive change to see if a buffer needs a WAL flush and choose another victim if that's the case. I looked at it again and found a pretty clean way of doing that, so I took that approach for seq scans.

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to give the L2 cache benefits and keep cache pollution low, but at the same time it's large enough that it keeps the need to WAL flush reasonable (1/32 of what we do now).

2. For sequential scans, also use a ring of 32 buffers, but whenever a buffer in the ring would need a WAL flush to recycle, we throw it out of the buffer ring instead. On read-only scans (and scans that only update hint bits) this gives the L2 cache benefits and doesn't pollute the buffer cache. On bulk updates, it's effectively the current behavior. On scans that do some updates, it's something in between. In all cases it should be no worse than what we have now. 32 buffers should be large enough to leave a "cache trail" for Jeff's synchronized scans to work.

3. For COPY that doesn't write WAL, use the same strategy as for sequential scans. This keeps the cache pollution low and gives the L2 cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We want a ring that can accommodate one WAL segment's worth of data, to avoid having to do any extra WAL flushes, and the WAL segment size is 2048 pages in the default configuration.

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them in another fixed-size list. When the list gets full, do a WAL flush and put them on the shared freelist or a backend-private freelist. That would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs, and it could be used for vacuum as well. I think this would be the optimal algorithm, but I don't feel like inventing something that complicated at this stage anymore. Maybe for 8.4.

* Using a differently sized ring for the 1st and 2nd vacuum phases. I decided that it's not worth the trouble; the above is already an order of magnitude better than the current behavior.

I'm going to rerun the performance tests I ran earlier with the new patch, tidy it up a bit, and submit it in the next few days. This turned out to be an even more laborious patch to review than I thought. While the patch is short and in the end turned out to be very close to Simon's original patch, there are many different usage scenarios that need to be catered for and tested.

I still need to check the interaction with Jeff's patch. This is close enough to Simon's original patch that I believe the results of the tests Jeff ran earlier are still valid.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
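To make the seq scan behavior in point 2 above concrete, here is a rough pseudo-C sketch of victim selection inside such a ring. It is illustrative only: BufferRing, ring_get_victim() and BufferNeedsWALFlush() are invented names and this is not the actual patch code.

    #include "postgres.h"
    #include "storage/buf.h"

    typedef struct BufferRing
    {
        int     nbuffers;           /* ring size, e.g. 32 for VACUUM and seq scans */
        int     current;            /* next slot to recycle */
        bool    evict_on_wal_flush; /* seq scan behavior: drop buffer instead of flushing */
        Buffer  buffers[1];         /* variable-length array of ring members */
    } BufferRing;

    static Buffer
    ring_get_victim(BufferRing *ring)
    {
        int     slot = ring->current;
        Buffer  buf = ring->buffers[slot];

        /* advance to the next slot for the following call */
        ring->current = (slot + 1) % ring->nbuffers;

        /* BufferNeedsWALFlush() is a made-up name for "recycling this buffer
         * would force an XLogFlush, because it is dirty (not just hint bits)
         * and its LSN is past the current WAL flush point". */
        if (buf != InvalidBuffer &&
            ring->evict_on_wal_flush &&
            BufferNeedsWALFlush(buf))
        {
            /* Seq scan case: throw the dirty buffer out of the ring and let it
             * age out via the normal clock sweep / bgwriter.  The caller takes
             * a victim from the shared pool instead and stores it in this slot,
             * so the scan never waits for a WAL flush. */
            ring->buffers[slot] = InvalidBuffer;
            return InvalidBuffer;
        }

        /* VACUUM (and the clean-buffer case): reuse the buffer; if it does
         * need a WAL flush, that now happens at most once per ring cycle,
         * i.e. roughly 1/32nd as often as flushing for every dirty page. */
        return buf;
    }

The intent of the evict-instead-of-flush branch is exactly what the mail above describes: a read-mostly scan keeps working inside an L2-cache-sized ring, while a bulk update effectively falls back to today's behavior rather than stalling on WAL flushes.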
Heikki, 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache effect. How about using 256/blocksize? - Luke > -----Original Message----- > From: Heikki Linnakangas [mailto:hlinnaka@gmail.com] On > Behalf Of Heikki Linnakangas > Sent: Tuesday, May 15, 2007 2:32 AM > To: PostgreSQL-development > Cc: Simon Riggs; Zeugswetter Andreas ADI SD; CK.Tan; Luke > Lonergan; Jeff Davis > Subject: Re: [HACKERS] Seq scans roadmap > > Just to keep you guys informed, I've been busy testing and > pondering over different buffer ring strategies for vacuum, > seqscans and copy. > Here's what I'm going to do: > > Use a fixed size ring. Fixed as in doesn't change after the > ring is initialized, however different kinds of scans use > differently sized rings. > > I said earlier that it'd be invasive change to see if a > buffer needs a WAL flush and choose another victim if that's > the case. I looked at it again and found a pretty clean way > of doing that, so I took that approach for seq scans. > > 1. For VACUUM, use a ring of 32 buffers. 32 buffers is small > enough to give the L2 cache benefits and keep cache pollution > low, but at the same time it's large enough that it keeps the > need to WAL flush reasonable > (1/32 of what we do now). > > 2. For sequential scans, also use a ring of 32 buffers, but > whenever a buffer in the ring would need a WAL flush to > recycle, we throw it out of the buffer ring instead. On > read-only scans (and scans that only update hint bit) this > gives the L2 cache benefits and doesn't pollute the buffer > cache. On bulk updates, it's effectively the current > behavior. On scans that do some updates, it's something in > between. In all cases it should be no worse than what we have > now. 32 buffers should be large enough to leave a "cache > trail" for Jeff's synchronized scans to work. > > 3. For COPY that doesn't write WAL, use the same strategy as > for sequential scans. This keeps the cache pollution low and > gives the L2 cache benefits. > > 4. For COPY that writes WAL, use a large ring of 2048-4096 > buffers. We want to use a ring that can accommodate 1 WAL > segment worth of data, to avoid having to do any extra WAL > flushes, and the WAL segment size is > 2048 pages in the default configuration. > > Some alternatives I considered but rejected: > > * Instead of throwing away dirtied buffers in seq scans, > accumulate them in another fixed sized list. When the list > gets full, do a WAL flush and put them to the shared freelist > or a backend-private freelist. That would eliminate the cache > pollution of bulk DELETEs and bulk UPDATEs, and it could be > used for vacuum as well. I think this would be the optimal > algorithm but I don't feel like inventing something that > complicated at this stage anymore. Maybe for 8.4. > > * Using a different sized ring for 1st and 2nd vacuum phase. > Decided that it's not worth the trouble, the above is already > an order of magnitude better than the current behavior. > > > I'm going to rerun the performance tests I ran earlier with > new patch, tidy it up a bit, and submit it in the next few > days. This turned out to be even more laborious patch to > review than I thought. While the patch is short and in the > end turned out to be very close to Simon's original patch, > there's many different usage scenarios that need to be > catered for and tested. > > I still need to check the interaction with Jeff's patch. 
This > is close enough to Simon's original patch that I believe the > results of the tests Jeff ran earlier are still valid. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > >
Luke Lonergan wrote: > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache > effect. > > How about using 256/blocksize? Sounds reasonable. We need to check the effect on the synchronized scans, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
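(For reference, 256/blocksize keeps the ring at 256 kB regardless of block size: 32 buffers with the default 8 kB blocks, 16 buffers with 16 kB blocks, and 8 buffers with 32 kB blocks.)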
On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote: > Luke Lonergan wrote: > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache effect. > > > > How about using 256/blocksize? > > Sounds reasonable. We need to check the effect on the synchronized scans, though. I am a little worried that there will be greater differences in position as the number of scans increases. If we have only 8 buffers and several scans progressing, will they all be able to stay within a few buffers of each other at any given time? Also, with 8 buffers, that means each scan must report every 4 pages at most (and maybe every page), which increases lock contention (the new design Heikki and I discussed requires a lock every time a backend reports its position). Regards, Jeff Davis
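To illustrate the reporting cost Jeff is describing, here is a sketch of a per-backend position report. SyncScanLock and sync_scan_table_set() are invented names, not the actual synchronized scans patch; the point is only how the ring size drives the reporting interval and hence the lock traffic.

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/lwlock.h"

    #define REPORT_INTERVAL 16      /* how often to publish our position; this
                                     * has to shrink as the ring shrinks */

    static void
    ss_maybe_report(Oid relid, BlockNumber blkno)
    {
        /* Report only every REPORT_INTERVAL pages.  With only 8 buffers in
         * the ring, Jeff notes this would have to be every 4 pages or even
         * every page, so every scanning backend hits the lock far more often. */
        if (blkno % REPORT_INTERVAL != 0)
            return;

        LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
        sync_scan_table_set(relid, blkno);  /* update the shared "last position" hint */
        LWLockRelease(SyncScanLock);
    }

The smaller the ring, the tighter the reporting interval must be to keep concurrent scans within reach of each other, which is exactly where the contention concern comes from.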
On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote: > On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote: > > Luke Lonergan wrote: > > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache > > > effect. > > > > > > How about using 256/blocksize? > > > > Sounds reasonable. We need to check the effect on the synchronized > > scans, though. > > > > I am a little worried that there will be greater differences in position > as the number of scans increase. If we have only 8 buffers and several > scans progressing, will they all be able to stay within a few buffers of > eachother at any given time? > > Also, with 8 buffers, that means each scan must report every 4 pages at > most (and maybe every page), which increases lock contention (the new > design Heikki and I discussed requires a lock every time a backend > reports its position). Given that spoiling the L2 cache is a trivial cost compared to extra physical IO, ISTM we should go with a largish ring for sync scans. What do you think would be the ideal size? 32 buffers? -- Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jeff Davis wrote: > On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote: >> Luke Lonergan wrote: >>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache >>> effect. >>> >>> How about using 256/blocksize? >> Sounds reasonable. We need to check the effect on the synchronized >> scans, though. >> > > I am a little worried that there will be greater differences in position > as the number of scans increase. If we have only 8 buffers and several > scans progressing, will they all be able to stay within a few buffers of > eachother at any given time? I'm not sure. Needs testing. Assuming the scan leaves behind a cache trail in the OS cache as well, it might not be that bad if a scan joining the party starts a little bit behind. > Also, with 8 buffers, that means each scan must report every 4 pages at > most (and maybe every page), which increases lock contention (the new > design Heikki and I discussed requires a lock every time a backend > reports its position). Keep in mind that processing a 32K page takes longer than processing an 8K page. But we'll see.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> > > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 > > > > cache effect. I'd say in a scenario where 32k pages are indicated you will also want larger than average L2 caches. > > > > > > > > How about using 256/blocksize? The reading ahead uses 1/4 ring size. To the best of our knowledge, this 1/4 needs to be 128k for reading. So I'd say we need 512/blocksize. Andreas
I think the analysis on syncscan needs to take the external I/O cache into account. I believe it is not necessary or desirable to keep the scans in lock step within the PG bufcache. The main benefit of a sync scan will be the ability to start the scan where other scans have already filled the I/O cache with useful blocks. This will require some knowledge of the size of the I/O cache by the syncscan logic, but that's where the largest amount of I/O cache memory (by far) is located. - Luke On 5/15/07 3:34 PM, "Jim C. Nasby" <decibel@decibel.org> wrote: > On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote: >> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote: >>> Luke Lonergan wrote: >>>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache >>>> effect. >>>> >>>> How about using 256/blocksize? >>> >>> Sounds reasonable. We need to check the effect on the synchronized >>> scans, though. >>> >> >> I am a little worried that there will be greater differences in position >> as the number of scans increase. If we have only 8 buffers and several >> scans progressing, will they all be able to stay within a few buffers of >> eachother at any given time? >> >> Also, with 8 buffers, that means each scan must report every 4 pages at >> most (and maybe every page), which increases lock contention (the new >> design Heikki and I discussed requires a lock every time a backend >> reports its position). > > Given that spoiling the L2 cache is a trivial cost compared to extra > physical IO, ISTM we should go with a largish ring for sync scans. What > do you think would be the ideal size? 32 buffers?
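A sketch of the kind of decision Luke is describing, purely for illustration: io_cache_blocks is a hypothetical knob standing in for "how much of this table the OS is likely still caching" (it might be derived from effective_cache_size or a new GUC), and neither patch currently consults the OS cache size like this.

    #include "postgres.h"
    #include "storage/block.h"

    static BlockNumber
    choose_scan_start(BlockNumber reported_pos, BlockNumber nblocks,
                      BlockNumber io_cache_blocks)
    {
        BlockNumber lookback;

        if (nblocks < 2)
            return 0;

        lookback = Min(io_cache_blocks, nblocks);

        /* Start up to 'lookback' blocks behind the position reported by the
         * other scans, wrapping past block 0, so that the first part of our
         * scan is served from the OS I/O cache rather than from disk. */
        return (reported_pos + nblocks - lookback) % nblocks;
    }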
On Wed, 2007-05-16 at 10:31 -0700, Luke Lonergan wrote: > I think the analysis on syncscan needs to take the external I/O cache into account. I believe it is not necessary or desirable to keep the scans in lock step within the PG bufcache. I partially agree. I don't think we need any huge amount of shared buffers for the scans to use, and the scans should be able to catch up by using the OS buffer cache (which is still faster than fetching from disk). However, as I said before, I'm worried that, if the ring is too small, it might not work to my expectations. I will try to test this to eliminate my doubts. > The main benefit of a sync scan will be the ability to start the scan where other scans have already filled the I/O cache with useful blocks. This will require some knowledge of the size of the I/O cache by the syncscan logic, but that's where the largest amount of I/O cache memory (by far) is located. I don't think it's necessarily the largest "by far". However, it may be the largest. If you mean that the main benefit of sync scans is to make use of blocks that happen to be in cache before the scan began, I disagree. I think that's a possible benefit, but I was unable to show any huge benefit in my tests (someone may be able to on different hardware with different test cases). The main benefits that I see are: (1) reduce total number of blocks read from disk by making use of blocks as they are read by another concurrent seqscan. (2) eliminate the need for random I/O on concurrent sequential scans. I like the idea of using already-in-cache blocks. The code is there and it works, but I just didn't see the benefits yet. After I get any issues with this patch resolved and the reviewers are satisfied, I'd like to work on it. Regards, Jeff Davis
Hi Jeff, On 5/16/07 4:56 PM, "Jeff Davis" <pgsql@j-davis.com> wrote: >> The main benefit of a sync scan will be the ability to start the scan where >> other scans have already filled the I/O cache with useful blocks. This will >> require some knowledge of the size of the I/O cache by the syncscan logic, >> but that's where the largest amount of I/O cache memory (by far) is located. > I don't think it's necessarily the largest "by far". However, it may be > the largest. Compare the size of a 32 block ring buffer (between 256KB and 1024KB) and 16,000,000 KB of RAM on a common server, automatically used to maximum extent by the OS dynamic I/O caching. > If you mean that the main benefit of sync scans is to make use of blocks > that happen to be in cache before the scan began, I disagree. That's not what I meant. > I think > that's a possible benefit, but I was unable to show any huge benefit in > my tests (someone may be able to on different hardware with different > test cases). I agree, I don't think this is worth pursuing. > The main benefits that I see are: > (1) reduce total number of blocks read from disk by making use of > blocks as they are read by another concurrent seqscan. > (2) eliminate the need for random I/O on concurrent sequential scans. Yes on (1), but with (2), again, the OS prefetch reduces the seeking to a minimal level. With (1), we just have to define the meaning of "concurrent" to be "within the span of the OS I/O cache" and we're describing the same effect. - Luke