Thread: Does larger i/o size make sense?
Hello,

A few days ago, I got the question in the subject line during a discussion with a colleague.

In general, a larger i/o size per system call gives us wider bandwidth on sequential reads than multiple system calls with a smaller i/o size. This heuristic is probably well known.

On the other hand, PostgreSQL always reads database files in BLCKSZ (usually 8KB) units when a referenced block is not in shared buffers, and it doesn't seem to me that this can pull the maximum performance out of a modern storage system.

I'm not certain whether this kind of idea has been discussed before, so I'd like to hear why we stick to a fixed-length i/o size, and whether similar ideas were rejected in the past.

An idea I'd like to investigate: PostgreSQL allocates a set of contiguous buffers to fit a larger i/o size when a block is referenced by a sequential scan, then issues a consolidated i/o request on those buffers. It probably makes sense if we can expect upcoming block references to land on neighboring blocks, which is the typical sequential read workload.

Of course, we would need to solve some complicated problems, such as preventing fragmentation of shared buffers and extending the internal storage manager APIs to accept a larger i/o size. Even so, it seems to me this idea is worth investigating.

Any comments please. Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
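To picture the bandwidth argument, here is a minimal standalone C sketch (not PostgreSQL code): it reads the same 1MB range once as a single consolidated request and once as 128 separate BLCKSZ-sized requests. The file name, range, and block count are illustrative assumptions only.

    /*
     * Illustrative comparison: one large pread() vs. many BLCKSZ-sized
     * pread() calls over the same range of the file given on the command
     * line.  Sizes and counts are assumptions for the sketch.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ  8192
    #define NBLOCKS 128             /* 128 x 8KB = 1MB per consolidated request */

    int
    main(int argc, char **argv)
    {
        char   *buf;
        int     fd, i;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        buf = malloc((size_t) BLCKSZ * NBLOCKS);

        /* consolidated request: one system call covering 128 neighboring blocks */
        if (pread(fd, buf, (size_t) BLCKSZ * NBLOCKS, 0) < 0)
            perror("large pread");

        /* current behavior: one system call per BLCKSZ-sized block */
        for (i = 0; i < NBLOCKS; i++)
            if (pread(fd, buf + (size_t) i * BLCKSZ, BLCKSZ, (off_t) i * BLCKSZ) < 0)
                perror("small pread");

        free(buf);
        close(fd);
        return 0;
    }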
On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> Hello,
>
> A few days ago, I got the question in the subject line during a discussion
> with a colleague.
>
> In general, a larger i/o size per system call gives us wider bandwidth on
> sequential reads than multiple system calls with a smaller i/o size. This
> heuristic is probably well known.
>
> On the other hand, PostgreSQL always reads database files in BLCKSZ
> (usually 8KB) units when a referenced block is not in shared buffers, and
> it doesn't seem to me that this can pull the maximum performance out of a
> modern storage system.
>
> I'm not certain whether this kind of idea has been discussed before, so
> I'd like to hear why we stick to a fixed-length i/o size, and whether
> similar ideas were rejected in the past.
>
> An idea I'd like to investigate: PostgreSQL allocates a set of contiguous
> buffers to fit a larger i/o size when a block is referenced by a
> sequential scan, then issues a consolidated i/o request on those buffers.
> It probably makes sense if we can expect upcoming block references to land
> on neighboring blocks, which is the typical sequential read workload.
>
> Of course, we would need to solve some complicated problems, such as
> preventing fragmentation of shared buffers and extending the internal
> storage manager APIs to accept a larger i/o size. Even so, it seems to me
> this idea is worth investigating.
>
> Any comments please. Thanks,

Isn't this dealt with at least in part by effective_io_concurrency and o/s readahead?

merlin
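For context, the prefetching behind effective_io_concurrency is, as far as I know, built on posix_fadvise() hints on platforms that provide them: the backend tells the kernel which range it will want before it actually calls read(). A rough sketch of such a hint, with illustrative names and sizes:

    /*
     * Illustrative prefetch hint: ask the kernel to start reading one
     * BLCKSZ-sized block ahead of the actual read.  Function name and
     * constants are assumptions for the sketch, not PostgreSQL internals.
     */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    #define BLCKSZ 8192

    static void
    prefetch_block(int fd, long blocknum)
    {
    #ifdef POSIX_FADV_WILLNEED
        (void) posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    #endif
    }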
Merlin Moncure <mmoncure@gmail.com> writes:
> On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>> An idea I'd like to investigate: PostgreSQL allocates a set of contiguous
>> buffers to fit a larger i/o size when a block is referenced by a
>> sequential scan, then issues a consolidated i/o request on those buffers.

> Isn't this dealt with at least in part by effective_io_concurrency and
> o/s readahead?

I should think so. It's very difficult to predict future block-access requirements for anything except a seqscan, and for that, we expect the OS will detect the access pattern and start reading ahead on its own.

Another point here is that you could get some of the hoped-for benefit just by increasing BLCKSZ ... but nobody's ever demonstrated any compelling benefit from larger BLCKSZ (except on specialized workloads, if memory serves).

The big-picture problem with work in this area is that no matter how you do it, any benefit is likely to be both platform- and workload-specific. So the prospects for getting a patch accepted aren't all that bright.

			regards, tom lane
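For reference, BLCKSZ is fixed at build time (configure's --with-blocksize option, given in kilobytes), so trying a larger block size means a recompile plus a fresh initdb, since data directories are not compatible across block sizes. The constant ends up in the configure-generated header along these lines (default shown; the exact header location may vary by version):

    /* configure-generated; 8192 is the default, e.g. 32768 for a
     * --with-blocksize=32 build */
    #define BLCKSZ 8192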
2013/8/23 Tom Lane <tgl@sss.pgh.pa.us>:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>>> An idea I'd like to investigate: PostgreSQL allocates a set of contiguous
>>> buffers to fit a larger i/o size when a block is referenced by a
>>> sequential scan, then issues a consolidated i/o request on those buffers.
>
>> Isn't this dealt with at least in part by effective_io_concurrency and
>> o/s readahead?
>
> I should think so. It's very difficult to predict future block-access
> requirements for anything except a seqscan, and for that, we expect the
> OS will detect the access pattern and start reading ahead on its own.
>
> Another point here is that you could get some of the hoped-for benefit
> just by increasing BLCKSZ ... but nobody's ever demonstrated any
> compelling benefit from larger BLCKSZ (except on specialized workloads,
> if memory serves).
>
> The big-picture problem with work in this area is that no matter how you
> do it, any benefit is likely to be both platform- and workload-specific.
> So the prospects for getting a patch accepted aren't all that bright.
>
Hmm. I may have overlooked the effect of readahead at the operating system level. Indeed, a sequential scan is exactly the kind of workload that triggers it easily, so a smaller i/o size at the application level will be hidden by it.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
> The big-picture problem with work in this area is that no matter how you
> do it, any benefit is likely to be both platform- and workload-specific.
> So the prospects for getting a patch accepted aren't all that bright.

Indeed.

Would it make sense to have something easier to configure than recompiling postgresql and managing a custom executable, say a block size that could be configured from initdb and/or postgresql.conf, or maybe per-object settings specified at creation time?

Note that the block size may also affect cache behavior: for instance, under purely random access, more "recently accessed" tuples can be kept in memory if the pages are smaller. So there are reasons other than I/O access times to play with the block size, and an option to do that more easily would help.

--
Fabien.
2013/8/23 Fabien COELHO <coelho@cri.ensmp.fr>:
>
>> The big-picture problem with work in this area is that no matter how you
>> do it, any benefit is likely to be both platform- and workload-specific.
>> So the prospects for getting a patch accepted aren't all that bright.
>
> Indeed.
>
> Would it make sense to have something easier to configure than recompiling
> postgresql and managing a custom executable, say a block size that could be
> configured from initdb and/or postgresql.conf, or maybe per-object settings
> specified at creation time?
>
I love the idea of a per-object block size setting according to the expected workload, perhaps configured by the DBA. When we have to run a sequential scan on a large table, a larger block size is probably less painful than being interrupted at every 8KB boundary to switch the block currently being scanned, even though random access via an index scan favors a smaller block size.

> Note that the block size may also affect cache behavior: for instance,
> under purely random access, more "recently accessed" tuples can be kept in
> memory if the pages are smaller. So there are reasons other than I/O access
> times to play with the block size, and an option to do that more easily
> would help.
>
I see. A uniform block size would keep the implementation simple, with no need to worry about a scenario in which contiguous buffer allocation pushes out pages that ought to be kept in memory.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
>> Would it make sense to have something easier to configure than recompiling
>> postgresql and managing a custom executable, say a block size that could be
>> configured from initdb and/or postgresql.conf, or maybe per-object settings
>> specified at creation time?
>>
> I love the idea of a per-object block size setting according to the
> expected workload;

My 0.02€: wait and see whether the idea gets some positive feedback from core people before investing any time in it... The per-object setting would be a lot of work. A per-initdb (so per-cluster) setting (block size, WAL size, ...) would be much easier to implement, but it affects the on-disk storage format.

> When we have to run a sequential scan on a large table, a larger block size
> is probably less painful than being interrupted at every 8KB boundary to
> switch the block currently being scanned, even though random access via an
> index scan favors a smaller block size.

Yep, as Tom noted, this is really workload-specific.

--
Fabien.
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Thu, Aug 22, 2013 at 8:53 PM, Kohei KaiGai <spandir="ltr"><<a href="mailto:kaigai@kaigai.gr.jp" target="_blank">kaigai@kaigai.gr.jp</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":652" style="overflow:hidden">An idea that I'd like to investigate is, PostgreSQL allocates a set of<br /> continuous buffers tofit larger i/o size when block is referenced due to<br /> sequential scan, then invokes consolidated i/o request on thebuffer.<br /> It probably make sense if we can expect upcoming block references<br /> shall be on the neighbor blocks;that is typical sequential read workload.</div></blockquote></div><br /></div><div class="gmail_extra">I think itmakes more sense to use scatter gather i/o or async i/o to read to regular sized buffers scattered around memory than torestrict the buffers to needing to be contiguous.<br /><br /></div><div class="gmail_extra">As others said, Postgres dependson the OS buffer cache to do readahead. The scenario where the above becomes interesting is if it's paired with amove to directio or other ways of skipping the buffer cache. Double caching is a huge waste and leads to lots of inefficiencies.<br /><br /></div><div class="gmail_extra">The blocking issue there is that Postgres doesn't understand muchabout the underlying hardware storage. If there were APIs to find out more about it from the kernel -- how much furtherbefore the end of the raid chunk, how much parallelism it has, how congested the i/o channel is, etc -- then Postgresmight be on par with the kernel and able to eliminate the double buffering inefficiency and might even be able todo better if it understands its own workload better.<br /><br /></div><div class="gmail_extra">If Postgres did that thenit would be necessary to be able to initiate i/o on multiple buffers in parallel. That can be done using scatter gatheri/o such as readv() and writev() but that would mean blocking on reading blocks that might not be needed until thefuture. Or it could be done using libaio to initiate i/o and return control as soon as the needed data is available whileother i/o is still pending.<br /><br /></div><div class="gmail_extra"><br /></div><div class="gmail_extra">-- <br />greg<br/></div></div>
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Another point here is that you could get some of the hoped-for
> benefit just by increasing BLCKSZ ... but nobody's ever
> demonstrated any compelling benefit from larger BLCKSZ (except on
> specialized workloads, if memory serves).

I think I've seen a handful of reports of performance differences with different BLCKSZ builds (perhaps not all on community lists). My recollection is that some people sifting through data in data warehouse environments see a performance benefit up to 32KB, but that tests of GiST index performance with different sizes showed better performance with smaller sizes, down to around 2KB.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin,

> I think I've seen a handful of reports of performance differences
> with different BLCKSZ builds (perhaps not all on community lists).
> My recollection is that some people sifting through data in data
> warehouse environments see a performance benefit up to 32KB, but
> that tests of GiST index performance with different sizes showed
> better performance with smaller sizes, down to around 2KB.

I believe that Greenplum currently uses 128K. There's a definite benefit for the DW use case.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 8/27/13 3:54 PM, Josh Berkus wrote:
> I believe that Greenplum currently uses 128K. There's a definite
> benefit for the DW use case.

Since Linux read-ahead can easily give big gains on fast storage, I normally set it to at least 4096 sectors = 2048KB. That's a lot bigger than even Greenplum's block size, and definitely necessary for reaching maximum storage speed. I don't think the block size change alone will necessarily duplicate the gains Greenplum gets on seq scans, though. They've done a lot more performance optimization on that part of the read path than just the larger block size.

As far as quantifying whether this is worth chasing, the most useful thing to do here is find some fast storage and profile the code with different block sizes at a large read-ahead. I wouldn't spend a minute on trying to come up with a more complicated management scheme until the potential gain is measured.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com