Thread: Re: Read/Write block sizes (Was: Caching by Postgres)
> Does that include increasing the size of read/write blocks? I've
> noticed that with a large enough table it takes a while to do a
> sequential scan, even if it's cached; I wonder if the fact that it
> takes a million read(2) calls to get through an 8G table is part of
> that.

The OS already does some of that via readahead and by coalescing and
throttling reads and writes, but it's not enough for these kinds of
workloads. Here is what I think will help:

* Support for a different block size per TABLESPACE without recompiling
the code (at least support for a different block size for the whole
database without recompiling).

* Support for WAL files bigger than 16MB WITHOUT recompiling the code.
It should be a tunable if you ask me (with checkpoint_segments at 256
you have too many 16MB files in the log directory). This will help OLTP
benchmarks more, since they then don't spend time rotating log files.

* Introduce a multiblock or extent tunable where you can define a
multiple of 8K (or of the block-size tunable) to read a bigger chunk
and store it in the buffer pool (maybe for writes too). Most devices
now support chunks of up to 1MB for reads and writes.

* There should be a way to preallocate files for TABLES in TABLESPACES;
otherwise writing multiple tables in the same filesystem ends up with
fragmented files, which causes poor "READS" from those files. (See the
sketch after this message.)

* With 64-bit builds the 1GB file-chunk limit is also moot. Maybe it
should be tunable too, say 100GB, without recompiling the code.

Why is recompiling bad? Most companies that support Postgres will
support their own binaries, and they won't want different versions of
the binaries for different block sizes, different WAL file sizes, and
so on. More functionality from the same set of binaries is more
desirable in enterprise environments.

Regards,
Jignesh
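A minimal sketch of the preallocation idea above, assuming a POSIX
system with posix_fallocate() available; the file name and size are
illustrative, and this is not how PostgreSQL actually creates its
segment files:

    /* Sketch: reserve a table segment's space up front so the
     * filesystem can hand out a large contiguous extent instead of
     * growing the file 8K at a time as rows arrive. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "16385.prealloc"; /* hypothetical */
        long long size = 1LL << 30;            /* reserve a full 1GB segment */

        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }

        int err = posix_fallocate(fd, 0, size);  /* returns an errno value */
        if (err != 0) { fprintf(stderr, "posix_fallocate: error %d\n", err); return 1; }

        close(fd);
        return 0;
    }

Whether the filesystem actually lays the reservation out contiguously
depends on the filesystem, which is part of why this is listed as a
wish rather than a sure win.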
J.K.Shah@Sun.COM (Jignesh Shah) writes:
> [list of proposed tunables snipped; quoted in full upthread]
>
> Why is recompiling bad? Most companies that support Postgres will
> support their own binaries, and they won't want different versions of
> the binaries for different block sizes, different WAL file sizes, and
> so on.

Every single one of these still begs the question of whether the
changes will have a *material* impact on performance.

What we have been finding, as RAID controllers get smarter, is that it
is getting increasingly futile to try to attach knobs to 'disk stuff;'
it is *way* more effective to add a few more spindles to an array than
it is to fiddle with which disks are to be allocated to what database
'objects.'

The suggested 'knobs' above are all going to add complexity, and it is
NOT evident that any of them will necessarily help. I could be wrong;
code contributions combined with actual benchmarking would be the real
proof of the merits of these ideas.

But it also suggests another question, namely...

Will these represent more worthwhile improvements to speed than working
on other optimizations that are on the TODO list?

If someone spends 100h working on one of these items, and gets a 2%
performance improvement, that's almost certain to be less desirable
than spending 50h on something else that gets a 4% improvement.

And we might discover that memory management improvements in Linux
2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
improvements "for free" behind our backs without *any* need to write
database code. :-)
--
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/postgresql.html
Wiener's Law of Libraries:
There are no answers, only cross references.
On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
>The OS already does some of that via readahead and by coalescing and
>throttling reads and writes.

Note that I specified the fully cached case--even with the workload in
RAM the system still has to process a heck of a lot of read calls.

>* Introduce a multiblock or extent tunable where you can define a
>multiple of 8K (or of the block-size tunable) to read a bigger chunk
>and store it in the buffer pool (maybe for writes too). Most devices
>now support chunks of up to 1MB for reads and writes.

Yeah. The problem with relying on OS readahead is that the OS doesn't
know whether you're doing a sequential scan or an index scan; if you
have the OS aggressively read ahead you'll kill your seek performance.
OTOH, if you don't do readahead you'll kill your sequential scan
performance. At the app level you know which makes sense for each
operation.

Mike Stone
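A minimal sketch of the kind of per-operation hint Mike is describing,
using posix_fadvise() on the file descriptor. The path and the
placement of the calls are illustrative assumptions, not how PostgreSQL
is actually wired up:

    /* Sketch: tell the kernel which access pattern to expect for one
     * file, per operation, instead of leaving it to guess. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("base/16384/16385", O_RDONLY);  /* hypothetical heap file */
        if (fd < 0) { perror("open"); return 1; }

        /* About to do a sequential scan: ask for aggressive readahead
         * over the whole file (len = 0 means "to end of file"). */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* ... read the file front to back ... */

        /* About to do a burst of index probes: the pattern is random,
         * so readahead would only waste bandwidth and seeks. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        /* ... pread() individual 8K blocks ... */

        close(fd);
        return 0;
    }

madvise(), mentioned downthread, provides the analogous hints for
memory-mapped regions.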
On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
>What we have been finding, as RAID controllers get smarter, is that it
>is getting increasingly futile to try to attach knobs to 'disk stuff;'
>it is *way* more effective to add a few more spindles to an array than
>it is to fiddle with which disks are to be allocated to what database
>'objects.'

That statement doesn't say anything about trying to maximize
performance to or from a disk array. Yes, controllers are getting
smarter--but they aren't omniscient. IME an I/O-bound sequential table
scan doesn't get data moving off the disk nearly as fast as, say, a dd
with a big ibs. Why? There are obviously a lot of factors at work, but
one of them is that the RAID controller can optimize "grab this meg" a
lot more than it can optimize "grab this 8k".

Mike Stone
On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> J.K.Shah@Sun.COM (Jignesh Shah) writes:
> > [list of proposed tunables snipped; quoted in full upthread]
>
> Every single one of these still begs the question of whether the
> changes will have a *material* impact on performance.

How many of these things are currently easy to change with a recompile?
I should be able to start testing some of these ideas in the near
future, if they only require minor code or configure changes.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      512-569-9461
Chris,

Unless I am wrong, you're making the assumption that the amount of time
spent and the ROI are known. Maybe those who've been down this path
know how to get that additional 2-4% in 30 minutes or less?

While each person's and business's performance gains (or not) will
vary, wouldn't spending 50-100h to gain 2-4% seem worth the investment
for a 24x7 operation over the course of a month?

I would assume that dbt2 with STP helps minimize the amount of hours
someone has to invest to determine performance gains with configurable
options?

Steve Poe

> If someone spends 100h working on one of these items, and gets a 2%
> performance improvement, that's almost certain to be less desirable
> than spending 50h on something else that gets a 4% improvement.
>
> And we might discover that memory management improvements in Linux
> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
> improvements "for free" behind our backs without *any* need to write
> database code. :-)
On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
> [...]
> Yeah. The problem with relying on OS readahead is that the OS doesn't
> know whether you're doing a sequential scan or an index scan; if you
> have the OS aggressively read ahead you'll kill your seek performance.
> OTOH, if you don't do readahead you'll kill your sequential scan
> performance. At the app level you know which makes sense for each
> operation.

This is why we have madvise() with MADV_RANDOM and MADV_SEQUENTIAL.

-jwb
Hi Jim,

| How many of these things are currently easy to change with a recompile?
| I should be able to start testing some of these ideas in the near
| future, if they only require minor code or configure changes.

The following:

* data file segment size of 1GB
* WAL file size of 16MB
* block size of 8K

are very easy to change with a recompile. A tunable would still be
greatly preferred, as it would allow one binary for different tunings.
(A sketch of where these constants live follows this message.)

* Multiblock read/write

is not available, but would greatly help in reducing the number of
system calls, which will only increase as the size of the database
increases if nothing is done about it.

* Pregrown files... maybe not important at this point, since TABLESPACE
can currently work around it a bit (you just need to create a different
filesystem for each tablespace).

But if you really think hardware & OS are the answer for all the small
things... then I think we should now start to look at how to make
Postgres multi-threaded or multi-processed for each connection. With
the influx of "dual-core" and "multi-core" CPUs being the fad, Postgres
could have the cutting edge if it somehow exploited multiple cores.

Somebody mentioned that adding a CPU to a Postgres workload halved the
average CPU usage... YEAH... PostgreSQL uses only 1 CPU per connection
(assuming 100% usage), so if you add another CPU it is idle anyway and
the system will report only 50% :-) But the important thing to measure
is whether the query time was cut down or not. (No flames--I am sure
you were talking about a multi-connection, multi-user environment :-) )
But my point is that this approach is then worth the ROI and the time
and effort spent to solve this problem.

I actually vote for a multi-threaded solution within each connection
while still maintaining a separate process for each connection. This
way the fundamental architecture of Postgres doesn't change, but a
multi-threaded connection can then start to exploit multiple cores
(maybe with tunables for the number of threads reading data files, who
knows). If somebody is interested in actually working on a design,
contact me and I will be interested in assisting with this work.

Regards,
Jignesh

Jim C. Nasby wrote:
> [full quote of upthread messages snipped]
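The three compile-time values listed above are preprocessor constants
in the source tree. A sketch of what they look like, with illustrative
alternate values in the comments; the exact file
(src/include/pg_config_manual.h in sources of this era) and the names
can vary between versions:

    /* Sketch of the compile-time knobs in question (illustrative; the
     * real definitions live in the PostgreSQL source tree, and names
     * and locations vary by version). */

    /* Size of a disk block and buffer-pool page; must be a power of 2. */
    #define BLCKSZ          8192                 /* e.g. 32768 for 32K pages */

    /* Blocks per table-file segment: 0x40000000 / BLCKSZ = 1GB segments. */
    #define RELSEG_SIZE     (0x40000000 / BLCKSZ)

    /* Size of each WAL segment file. */
    #define XLOG_SEG_SIZE   (16 * 1024 * 1024)   /* e.g. 64MB to cut file churn */

Changing any of these means rebuilding (and re-initdb'ing), which is
exactly the one-binary-for-all-tunings problem described upthread.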
Steve,

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with
> configurable options?

Actually, these I/O operation issues show up mainly with DW workloads,
so the STP isn't much use there. If I can ever get some of these
machines back from the build people, I'd like to start testing some
stuff.

One issue with testing this is that currently PostgreSQL doesn't
support block sizes above 128K. We've already done testing on that
(well, Mark has) and the performance gains aren't even worth the hassle
of remembering you're on a different block size (like, +4%).

What the Sun people have done with other DB systems is show that
substantial performance gains are possible on large databases (>100G)
using block sizes of 1MB. I believe that's possible (and that it
probably makes more of a difference on Solaris than on BSD), but we
can't test it without some hackery first.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
> One issue with testing this is that currently PostgreSQL doesn't
> support block sizes above 128K. We've already done testing on that
> (well, Mark has) and the performance gains aren't even worth the
> hassle of remembering you're on a different block size (like, +4%).

What size database was this on?

> What the Sun people have done with other DB systems is show that
> substantial performance gains are possible on large databases (>100G)
> using block sizes of 1MB. I believe that's possible (and that it
> probably makes more of a difference on Solaris than on BSD), but we
> can't test it without some hackery first.

We're running a 100+GB database, with long streams of 8KB reads and the
occasional _llseek(). I've been thinking about running with a larger
block size with the expectation that we'd see fewer system calls and a
bit more throughput.

read() calls are a very expensive way to get 8KB of memory (that we
know is already resident) during scans. One has to trap into the
kernel, do the usual process state accounting, find the block, copy the
memory to userspace, return from the kernel to user space reversing all
the process accounting, pick out the bytes one needs, and repeat all
over again. That's quite a few sacrificial cache lines for 8KB. Yeah,
sure, Linux syscalls are fast, but they aren't that fast, and other
operating systems (Windows and Solaris) have a bit more syscall
overhead.

Regarding large block sizes on Solaris: the Solaris folks can also use
large memory pages and avoid a lot of the TLB overhead from the VM
system. The various trapstat and cpustat commands can be quite
interesting to look at when running any large application on a Solaris
system.

It should be noted that having a large shared memory segment can be a
performance loser just from the standpoint of TLB thrashing. O(GB)
memory access patterns can take a huge performance hit in user space
with 4K pages compared to the kernel, which would be mapping the
"segmap" (in Solaris parlance) with 4MB pages.

Anyway, I guess my point is that the balance between kernel-managed and
postgresql-managed buffers isn't obvious at all.

-- Alan
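A crude way to see the per-call cost Alan describes is to stream the
same (already cached) file with two different read sizes and compare
the elapsed time; a sketch, with the file name purely illustrative:

    /* Sketch: stream a cached file once in 8K read() calls and once in
     * 1MB read() calls; the difference is mostly per-syscall overhead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void stream_file(const char *path, size_t bufsz)
    {
        char *buf = malloc(bufsz);
        int fd = open(path, O_RDONLY);
        if (fd < 0 || buf == NULL) { perror("open/malloc"); exit(1); }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, bufsz)) > 0)
            total += n;

        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%8zu-byte reads: %lld bytes in %.3f s\n", bufsz, total, secs);
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "bigtable.dat";  /* hypothetical */
        stream_file(path, 8 * 1024);      /* what an 8K-block seq scan does      */
        stream_file(path, 1024 * 1024);   /* what a 1MB multiblock read would do */
        return 0;
    }

Run it twice so both passes hit the cache; the remaining gap between
the two read sizes is the kernel-crossing cost being discussed.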
On Tue, 2005-08-23 at 19:31 -0700, Josh Berkus wrote:
> What the Sun people have done with other DB systems is show that
> substantial performance gains are possible on large databases (>100G)
> using block sizes of 1MB. I believe that's possible (and that it
> probably makes more of a difference on Solaris than on BSD), but we
> can't test it without some hackery first.

To get decent I/O you need 1MB fundamental units all the way down the
stack. You need a filesystem that can take a 1MB write well, you need
an I/O scheduler that will keep it together, and you need a storage
controller that can eat a 1MB request at once. Ideally you'd like an
architecture with a 1MB page (Itanium has this, and AMD64 Linux will
soon have this). The Lustre people have done some work in this area,
opening up the datapaths in the kernel so they can keep the hardware
really working. They even modified the QLogic SCSI/FC driver so it
supports such large transfers. Their work has shown that you can get a
significant performance boost on Linux just by thinking in terms of
larger transfers.

Unfortunately I'm really afraid that this conversation is about trees
when the forest is the problem. PostgreSQL doesn't even have an async
reader, which is the sort of thing that could double or triple its
performance. You're talking about block sizes and such, but the kinds
of improvements you can get there are in the tens of percents at most.

-jwb
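A minimal sketch of what an "async reader" buys: start the read of the
next chunk before processing the current one, so disk and CPU overlap.
This uses POSIX AIO (link with -lrt on Linux); the file name and chunk
size are illustrative, and it is not a proposed PostgreSQL patch:

    /* Sketch: double-buffered asynchronous reads with POSIX AIO. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)             /* 1MB transfer unit */

    static char bufs[2][CHUNK];

    int main(void)
    {
        int fd = open("bigtable.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = bufs[0];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;
        aio_read(&cb);                      /* prime the pipeline */

        off_t offset = 0;
        int cur = 0;

        for (;;)
        {
            /* Wait for the outstanding read to finish. */
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb);
            if (n <= 0)
                break;

            int done = cur;
            cur = 1 - cur;
            offset += n;

            /* Kick off the next read immediately... */
            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf = bufs[cur];
            cb.aio_nbytes = CHUNK;
            cb.aio_offset = offset;
            aio_read(&cb);

            /* ...and process the chunk we already have while it runs. */
            (void) bufs[done];              /* process n bytes here */
        }

        close(fd);
        return 0;
    }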
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > To get decent I/O you need 1MB fundamental units all the way down the > stack. It would also be a good idea to have an application that isn't likely to change a single bit in a 1MB range and then expect you to record that change. This pretty much lets Postgres out of the picture. regards, tom lane
> Unfortunately I'm really afraid that this conversation is about trees
> when the forest is the problem. PostgreSQL doesn't even have an async
> reader, which is the sort of thing that could double or triple its
> performance. You're talking about block sizes and such, but the kinds
> of improvements you can get there are in the tens of percents at most.

Not 100% sure, but I'm fairly certain we were seeing significant
performance degradation from too much _scheduled_ I/O activity, i.e.
too much work being submitted to the kernel, due to excessive
parallelism already!! The classic example of this is a seqscan being
interleaved with an index scan, so the disks end up doing nothing but
seek activity.

Out of all the stuff talked about on this thread so far, only tweaking
the block size (and the madvise() stuff) makes any real-world sense, as
it's the only thing talked about that increases the _work_per_seek_.

As for async IO, sure, you might think "oh, async IO would be so
cool!!" and I did, once, too. But then I sat down and _thought_ about
it, and decided, well, no, actually, there are _very_ few areas it
could actually help, and in most cases it just makes it easier to drive
your box into lseek()-induced IO collapse.

Don't forget that already in postgres you have a process per
connection, and all the processes take care of their own I/O.

Somebody mentioned having threaded backends too, but the only benefit
would be reduced memory footprint (a backend consumes 1-2MB of RAM,
which is almost enough to be a concern for largish systems with a lot
of backends), and personally I _know_ the complexities introduced
through threading are usually not worth it.

IMVVHO (naive experience) what is needed is a complete architecture
change (probably infeasible and only useful as a thought experiment),
where:

* a network I/O process deals with client connections
* a limited pool of worker processes deals with statements (perhaps
  related to the number of spindles somehow)

so that when a client issues a statement, the net-IO process simply
forwards the connection state to a worker process and says "deal with
this". (Clearly the state object needs to contain all user and
transaction state the connection is involved in; a sketch of the
descriptor hand-off follows this message.)

- Guy Thornley
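The hand-off step in that hypothetical architecture is usually done by
passing the client socket's file descriptor to a worker over a
Unix-domain socket. A sketch of the sending side, with the function
name and the one-byte framing invented for illustration:

    /* Sketch: pass an accepted client socket from a network-I/O process
     * to a worker process using SCM_RIGHTS over an AF_UNIX socket. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send client_fd across the already-connected Unix socket worker_fd.
     * Returns 0 on success, -1 on error.  (Hypothetical helper name.) */
    int hand_off_connection(int worker_fd, int client_fd)
    {
        char dummy = 'c';                     /* must carry >= 1 data byte */
        struct iovec iov = { &dummy, 1 };

        union {                               /* aligned ancillary buffer */
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctrl;
        memset(&ctrl, 0, sizeof(ctrl));

        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &client_fd, sizeof(int));

        return (sendmsg(worker_fd, &msg, 0) < 0) ? -1 : 0;
    }

The worker receives the descriptor with the matching recvmsg() call;
shipping the user and transaction state alongside it is the hard part
Guy alludes to.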
On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> As for async IO, sure, you might think "oh, async IO would be so
> cool!!" and I did, once, too. But then I sat down and _thought_ about
> it, and decided, well, no, actually, there are _very_ few areas it
> could actually help, and in most cases it just makes it easier to
> drive your box into lseek()-induced IO collapse.
>
> Don't forget that already in postgres you have a process per
> connection, and all the processes take care of their own I/O.

That's the problem. Instead you want 1 or 4 or 10 I/O slaves
coordinating the I/O of all the backends optimally. For instance, with
synchronous scanning.

-jwb
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote: >> Dont forget that already in postgres, you have a process per connection, and >> all the processes take care of their own I/O. > That's the problem. Instead you want 1 or 4 or 10 i/o slaves > coordinating the I/O of all the backends optimally. For instance, with > synchronous scanning. And why exactly are we going to do a better job of I/O scheduling than the OS itself can do? There's a fairly basic disconnect in viewpoint involved here. The old-school viewpoint (as embodied in Oracle and a few other DBMSes) is that the OS is too stupid to be worth anything, and the DB should bypass the OS to the greatest extent possible, doing its own caching, disk space layout, I/O scheduling, yadda yadda. That might have been defensible twenty-odd years ago when Oracle was designed. Postgres prefers to lay off to the OS anything that the OS can do well --- and that definitely includes caching and I/O scheduling. There are a whole lot of smart people working on those problems at the OS level. Maybe we could make marginal improvements on their results after spending a lot of effort reinventing the wheel ... but our time will be repaid much more if we work at levels that the OS cannot have knowledge of, such as join planning and data statistics. There are some things we could do to reduce the impedance between us and the OS --- for instance, the upthread criticism that a seqscan asks the OS for only 8K at a time is fair enough. But that doesn't translate to a conclusion that we should schedule the I/O instead of the OS. regards, tom lane
On Wed, 2005-08-24 at 01:56 -0400, Tom Lane wrote:
> And why exactly are we going to do a better job of I/O scheduling than
> the OS itself can do?
> ...
> There are some things we could do to reduce the impedance between us
> and the OS --- for instance, the upthread criticism that a seqscan
> asks the OS for only 8K at a time is fair enough. But that doesn't
> translate to a conclusion that we should schedule the I/O instead of
> the OS.

Synchronous scanning is a fairly huge and obvious win. If you have two
processes 180 degrees out of phase in a linear read, neither process is
going to get anywhere near the throughput it would get from a single
scan.

I think you're being deliberately obtuse with regard to file I/O and
the operating system. The OS isn't magical. It has to strike a balance
between a reasonable read latency and a reasonable throughput. As far
as the kernel is concerned, a busy postgresql server is
indistinguishable from 100 unrelated activities. All backends will be
served equally, even if in this case "equally" means "quite badly all
around."

An I/O slave process could be a big win in Postgres for many kinds of
reads. Instead of opening and reading files, the backends would connect
to the I/O slave and request that the file be read. If a scan of that
file were already underway, the new backends would be attached.
Otherwise a new scan would commence. In either case, the slave process
can issue (sometimes non-dependent) reads well ahead of the needs of
the backend. You may think the OS can do this for you, but it can't:
only postgres knows that it needs the whole file from beginning to end.
The OS can only guess.

Ask me sometime about my replacement for GNU sort. It uses the same
sorting algorithm, but it's an order of magnitude faster due to better
I/O strategy. Someday, in my infinite spare time, I hope to demonstrate
that kind of improvement with a patch to pg.

-jwb
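A toy model of the "attach to a scan already underway" idea: each new
sequential scan starts at the position the current pack of scanners has
reached, then wraps, so concurrent scans share one forward-moving read
stream. The names and the single-process framing are invented for
illustration; this is not how anything is implemented in PostgreSQL:

    /* Sketch: synchronized sequential scans via a shared scan position. */
    #include <stdio.h>

    #define NBLOCKS 1000000L          /* hypothetical table size in blocks */

    /* Imagine this living in shared memory, protected by a lock. */
    static long scan_position = 0;    /* block most recently read by anyone */

    /* A backend starting a seq scan joins wherever the pack is now. */
    static long scan_start_block(void)
    {
        return scan_position % NBLOCKS;
    }

    /* Backends publish their progress so later arrivals can join in. */
    static void scan_report_block(long blk)
    {
        scan_position = blk;
    }

    int main(void)
    {
        long start = scan_start_block();

        /* Visit every block once, starting at 'start' and wrapping. */
        for (long i = 0; i < NBLOCKS; i++)
        {
            long blk = (start + i) % NBLOCKS;
            /* read_block(blk) and process it here */
            scan_report_block(blk);
        }
        return 0;
    }

The payoff is that a scan started halfway through someone else's scan
reads blocks that are already streaming into cache, instead of forcing
the disk to alternate between two distant positions.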
> of effort reinventing the wheel ... but our time will be repaid much
> more if we work at levels that the OS cannot have knowledge of, such
> as join planning and data statistics.

Considering a global budget of man-hours, which is best?

1. Spend it reimplementing half of VFS in postgres, half of Windows in
postgres, half of FreeBSD in postgres, half of Solaris in Postgres,
only to discover you gain a meagre speed increase and a million and a
half bugs, or

2. Spend 5% of that time lowering the impedance between the OS and
Postgres, another 5% annoying kernel people and helping them tweak
stuff for database use, and the rest on useful features that give
useful speedups, like bitmap indexes, skip scans, and other features
that enhance power and usability?

If you're Oracle and have almost unlimited resources, maybe option 1.
But even Microsoft opted for option 2: they implemented ReadFileScatter
and WriteFileGather to lower the syscall overhead, and that's it. And
point 2 will benefit many other apps, whereas 1 would benefit only
postgres, and then only in certain cases.

I do believe there is something interesting to uncover with reiser4,
though (it definitely fits point 2).

I'm happy that the pg team chose point 2 and that new versions keep
coming with new features at an unbelievable rate these days. Do you
guys sleep?
This thread covers several performance ideas. First is the idea that
more parameters should be configurable. While this seems like a noble
goal, we try to make parameters auto-tuning, or, if users have to
configure a parameter, it should be useful for a significant number of
users.

In the commercial software world, if you can convince your boss that a
feature/knob is useful, it usually gets into the product.
Unfortunately, this leads to the golden doorknob on a shack, where some
features are out of sync with the rest of the product in terms of
usefulness and utility. With open source, if a feature cannot be
auto-tuned, or has significant overhead, the feature has to be
implemented and then proven to be a benefit.

In terms of adding async I/O, threading, and other things, it might
make sense to explore how these could be implemented in a way that fits
the above criteria.

---------------------------------------------------------------------------

Jignesh K. Shah wrote:
> [full message quoted upthread, snipped]
--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
spoe@sfnet.cc (Steve Poe) writes:
> Chris,
>
> Unless I am wrong, you're making the assumption that the amount of
> time spent and the ROI are known. Maybe those who've been down this
> path know how to get that additional 2-4% in 30 minutes or less?
>
> While each person's and business's performance gains (or not) will
> vary, wouldn't spending 50-100h to gain 2-4% seem worth the
> investment for a 24x7 operation?

What we *do* know is that adding these "knobs" would involve a
significant amount of effort, as the values are widely used throughout
the database engine. Making them dynamic (e.g. so they could be tuned
on a tablespace-by-tablespace basis) would undoubtedly require rather a
lot of development effort. They are definitely NOT 30-minute changes.

Moreover, knowing how to adjust them is almost certainly also NOT a
30-minute configuration change; significant benchmarking effort for the
individual application is almost sure to be needed.

It's not much different from the reason why PostgreSQL doesn't use
threading...

The problem with using threading is that introducing it to the code
base would require a pretty enormous amount of effort (I'll bet
multiple person-years), and it wouldn't provide *any* benefit until you
get rather a long way down the road.

Everyone involved in development seems to me to have a reasonably keen
understanding of what the potential benefits of threading are; the
value is that there fall out plenty of opportunities to parallelize the
evaluation of portions of queries. Alas, it wouldn't be until *after*
all the effort goes in that we would get any idea as to what kinds of
speedups this would provide.

In effect, there has to be a year invested in *breaking* PostgreSQL
(because this would initially break a lot, since thread programming is
a really tough skill) during which you don't actually see any benefits.

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with
> configurable options?

That's going to help in constructing a "default" knob value. And if we
find an "optimal default," that encourages sticking with the current
approach of using #define to apply that value...

>> If someone spends 100h working on one of these items, and gets a 2%
>> performance improvement, that's almost certain to be less desirable
>> than spending 50h on something else that gets a 4% improvement.
>>
>> And we might discover that memory management improvements in Linux
>> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
>> improvements "for free" behind our backs without *any* need to write
>> database code. :-)
--
let name="cbbrowne" and tld="ntlug.org" in
String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/lisp.html
"For those of you who are into writing programs that are as obscure and
complicated as possible, there are opportunities for... real fun here"
-- Arthur Norman
Tom, Gavin,

> > To get decent I/O you need 1MB fundamental units all the way down
> > the stack.
>
> It would also be a good idea to have an application that isn't likely
> to change a single bit in a 1MB range and then expect you to record
> that change. This pretty much lets Postgres out of the picture.

We're looking at this pretty much just for data warehousing, where you
constantly have gigabytes of data which don't change from month to
month or even year to year. I agree that it would *not* be an
optimization for OLTP systems. Which is why a build-time option would
be fine.

> Ummm... I don't see anything here which will be a win for Postgres.
> The transactional semantics we're interested in are fairly complex:
>
> 1) Modifications to multiple objects can become visible to the system
> atomically
> 2) On error, a series of modifications which had been grouped
> together within a transaction can be rolled back
> 3) Using object version information, determine which version of which
> object is visible to a given session
> 4) Using version information and locking, detect and resolve
> read/write and write/write conflicts

I wasn't thinking of database transactions. I was thinking specifically
of using Reiser4 transactions (and other transactional filesystems) to
do things like eliminate the need for full-page writes in the WAL.
Filesystems are low-level things which should take care of low-level
needs, like making sure an 8K page got written to disk even in the
event of a system failure.

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Wed, Aug 24, 2005 at 12:12:22PM -0400, Chris Browne wrote:
> Everyone involved in development seems to me to have a reasonably
> keen understanding of what the potential benefits of threading are;
> the value is that there fall out plenty of opportunities to
> parallelize the evaluation of portions of queries. Alas, it wouldn't
> be until *after* all the effort goes in that we would get any idea as
> to what kinds of speedups this would provide.

My understanding is that the original suggestion was to use threads
within individual backends to allow for parallel query execution, not
switching to a completely thread-based model. In any case, there are
other ways to enable parallelism without using threads, such as handing
actual query execution off to a set of processes.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      512-569-9461
Jeff,

> Ask me sometime about my replacement for GNU sort. It uses the same
> sorting algorithm, but it's an order of magnitude faster due to
> better I/O strategy. Someday, in my infinite spare time, I hope to
> demonstrate that kind of improvement with a patch to pg.

Since we desperately need some improvements in sort performance, I do
hope you follow up on this.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
At 03:45 PM 8/25/2005, Josh Berkus wrote:
>Jeff,
>
>> Ask me sometime about my replacement for GNU sort. It uses the same
>> sorting algorithm, but it's an order of magnitude faster due to
>> better I/O strategy. Someday, in my infinite spare time, I hope to
>> demonstrate that kind of improvement with a patch to pg.
>
>Since we desperately need some improvements in sort performance, I do
>hope you follow up on this.

I'll generalize that. IMO we desperately need any and all improvements
in IO performance. Even more so than we need improvements in sorting or
sorting IO performance.

Ron
rjpeace@earthlink.net (Ron) writes:
> At 03:45 PM 8/25/2005, Josh Berkus wrote:
>> > Ask me sometime about my replacement for GNU sort. It uses the
>> > same sorting algorithm, but it's an order of magnitude faster due
>> > to better I/O strategy. Someday, in my infinite spare time, I
>> > hope to demonstrate that kind of improvement with a patch to pg.
>>
>> Since we desperately need some improvements in sort performance, I
>> do hope you follow up on this.
>
> I'll generalize that. IMO we desperately need any and all
> improvements in IO performance. Even more so than we need
> improvements in sorting or sorting IO performance.

That's frankly a step backwards. Feel free to "specialise" that
instead.

A patch that improves some specific aspect of performance is a thousand
times better than any sort of "desperate desire for any and all
improvements in I/O performance." The latter is unlikely to provide any
usable result.

The "specialized patch" is also pointedly better in that a *confidently
submitted* patch is likely to be way better than any sort of "desperate
clutching at whatever may come to hand."

Far too often, I see people trying to address performance problems via
the "desperate clutching at whatever seems near to hand," and that
generally turns out very badly as a particular result of the whole
"desperate clutching" part.

If you can get a sort improvement submitted, that's a concrete
improvement...
--
select 'cbbrowne' || '@' || 'ntlug.org';
http://www3.sympatico.ca/cbbrowne/lisp.html
Appendium to the Rules of the Evil Overlord #1: "I will not build
excessively integrated security-and-HVAC systems. They may be Really
Cool, but are far too vulnerable to breakdowns."
At 04:49 PM 8/25/2005, Chris Browne wrote:
>rjpeace@earthlink.net (Ron) writes:
>> I'll generalize that. IMO we desperately need any and all
>> improvements in IO performance. Even more so than we need
>> improvements in sorting or sorting IO performance.
>
>That's frankly a step backwards. Feel free to "specialise" that
>instead.

We can agree to disagree, I'm cool with that. I'm well aware that a
systems approach to SW architecture is not always popular in the open
source world. Nonetheless, my POV is that if we want to be taken
seriously and beat "the big boys", we have to do everything smarter and
faster, as well as cheaper, than they do. You are not likely to be able
to do that consistently without using some of the "icky" stuff one is
required to study as part of formal training in the Comp Sci and SW
Engineering fields.

>A patch that improves some specific aspect of performance is a
>thousand times better than any sort of "desperate desire for any and
>all improvements in I/O performance."

Minor twisting of my words: substituting "desire" for "need". The need
is provable. Just put "the big 5" (SQL Server, Oracle, DB2, MySQL, and
PostgreSQL) into some realistic benchmarks to see that.

Major twisting of my words: the apparent implication by you that I
don't appreciate improvements in the IO behavior of specific things
like sorting as much as I'd appreciate more "general" IO performance
improvements.

Performance optimization is best done as an iterative improvement
process that starts with measuring where the need is greatest, then
improving that greatest need by the most you can, then repeating the
whole cycle. _Every_ improvement in such a process is a specific
improvement, even if the improvement is a decision to re-architect the
entire product to solve the current biggest issue. Improving sorting IO
is cool. OTOH, if pg's biggest IO problems are elsewhere, then the
overall benefit we will get from improving sorting IO is going to be
limited until we improve the bigger problem(s). Amdahl's Law.

>The "specialized patch" is also pointedly better in that a
>*confidently submitted* patch is likely to be way better than any sort
>of "desperate clutching at whatever may come to hand."

Another distortion of my statement and POV. I never suggested nor
implied any sort of "desperate clutching...". We have _measurable_ IO
issues that need to be addressed in order for pg to be a better
competitor in the marketplace. Just as we do with sorting performance.

>Far too often, I see people trying to address performance problems via
>the "desperate clutching at whatever seems near to hand," and that
>generally turns out very badly as a particular result of the whole
>"desperate clutching" part.
>
>If you can get a sort improvement submitted, that's a concrete
>improvement...

As I said, I'm all in favor of concrete, measurable improvement. I do
not think I ever stated I was in favor of anything else.
You evidently are mildly ranting because you've seen some examples of
poor SW engineering discipline/practice by people with perhaps
inadequate skills for the issues they were trying to address. We all
have. "90% of everything is dreck (i.e. of too low a quality)."

OTOH, I do not think I've given you any reason to think I lack such
clue, nor do I think my post was advocating such thrashing. My post was
intended to say that we need an overall systems approach to pg
optimization rather than just applying what compiler writers call
"peephole optimizations" to pg. No more, no less. I apologize if I
somehow misled you.

Ron Peacetree