Thread: Re: [HACKERS] What I'm working on
> > I am working on a patch to:
> >
> >     remove oidname, oidint2, and oidint4
> >     allow the bootstrap code to create multi-key indexes
>
> Good man...always bugged me that the "old" hacked-in multikey
> indexes were there after Vadim let the user create them.
>
> But...returning to Insight as of Sept. 1st.  Once I get settled
> in, I should be able to stay late a couple of evenings and get my
> old patches up-to-date.

I have been thinking about the blocksize patch, and I now think it is
good we never installed it.  I think we need to enable rows to span more
than one block.  That is what commercial databases do, and I think this
is a much more general solution to the problem than increasing the block
size.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > > I am working on a patch to:
> > >
> > >     remove oidname, oidint2, and oidint4
> > >     allow the bootstrap code to create multi-key indexes
> >
> > Good man...always bugged me that the "old" hacked-in multikey
> > indexes were there after Vadim let the user create them.
> >
> > But...returning to Insight as of Sept. 1st.  Once I get settled
> > in, I should be able to stay late a couple of evenings and get my
> > old patches up-to-date.
>
> I have been thinking about the blocksize patch, and I now think it is
> good we never installed it.  I think we need to enable rows to span more
> than one block.  That is what commercial databases do, and I think this
> is a much more general solution to the problem than increasing the block
> size.

	Hrmmm...what does one gain over the other, though?  The way I saw
it (sorry Darren, don't mean to oversimplify it), making the blocksize
changeable was largely a matter of Darren making sure that all the
dependencies were covered throughout the code.  What is making a row span
multiple blocks going to give us?  Truly variable-length "blocksizes"?

	The blocksize patch allows you to stipulate a different blocksize
at database creation time...actually, thinking about it, I kinda see them
as two inter-related, yet different, functions.  If, for instance, I create
a table where the majority of tuples are larger than 8k, but smaller than
12k, so that most of the tuples, in your "vision", span two
blocks...wouldn't being able to increase the blocksize to 12k provide a
performance improvement?

	I'm just not sure I see either/or as being mutually exclusive.
The row spanning is great from the perspective that we didn't expect the
size of the tuples to be larger than 8k, while the increase of blocksize
is great from an optimizing perspective.  Even having vacuum (or
something similar) report that >50% of the records are >$currblocksize
might be cool...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> 	Hrmmm...what does one gain over the other, though?  The way I saw
> it (sorry Darren, don't mean to oversimplify it), making the blocksize
> changeable was largely a matter of Darren making sure that all the
> dependencies were covered throughout the code.  What is making a row span
> multiple blocks going to give us?  Truly variable-length "blocksizes"?
>
> 	The blocksize patch allows you to stipulate a different blocksize
> at database creation time...actually, thinking about it, I kinda see them
> as two inter-related, yet different, functions.  If, for instance, I create
> a table where the majority of tuples are larger than 8k, but smaller than
> 12k, so that most of the tuples, in your "vision", span two
> blocks...wouldn't being able to increase the blocksize to 12k provide a
> performance improvement?
>
> 	I'm just not sure I see either/or as being mutually exclusive.
> The row spanning is great from the perspective that we didn't expect the
> size of the tuples to be larger than 8k, while the increase of blocksize
> is great from an optimizing perspective.  Even having vacuum (or
> something similar) report that >50% of the records are >$currblocksize
> might be cool...

Most filesystem base block sizes are 8k.  Making anything larger is not
going to gain much.  I don't think we can support block sizes like 12k,
because the filesystem is going to sync stuff in 8k chunks.

Seems like we should do the most user-transparent thing and just allow
spanning rows.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> Most filesystem base block sizes are 8k.  Making anything larger is not
> going to gain much.  I don't think we can support block sizes like 12k,
> because the filesystem is going to sync stuff in 8k chunks.
>
> Seems like we should do the most user-transparent thing and just allow
> spanning rows.

	The blocksize patch wasn't a "user-land" feature, it's an admin-level
one...no?  The admin sets it at the createdb level...no?

	Again, I'm curious as to why either/or is mutually exclusive?

	Let's put it this way: from a performance perspective, which one
would provide more?  Again, I'm thinking of this from the admin angle, not
user.  I create a database whose tuples, in general, exceed 8k.  vacuum
kindly tells me this, so, to improve performance, I dump my databases, and
because this is a specialized application, it's on its own file system.
So, I reformat that drive with a larger blocksize, to match the blocksize
I'm about to set my database to (yes, I do something similar to this to
optimize file systems for news, so it isn't too hypothetical)...

	Bear in mind, I am not arguing for one of them, I'm arguing for
both of them...unless there is some architectural reason why both can't be
implemented at the same time...?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> On Sun, 23 Aug 1998, Bruce Momjian wrote:
>
> > Most filesystem base block sizes are 8k.  Making anything larger is not
> > going to gain much.  I don't think we can support block sizes like 12k,
> > because the filesystem is going to sync stuff in 8k chunks.
> >
> > Seems like we should do the most user-transparent thing and just allow
> > spanning rows.
>
> 	The blocksize patch wasn't a "user-land" feature, it's an admin-level
> one...no?  The admin sets it at the createdb level...no?

Yes, OK, admin, not user.

> 	Again, I'm curious as to why either/or is mutually exclusive?
>
> 	Let's put it this way: from a performance perspective, which one
> would provide more?  Again, I'm thinking of this from the admin angle, not
> user.  I create a database whose tuples, in general, exceed 8k.  vacuum
> kindly tells me this, so, to improve performance, I dump my databases, and
> because this is a specialized application, it's on its own file system.
> So, I reformat that drive with a larger blocksize, to match the blocksize
> I'm about to set my database to (yes, I do something similar to this to
> optimize file systems for news, so it isn't too hypothetical)...
>
> 	Bear in mind, I am not arguing for one of them, I'm arguing for
> both of them...unless there is some architectural reason why both can't be
> implemented at the same time...?

Yes, I guess you could have both.  I just think the normal user is going
to like the span stuff better, but you have a good point.  If we had one,
we could buy time getting the other.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> Yes, I guess you could have both.  I just think the normal user is going
> to like the span stuff better, but you have a good point.  If we had one,
> we could buy time getting the other.

	For whoever is implementing the row-span stuff, can something be
added that keeps track of the number of rows that are spanned?  ie. if most
of the rows are spanning blocks, then I would personally like to know that,
so that I can look at dumping and reloading the data with a database set to
a higher blocksize...

	There *has* to be some overhead, performance-wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> 	There *has* to be some overhead, performance-wise, in the database
> having to keep track of row-spanning, and being able to reduce that, IMHO,
> is what I see being able to change the blocksize as doing...

If both features were present, I would say to increase the blocksize of
the db to the max possible.  This would reduce the number of tuples that
are spanned.  Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.

But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise.  Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.

In summary, the capability to span would be the next resort after someone
has maxed out their blocksize.  Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.

I'd say make the blocksize a run-time variable and then do the spanning.

Darren
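To make Darren's arithmetic concrete, here is a minimal sketch in C of how many block reads a single tuple would need at different block sizes.  It assumes, purely for illustration, that a spanned tuple costs one block fetch per block it touches and that the whole block is usable for tuple data (real blocks lose some bytes to headers):

```c
#include <stdio.h>

/* Illustration only: blocks needed to hold one tuple, assuming every
 * byte of the block is usable and each spanned block costs one fetch. */
static unsigned blocks_needed(unsigned tuple_bytes, unsigned block_bytes)
{
    return (tuple_bytes + block_bytes - 1) / block_bytes;   /* ceiling division */
}

int main(void)
{
    unsigned tuples[] = {4000, 9000, 12000, 15000};
    unsigned i;

    for (i = 0; i < sizeof(tuples) / sizeof(tuples[0]); i++)
        printf("%5u-byte tuple: %u read(s) with 8k blocks, %u with 16k blocks\n",
               tuples[i],
               blocks_needed(tuples[i], 8192),
               blocks_needed(tuples[i], 16384));
    return 0;
}
```

With 8k blocks, every tuple in the 8k-16k range needs two fetches; with 16k blocks those same tuples need one, which is exactly the case Darren is describing.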
On Sun, 23 Aug 1998, Stupor Genius wrote:

> > 	There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see being able to change the blocksize as doing...
>
> If both features were present, I would say to increase the blocksize of
> the db to the max possible.  This would reduce the number of tuples that
> are spanned.  Each span would require another tuple fetch, so that could
> get expensive with each successive span or if every tuple spanned.
>
> But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> would get absolutely killed performance-wise.  Would make sense for them
> to go to 16k blocks where the reading of the extra bytes per block would
> be minimal, if anything, compared to the fetching/processing of the next
> span(s) to assemble the whole tuple.
>
> In summary, the capability to span would be the next resort after someone
> has maxed out their blocksize.  Each OS would have a different blocksize
> max...an AIX driver breaks when going past 16k...don't know about others.

	Oh...I like this :)  That would give us something that the "big
guys" don't have, also, no?  Bruce?

	Can someone clarify something for me?  If, for example, we have
the blocksize set to 16k, but the file system block size is 8k, would the
OS do both reads at the same time in order to get the full 16k?  I hope
someone can follow this through (unless I'm actually being clear), but if
we left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
do we send a request to the OS for the one block, then, once we get that
back, determine that we need the next and request that?

	Damn, not clear at all...if I'm thinking right, by increasing the
blocksize to 16k, postgres does one read request, while the OS does two.
If we don't, postgres does two read requests while the OS still does two.

	Does that make sense?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> On Sun, 23 Aug 1998, Bruce Momjian wrote:
>
> > Yes, I guess you could have both.  I just think the normal user is going
> > to like the span stuff better, but you have a good point.  If we had one,
> > we could buy time getting the other.
>
> 	For whoever is implementing the row-span stuff, can something be
> added that keeps track of the number of rows that are spanned?  ie. if most
> of the rows are spanning blocks, then I would personally like to know that,
> so that I can look at dumping and reloading the data with a database set to
> a higher blocksize...
>
> 	There *has* to be some overhead, performance-wise, in the database
> having to keep track of row-spanning, and being able to reduce that, IMHO,
> is what I see being able to change the blocksize as doing...

Makes sense, though vacuum would presumably make all the blocks
contiguous.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > On Sun, 23 Aug 1998, Bruce Momjian wrote:
> >
> > > Yes, I guess you could have both.  I just think the normal user is going
> > > to like the span stuff better, but you have a good point.  If we had one,
> > > we could buy time getting the other.
> >
> > 	For whoever is implementing the row-span stuff, can something be
> > added that keeps track of the number of rows that are spanned?  ie. if most
> > of the rows are spanning blocks, then I would personally like to know that,
> > so that I can look at dumping and reloading the data with a database set to
> > a higher blocksize...
> >
> > 	There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see being able to change the blocksize as doing...
>
> Makes sense, though vacuum would presumably make all the blocks
> contiguous.

	It's still going to involve two read requests from the postmaster
to the operating system for those two blocks...vs. one if the tuple doesn't
have to span two blocks...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> 	Oh...I like this :)  That would give us something that the "big
> guys" don't have, also, no?  Bruce?
>
> 	Can someone clarify something for me?  If, for example, we have
> the blocksize set to 16k, but the file system block size is 8k, would the
> OS do both reads at the same time in order to get the full 16k?  I hope
> someone can follow this through (unless I'm actually being clear), but if
> we left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
> do we send a request to the OS for the one block, then, once we get that
> back, determine that we need the next and request that?

The filesystem block size really controls how fine-grained the file block
allocation is.  It keeps 8k blocks as one contiguous chunk on the disk
(ignoring trailing file fragments, which are blocksize/8 in size).

How the OS does the disk requests is different.  It is related to the base
size of a disk block (usually 512 bytes), and to whether multiple requests
can be sent to the drive at the same time (tagged queueing?).  These are
really not related to the filesystem block size, except that larger block
sizes are made up of larger contiguous disk block groups.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
> > 	There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see being able to change the blocksize as doing...
>
> If both features were present, I would say to increase the blocksize of
> the db to the max possible.  This would reduce the number of tuples that
> are spanned.  Each span would require another tuple fetch, so that could
> get expensive with each successive span or if every tuple spanned.
>
> But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> would get absolutely killed performance-wise.  Would make sense for them
> to go to 16k blocks where the reading of the extra bytes per block would
> be minimal, if anything, compared to the fetching/processing of the next
> span(s) to assemble the whole tuple.
>
> In summary, the capability to span would be the next resort after someone
> has maxed out their blocksize.  Each OS would have a different blocksize
> max...an AIX driver breaks when going past 16k...don't know about others.
>
> I'd say make the blocksize a run-time variable and then do the spanning.

If we could query to find the file system block size at runtime in a
portable way, that would help us pick the best block size, no?

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
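One possible way to do such a query, sketched here only as an illustration and not as anything from the patch under discussion: POSIX statvfs() reports a filesystem's preferred I/O block size and its fragment size, though not every platform supports it, so real use would need a configure-time check (older BSD-style systems expose similar fields through statfs()).

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Illustrative only: report the filesystem's preferred I/O block size
 * and its fundamental (fragment) block size for a given path. */
int main(int argc, char **argv)
{
    struct statvfs vfs;
    const char *path = (argc > 1) ? argv[1] : ".";

    if (statvfs(path, &vfs) != 0)
    {
        perror("statvfs");
        return 1;
    }
    printf("%s: preferred block size = %lu, fragment size = %lu\n",
           path,
           (unsigned long) vfs.f_bsize,
           (unsigned long) vfs.f_frsize);
    return 0;
}
```

Whether the reported value should actually drive the database block size is a separate question, as Marc points out below, since a database can later be moved to a filesystem with a different block size.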
> 	It's still going to involve two read requests from the postmaster
> to the operating system for those two blocks...vs. one if the tuple doesn't
> have to span two blocks...

Yes, assuming it is not already in our buffer cache.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > > 	There *has* to be some overhead, performance-wise, in the database
> > > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > > is what I see being able to change the blocksize as doing...
> >
> > If both features were present, I would say to increase the blocksize of
> > the db to the max possible.  This would reduce the number of tuples that
> > are spanned.  Each span would require another tuple fetch, so that could
> > get expensive with each successive span or if every tuple spanned.
> >
> > But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> > would get absolutely killed performance-wise.  Would make sense for them
> > to go to 16k blocks where the reading of the extra bytes per block would
> > be minimal, if anything, compared to the fetching/processing of the next
> > span(s) to assemble the whole tuple.
> >
> > In summary, the capability to span would be the next resort after someone
> > has maxed out their blocksize.  Each OS would have a different blocksize
> > max...an AIX driver breaks when going past 16k...don't know about others.
> >
> > I'd say make the blocksize a run-time variable and then do the spanning.
>
> If we could query to find the file system block size at runtime in a
> portable way, that would help us pick the best block size, no?

	That doesn't sound too safe to me...what if I run out of disk space
on file system A (16k blocksize) and move one of the databases to file
system B (8k blocksize)?  If it auto-detects at run time, how is that going
to affect the tables?  Now my tuple size has just dropped to 8k, but the
tables were using 16k tuples...

	Setting this should, I think, be a conscious decision on the admin's
part, unless, of course, there is nothing in the tables themselves that is
"hard coded" at 8k tuples, and it's purely in the server?  If it is just in
the server, then this would be cool, because then I wouldn't have to
dump/reload if I moved to a better-tuned file system...just move the
files :)

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > 	Oh...I like this :)  That would give us something that the "big
> > guys" don't have, also, no?  Bruce?
> >
> > 	Can someone clarify something for me?  If, for example, we have
> > the blocksize set to 16k, but the file system block size is 8k, would the
> > OS do both reads at the same time in order to get the full 16k?  I hope
> > someone can follow this through (unless I'm actually being clear), but if
> > we left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
> > do we send a request to the OS for the one block, then, once we get that
> > back, determine that we need the next and request that?
>
> The filesystem block size really controls how fine-grained the file block
> allocation is.  It keeps 8k blocks as one contiguous chunk on the disk
> (ignoring trailing file fragments, which are blocksize/8 in size).
>
> How the OS does the disk requests is different.  It is related to the base
> size of a disk block (usually 512 bytes), and to whether multiple requests
> can be sent to the drive at the same time (tagged queueing?).  These are
> really not related to the filesystem block size, except that larger block
> sizes are made up of larger contiguous disk block groups.

	Okay...but what I was more trying to get at was that, ignoring the
operating system level right now, a 16k tuple that has to span two 8k
blocks is going to require:

	1 read for the first half
	processing to determine that a second half is required
	1 read for the second half

	A 16k tuple that fits in a single 16k block will require:

	1 read for the whole thing

	Considering all the streamlining that you've been working at, it
seems illogical to advocate a two-read system only, when we can have the
two-read system as a base solution, with a one-read system for those who
wish to reduce that overhead...no?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
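For what it's worth, here is a minimal sketch of the two access patterns Marc is tallying.  It assumes a hypothetical layout in which a spanned tuple's continuation simply sits in the next 8k block of the file; that is a simplification for illustration, not the actual heap format:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK8K   8192
#define BLK16K  16384

/* Spanning case: read one 8k block, inspect it, then issue a second
 * read for the continuation.  Two requests from the backend. */
ssize_t read_spanned(int fd, off_t blkno, char *buf)
{
    ssize_t n1, n2;

    if (lseek(fd, blkno * BLK8K, SEEK_SET) < 0)
        return -1;
    n1 = read(fd, buf, BLK8K);              /* first half */
    if (n1 < 0)
        return -1;
    /* ...backend examines the block, finds the tuple continues... */
    n2 = read(fd, buf + BLK8K, BLK8K);      /* second half, next block */
    return (n2 < 0) ? -1 : n1 + n2;
}

/* Larger-blocksize case: the same 16k of data in one request. */
ssize_t read_whole(int fd, off_t blkno, char *buf)
{
    if (lseek(fd, blkno * BLK16K, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, BLK16K);           /* one request */
}

int main(void)
{
    char buf[BLK16K];
    int fd = open("scratch.dat", O_RDONLY);   /* hypothetical data file */

    if (fd < 0)
        return 1;
    printf("spanned: %ld bytes\n", (long) read_spanned(fd, 0, buf));
    printf("whole:   %ld bytes\n", (long) read_whole(fd, 0, buf));
    close(fd);
    return 0;
}
```

Either way the OS ends up pulling two 8k filesystem blocks off the disk; the difference is the number of requests the backend has to issue and the decision step in between, which is Marc's point.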