Thread: Re: [HACKERS] What I'm working on

Re: [HACKERS] What I'm working on

From: Bruce Momjian
> > I am working on a patch to:
> >
> >     remove oidname, oidint2, and oidint4
> >     allow the bootstrap code to create multi-key indexes
>
> Good man...always bugged me that the "old" hacked-in multikey
> indexes were there after Vadim let the user create them.
>
> But...returning to Insight as of Sept. 1st.  Once I get settled
> in, I should be able to stay late a couple of evenings and get my
> old patches up-to-date.

I have been thinking about the blocksize patch, and I now think it is
good we never installed it.  I think we need to enable rows to span more
than one block.  That is what commercial databases do, and I think this
is a much more general solution to the problem than increasing the block
size.
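
To make the idea concrete, here is a minimal sketch of how a spanned
row might be chained across blocks.  Every name below is invented for
illustration; none of this is in the current tree:

/*
 * Hypothetical on-disk layout for a row that spans blocks.  If the
 * tuple does not fit in one block, "t_chain" points at the block and
 * line pointer holding the next chunk.
 */
#include <stdint.h>

typedef struct ItemPointer      /* block number plus line-pointer slot */
{
    uint32_t ip_blkid;
    uint16_t ip_posid;
} ItemPointer;

typedef struct SpannedTupleHeader
{
    uint32_t    t_len_total;    /* full tuple length across all blocks */
    uint32_t    t_len_here;     /* bytes of tuple data in this block   */
    ItemPointer t_chain;        /* next chunk, or (0,0) when complete  */
    /* tuple data follows */
} SpannedTupleHeader;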


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > > I am working on a patch to:
> > >
> > >     remove oidname, oidint2, and oidint4
> > >     allow the bootstrap code to create multi-key indexes
> >
> > Good man...always bugged me that the "old" hacked-in multikey
> > indexes were there after Vadim let the user create them.
> >
> > But...returning to Insight as of Sept. 1st.  Once I get settled
> > in, I should be able to stay late a couple of evenings and get my
> > old patches up-to-date.
>
> I have been thinking about the blocksize patch, and I now think it is
> good we never installed it.  I think we need to enable rows to span more
> than one block.  That is what commercial databases do, and I think this
> is a much more general solution to the problem than increasing the block
> size.

    Hrmmm...what does one gain over the other though?  The way I saw
it (sorry Darren, don't mean to oversimplify it), making the blocksize
changeable was largely a matter of Darren making sure that all the
dependencies were covered throughout the code.  What is making a row span
multiple blocks going to give us?  Truly variable-length "blocksizes"?

    The blocksize patch allows you to stipulate a different blocksize
at database creation time...actually, thinking about it, I kinda see them
as two inter-related, yet different, functions.  If, for instance, I create
a table where the majority of tuples are larger than 8k, but smaller than
12k, so that most of the tuples, in your "vision", span two
blocks...wouldn't being able to increase the blocksize to 12k provide a
performance improvement?

    I'm just not sure I see the two as mutually exclusive.
The 'row spanning' is great from the perspective that we didn't expect
tuples to be larger than 8k, while the increase of blocksize
is great from an optimizing perspective.  Even having vacuum (or
something similar) report that >50% of the records are >$currblocksize
might be cool...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] What I'm working on

From: Bruce Momjian
>
>     Hrmmm...what does one gain over the other though?  The way I saw
> it (sorry Darren, don't mean to oversimplify it), making the blocksize
> changeable was largely a matter of Darren making sure that all the
> dependencies were covered throughout the code.  What is making a row span
> multiple blocks going to give us?  Truly variable-length "blocksizes"?
>
>     The blocksize patch allows you to stipulate a different blocksize
> at database creation time...actually, thinking about it, I kinda see them
> as two inter-related, yet different, functions.  If, for instance, I create
> a table where the majority of tuples are larger than 8k, but smaller than
> 12k, so that most of the tuples, in your "vision", span two
> blocks...wouldn't being able to increase the blocksize to 12k provide a
> performance improvement?
>
>     I'm just not sure I see the two as mutually exclusive.
> The 'row spanning' is great from the perspective that we didn't expect
> tuples to be larger than 8k, while the increase of blocksize
> is great from an optimizing perspective.  Even having vacuum (or
> something similar) report that >50% of the records are >$currblocksize
> might be cool...

Most filesystem base block sizes are 8k.  Making anything larger is not
going to gain much.  I don't think we can support block sizes like 12k
because the filesystem is going to sync stuff in 8k chunks.

Seems like we should do the most user-transparent thing and just allow
spanning rows.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> Most filesystem base block sizes are 8k.  Making anything larger is not
> going to gain much.  I don't think we can support block sizes like 12k
> because the filesystem is going to sync stuff in 8k chunks.
>
> Seems like we should do the most user-transparent thing and just allow
> spanning rows.

    The blocksize patch wasn't a "user-land" feature, it's an
admin-level one...no?  The admin sets it at the createdb level...no?

    Again, I'm curious as to why the two would be mutually exclusive?

    Let's put it this way: from a performance perspective, which one
would provide more?  Again, I'm thinking of this from the admin angle, not
the user's.  I create a database whose tuples, in general, exceed 8k.  vacuum
kindly tells me this, so, to improve performance, I dump my databases, and
because this is a specialized application, it's on its own file system.
So, I reformat that drive with a larger blocksize, to match the blocksize
I'm about to set my database to (yes, I do something similar to this to
optimize file systems for news, so it isn't too hypothetical)...

    Bear in mind, I am not arguing for one of them, I'm arguing for
both of them...unless there is some architectural reason why both can't be
implemented at the same time...?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] What I'm working on

From: Bruce Momjian
> On Sun, 23 Aug 1998, Bruce Momjian wrote:
>
> > Most filesystem base block sizes are 8k.  Making anything larger is not
> > going to gain much.  I don't think we can support block sizes like 12k
> > because the filesystem is going to sync stuff in 8k chunks.
> >
> > Seems like we should do the most user-transparent thing and just allow
> > spanning rows.
>
>     The blocksize patch wasn't a "user-land" feature, it's an
> admin-level one...no?  The admin sets it at the createdb level...no?

Yes, OK, admin, not user.


>
>     Again, I'm curious as to why the two would be mutually exclusive?
>
>     Let's put it this way: from a performance perspective, which one
> would provide more?  Again, I'm thinking of this from the admin angle, not
> the user's.  I create a database whose tuples, in general, exceed 8k.  vacuum
> kindly tells me this, so, to improve performance, I dump my databases, and
> because this is a specialized application, it's on its own file system.
> So, I reformat that drive with a larger blocksize, to match the blocksize
> I'm about to set my database to (yes, I do something similar to this to
> optimize file systems for news, so it isn't too hypothetical)...
>
>     Bear in mind, I am not arguing for one of them, I'm arguing for
> both of them...unless there is some architectural reason why both can't be
> implemented at the same time...?

Yes, I guess you could have both.  I just think the normal user is going
to prefer the span stuff, but you have a good point.  If we had
one, we could buy time to get the other.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> Yes, I guess you could have both.  I just think the normal user is going
> to prefer the span stuff, but you have a good point.  If we had
> one, we could buy time to get the other.

    For whoever is implementing the row-span stuff, can something be
added that keeps track of the number of rows that are spanned?  i.e., if
most of the rows are spanning blocks, then I would personally like to know
that so that I can look at dumping and reloading the data with a database
set to a higher blocksize...

    There *has* to be some overhead, performance-wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see changing the blocksize as providing...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


RE: [HACKERS] What I'm working on

From: "Stupor Genius"
>     There *has* to be some overhead, performance-wise, in the database
> having to keep track of row-spanning, and being able to reduce that, IMHO,
> is what I see changing the blocksize as providing...

If both features were present, I would say to increase the blocksize of
the db to the max possible.  This would reduce the number of tuples that
are spanned.  Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.

But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise.  Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.
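
Back-of-the-envelope, the fetch count per tuple is just a ceiling
division; a small illustrative program (not from the tree):

#include <stdio.h>

/* One fetch per block a tuple occupies: ceil(tuple_size / block_size). */
static unsigned fetches_per_tuple(unsigned tuple_size, unsigned block_size)
{
    return (tuple_size + block_size - 1) / block_size;
}

int main(void)
{
    /* A 12k tuple: two fetches on 8k blocks, one fetch on 16k blocks. */
    printf("8k blocks:  %u\n", fetches_per_tuple(12 * 1024, 8 * 1024));
    printf("16k blocks: %u\n", fetches_per_tuple(12 * 1024, 16 * 1024));
    return 0;
}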

In summary, the capability to span would be the next resort after someone
has maxed out their blocksize.  Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.

I'd say make the blocksize a run-time variable and then do the spanning.

Darren


RE: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Stupor Genius wrote:

> >     There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see changing the blocksize as providing...
>
> If both features were present, I would say to increase the blocksize of
> the db to the max possible.  This would reduce the number of tuples that
> are spanned.  Each span would require another tuple fetch, so that could
> get expensive with each successive span or if every tuple spanned.
>
> But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> would get absolutely killed performance-wise.  Would make sense for them
> to go to 16k blocks where the reading of the extra bytes per block would
> be minimal, if anything, compared to the fetching/processing of the next
> span(s) to assemble the whole tuple.
>
> In summary, the capability to span would be the next resort after someone
> has maxed out their blocksize.  Each OS would have a different blocksize
> max...an AIX driver breaks when going past 16k...don't know about others.

    Oh...I like this :)  That would give us something that the "big
guys" don't have, no?  Bruce?

    Can someone clarify something for me?  If, for example, we have
the blocksize set to 16k, but the file system block size is 8k, would the
OS do both reads at the same time in order to get the full 16k?  I hope
someone can follow this through (unless I'm actually clear), but if we
left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
do we send a request to the OS for the one block, then, once we get that
back, determine that we need the next and request that?

    Damn, not clear at all...if I'm thinking right, by increasing the
blocksize to 16k, postgres does one read request, while the OS does two.
If we don't, postgres does two read requests while the OS still does two.

    Does that make sense?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] What I'm working on

From: Bruce Momjian
> On Sun, 23 Aug 1998, Bruce Momjian wrote:
>
> > Yes, I guess you could have both.  I just think the normal user is going
> > to prefer the span stuff, but you have a good point.  If we had
> > one, we could buy time to get the other.
>
>     For whoever is implementing the row-span stuff, can something be
> added that keeps track of the number of rows that are spanned?  i.e., if
> most of the rows are spanning blocks, then I would personally like to know
> that so that I can look at dumping and reloading the data with a database
> set to a higher blocksize...
>
>     There *has* to be some overhead, performance-wise, in the database
> having to keep track of row-spanning, and being able to reduce that, IMHO,
> is what I see changing the blocksize as providing...

Makes sense, though vacuum would presumably make all the blocks
contiguous.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > On Sun, 23 Aug 1998, Bruce Momjian wrote:
> >
> > > Yes, I guess you could have both.  I just think the normal user is going
> > > to prefer the span stuff, but you have a good point.  If we had
> > > one, we could buy time to get the other.
> >
> >     For whoever is implementing the row-span stuff, can something be
> > added that keeps track of the number of rows that are spanned?  i.e., if
> > most of the rows are spanning blocks, then I would personally like to know
> > that so that I can look at dumping and reloading the data with a database
> > set to a higher blocksize...
> >
> >     There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see changing the blocksize as providing...
>
> Makes sense, though vacuum would presumably make all the blocks
> contiguous.

    Still going to involve two read requests from the postmaster to
the operating system for those two blocks...vs. one if the tuple doesn't
have to span two blocks...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] What I'm working on

From: Bruce Momjian
>
>     Oh...I like this :)  That would give us something that the "big
> guys" don't have, no?  Bruce?
>
>     Can someone clarify something for me?  If, for example, we have
> the blocksize set to 16k, but the file system block size is 8k, would the
> OS do both reads at the same time in order to get the full 16k?  I hope
> someone can follow this through (unless I'm actually clear), but if we
> left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
> do we send a request to the OS for the one block, then, once we get that
> back, determine that we need the next and request that?

The filesystem block size really controls how fine-grained the file block
allocation is.  It keeps 8k blocks as one contiguous chunk on the disk
(ignoring trailing file fragments, which are blocksize/8 in size).

How the OS does the disk requests is different.  It is related to the
base size of a disk block (usually 512 bytes), and whether multiple requests
can be sent to the drive at the same time (tagged queuing?).  These are
really not related to the filesystem block size, except that larger
block sizes are made up of larger contiguous disk block groups.
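
For illustration, a full 16k page is still a single read() system call
from the backend's point of view, however many filesystem blocks and
disk sectors the kernel touches underneath (the path below is made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char    page[16384];
    int     fd = open("/usr/local/pgsql/data/base/mydb/mytable", O_RDONLY);
    ssize_t n;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    n = read(fd, page, sizeof(page));   /* one system call for the page */
    printf("got %ld bytes in one read()\n", (long) n);
    close(fd);
    return 0;
}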

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: Bruce Momjian
> >     There *has* to be some overhead, performance-wise, in the database
> > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > is what I see changing the blocksize as providing...
>
> If both features were present, I would say to increase the blocksize of
> the db to the max possible.  This would reduce the number of tuples that
> are spanned.  Each span would require another tuple fetch, so that could
> get expensive with each successive span or if every tuple spanned.
>
> But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> would get absolutely killed performance-wise.  Would make sense for them
> to go to 16k blocks where the reading of the extra bytes per block would
> be minimal, if anything, compared to the fetching/processing of the next
> span(s) to assemble the whole tuple.
>
> In summary, the capability to span would be the next resort after someone
> has maxed out their blocksize.  Each OS would have a different blocksize
> max...an AIX driver breaks when going past 16k...don't know about others.
>
> I'd say make the blocksize a run-time variable and then do the spanning.

If we could query to find the file system block size at runtime in a
portable way, that would help us pick the best block size, no?
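
On systems with the X/Open statvfs() interface, something like the
sketch below would do it; older BSDs only have statfs(), so treat this
as an illustration rather than a fully portable answer:

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    struct statvfs vfs;
    const char *path = (argc > 1) ? argv[1] : ".";  /* e.g. the data directory */

    if (statvfs(path, &vfs) != 0)
    {
        perror("statvfs");
        return 1;
    }
    printf("preferred I/O block size:   %lu\n", (unsigned long) vfs.f_bsize);
    printf("allocation (fragment) size: %lu\n", (unsigned long) vfs.f_frsize);
    return 0;
}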

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: Bruce Momjian
>
>     Still going to involve two read requests from the postmaster to
> the operating system for those two blocks...vs. one if the tuple doesn't
> have to span two blocks...

Yes, assuming it is not already in our buffer cache.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> > >     There *has* to be some overhead, performance-wise, in the database
> > > having to keep track of row-spanning, and being able to reduce that, IMHO,
> > > is what I see changing the blocksize as providing...
> >
> > If both features were present, I would say to increase the blocksize of
> > the db to the max possible.  This would reduce the number of tuples that
> > are spanned.  Each span would require another tuple fetch, so that could
> > get expensive with each successive span or if every tuple spanned.
> >
> > But if we stick with 8k blocksizes, people with tuples between 8 and 16k
> > would get absolutely killed performance-wise.  Would make sense for them
> > to go to 16k blocks where the reading of the extra bytes per block would
> > be minimal, if anything, compared to the fetching/processing of the next
> > span(s) to assemble the whole tuple.
> >
> > In summary, the capability to span would be the next resort after someone
> > has maxed out their blocksize.  Each OS would have a different blocksize
> > max...an AIX driver breaks when going past 16k...don't know about others.
> >
> > I'd say make the blocksize a run-time variable and then do the spanning.
>
> If we could query to find the file system block size at runtime in a
> portable way, that would help us pick the best block size, no?

    That doesn't sound too safe to me...what if I run out of disk
space on file system A (16k blocksize) and move one of the databases to
file system B (8k blocksize)?  If it auto-detects at run time, how is that
going to affect the tables?  Now my tuple size just dropped to 8k, but the
tables were using 16k tuples...

    Setting this should, I think, be a conscious decision on the
admin's part, unless, of course, there is nothing in the tables themselves
that is "hard coded" at 8k tuples, and it's purely in the server?  If it
is just in the server, then this would be cool, 'cause then I wouldn't have
to dump/reload if I moved to a better-tuned file system...just move the
files :)

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [HACKERS] What I'm working on

From: The Hermit Hacker
On Sun, 23 Aug 1998, Bruce Momjian wrote:

> >
> >     Oh...I like this :)  That would give us something that the "big
> > guys" don't have, no?  Bruce?
> >
> >     Can someone clarify something for me?  If, for example, we have
> > the blocksize set to 16k, but the file system block size is 8k, would the
> > OS do both reads at the same time in order to get the full 16k?  I hope
> > someone can follow this through (unless I'm actually clear), but if we
> > left the tuple size at 8k fixed, and had that 16k tuple span two blocks,
> > do we send a request to the OS for the one block, then, once we get that
> > back, determine that we need the next and request that?
>
> The filesystem block size really controls how fine-grained the file block
> allocation is.  It keeps 8k blocks as one contiguous chunk on the disk
> (ignoring trailing file fragments, which are blocksize/8 in size).
>
> How the OS does the disk requests is different.  It is related to the
> base size of a disk block (usually 512 bytes), and whether multiple requests
> can be sent to the drive at the same time (tagged queuing?).  These are
> really not related to the filesystem block size, except that larger
> block sizes are made up of larger contiguous disk block groups.

    Okay...but what I was more trying to get at was that, ignoring
the operating system level right now, a 16k tuple that has to span two 8k
blocks is going to require:

    1 read for the first half
    processing to determine that a second half is required
    1 read for the second half

    A 16k tuple that fits in a single 16k block will require:

    1 read for the whole thing

    Considering all the streamlining that you've been working at, it
seems illogical to advocate only a two-read system when we can have a
two-read system that gives us a base solution, with a one-read system for
those who wish to reduce that overhead...no?
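
A compilable sketch of that dependent-read chain (all names invented,
not real postgres calls):

#include <stdbool.h>

/* Invented stand-ins, for illustration only. */
typedef struct Chunk
{
    bool     has_continuation;  /* tuple continues in another block? */
    unsigned next_block;        /* block holding the next chunk      */
    unsigned next_offset;       /* line pointer within that block    */
} Chunk;

extern Chunk *read_chunk(unsigned block, unsigned offset);  /* one block read */

/*
 * Each continuation read depends on the chunk before it, so a tuple
 * spanning N blocks costs N strictly sequential reads; with a block
 * size large enough to hold the tuple, the loop body never runs.
 */
void fetch_spanned_tuple(unsigned first_block, unsigned first_offset)
{
    Chunk *chunk = read_chunk(first_block, first_offset);          /* read 1  */

    while (chunk->has_continuation)
        chunk = read_chunk(chunk->next_block, chunk->next_offset); /* read 2+ */
}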

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org