Thread: Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
The Hermit Hacker
Date:
Redirected to 'the proper list' - pgsql-hackers@postgresql.org

On Wed, 11 Mar 1998, Chris Albertson wrote:

> Also, is anyone working on storage managers?  I was thinking that
> a raw partition manager would be good to have.  Could be faster
> than one that uses the file system.  Give it two partitions and
> it could do striping and gain some real speed.

    striping can be done at the operating system level to give you
that 'boost'...and Oracle, in fact, moved away from the raw partition
level to just using the Unix file system...I believe it would
overcomplicate the backend, and give a negligible boost in performance, if
we had to build a 'low level drive interface'...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Chris Albertson
Date:
The Hermit Hacker wrote:
>
> Redirected to 'the proper list' - pgsql-hackers@postgresql.org
>
> On Wed, 11 Mar 1998, Chris Albertson wrote:
>
> > Also, is anyone working on storage managers?  I was thinking that
> > a raw partition manager would be good to have.  Could be faster
> > than one that uses the file system.  Give it two partitions and
> > it could do striping and gain some real speed.
>
>         striping can be done at the operating system level to give you
> that 'boost'...and Oracle, in fact, moved away from the raw partition
> level to just using the Unix file system...I believe it would
> overcomplicate the backend, and give a negligible boost in performance, if
> we had to build a 'low level drive interface'...

I know you must have looked at far more PostgreSQL code than I have, but
I was browsing the storage manager.  Apparently it is fairly easy to
assign a class to a manager, as each class is tagged in the system catalog
with a storage method.  What I really want is a >2GB table.  I was trying
to see if this was supported by reading the source.  Looks like it may be.
The note in the To Do list includes testing.  I would test it but for
lack of disk space. (I'll have more in a while.)

I need the >2GB feature badly enough that I'd implement it myself.  My thought
was that it may be easier to write a new manager than to understand and fix
a broken one.  A manager is just given a class name and block number and
told to either fetch or store it.  (Well, not quite so simple, but close.)

I don't think it needs to look inside the 8K (adjustable now) blocks.
Anyway, if I wrote such a beast, my real motivation would be to have big
tables.  Faster big tables would be just a plus.  What I really hope for is
that somebody else fixes the existing code :-)
--
--Chris Albertson

  chris@topdog.logicon.com                Voice:  626-351-0089  X127
  Logicon RDA, Pasadena California          Fax:  626-351-0699

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
dg@illustra.com (David Gould)
Date:
> Redirected to 'the proper list' - pgsql-hackers@postgresql.org
>
> On Wed, 11 Mar 1998, Chris Albertson wrote:
>
> > Also, is anyone working on storage managers?  I was thinking that
> > a raw partition manager would be good to have.  Could be faster
> > than one that uses the file system.  Give it two partitions and
> > it could do striping and gain some real speed.
>
>     striping can be done at the operating system level to give you
> that 'boost'...and Oracle, in fact, moved away from the raw partition
> level to just using the Unix file system...I believe it would
> overcomplicate the backend, and give a negligible boost in performance, if
> we had to build a 'low level drive interface'...
>
> Marc G. Fournier
> Systems Administrator @ hub.org
> primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

I have had the pleasure of working on the guts of one of the major databases'
raw partition storage managers over the last ten years (hint: not my
current domain), and guess what? It implements a file system. And not a
particularly good filesystem at that. Think about something like "FAT",
but not quite that nice. It was also a major source of pain in that it
was complex and heavily concurrent, and any errors showed up as massive data
loss or corruption. Be careful what you wish for.

Most of the supposed benefit comes from integrating the buffer cache
management and the write-ahead log so that you can defer or avoid I/O (as
long as the log records get to disk, there is no reason to ever write the
data page unless you need the buffer for something else). You can also
convert random I/O to semi-sequential I/O if most writes are done by a page
cleaner or by a checkpoint, as this gives you lots of I/O to sort.

I don't know the current state of Postgres so I cannot comment on it, but at
least with Illustra, the lack of a traditional write-ahead-log-style
transaction system was a major performance hit, as it forced an fsync at
every commit. A good WAL system gets many commits per log I/O, but
Illustra was stuck with many writes per transaction. If Postgres still does
this (and the recent elimination of time travel suggests that it might not),
it would be well worth fixing.
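
To make the two paragraphs above concrete, here is a minimal sketch in C
(not code from Postgres, Illustra, or any other system discussed here; all
names are invented): a commit only has to force its log record to disk,
and one log flush can make many buffered commits durable at once.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_BUF_SIZE 8192

    static char log_buf[LOG_BUF_SIZE];  /* in-memory tail of the log */
    static int  log_used = 0;           /* bytes not yet on disk     */
    static int  log_fd = -1;

    /* Append a commit record to the log buffer; no disk I/O happens
     * here, and the dirty data pages stay in the buffer cache.
     * (Bounds checks elided.) */
    static void log_append(const char *rec, int len)
    {
        memcpy(log_buf + log_used, rec, len);
        log_used += len;
    }

    /* One write()/fsync() pair makes every buffered commit record
     * durable, so N commits share a single log I/O ("group commit"). */
    static void log_flush(void)
    {
        (void) write(log_fd, log_buf, log_used);
        fsync(log_fd);
        log_used = 0;
    }

    int main(void)
    {
        log_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
        log_append("COMMIT 1;", 9);
        log_append("COMMIT 2;", 9);  /* two commits ...   */
        log_flush();                 /* ... one log fsync */
        return 0;
    }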

A last point: the raw-disk, implement-our-own-filesystem architecture used
by some systems is much more compelling if the filesystems are slow and
inflexible, and the filesystem caching is ineffective. These things were
more true back in the early 80's when these systems were being designed.
Things are not as bad now; in particular, ext2 has quite good performance.

Sorry for the ramble...

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - I realize now that irony has no place in business communications.


Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
> I have had the pleasure of working on the guts of one of the major databases'
> raw partition storage managers over the last ten years (hint: not my
> current domain), and guess what? It implements a file system. And not a
> particularly good filesystem at that. Think about something like "FAT",
> but not quite that nice. It was also a major source of pain in that it
> was complex and heavily concurrent, and any errors showed up as massive data
> loss or corruption. Be careful what you wish for.

Interesting.

>
> Most of the supposed benefit comes from integrating the buffer cache
> management and the write-ahead log so that you can defer or avoid I/O (as
> long as the log records get to disk, there is no reason to ever write the
> data page unless you need the buffer for something else). You can also
> convert random I/O to semi-sequential I/O if most writes are done by a page
> cleaner or by a checkpoint, as this gives you lots of I/O to sort.

Yes, I have heard that the standard file system read-ahead is often
useless for a database. On a raw partition you know the next block
that is going to be requested, so you can prefetch there rather than
having the file system prefetch the next sequential block.

It is also nice to be able to control what gets written to disk/fsync'ed
and what doesn't get fsync'ed.

> I don't know the current state of Postgres so I cannot comment on it, but at
> least with Illustra, the lack of a traditional write-ahead-log-style
> transaction system was a major performance hit, as it forced an fsync at
> every commit. A good WAL system gets many commits per log I/O, but
> Illustra was stuck with many writes per transaction. If Postgres still does
> this (and the recent elimination of time travel suggests that it might not),
> it would be well worth fixing.

Our idea is to control when pg_log gets written to disk.  We keep active
pg_log pages in shared memory, and every 30-60 seconds, we make a memory
copy of the current pg_log active pages, do a system sync() (which
happens anyway at that interval), update the pg_log file with the saved
changes, and fsync() the pg_log pages to disk.  That way, after a crash,
the current database only shows transactions as committed where we are
sure all the data has made it to disk.

I have a more detailed posting if you are interested.
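
As a reading aid, here is a compressed sketch of that flush cycle in C.
The names, sizes, and missing locking are all invented; the full proposal
is archived later in the thread.

    #include <string.h>
    #include <unistd.h>

    #define ACTIVE_PAGES 2
    #define PAGE_SIZE    8192

    static char pg_log_shared[ACTIVE_PAGES * PAGE_SIZE]; /* shared memory  */
    static int  pg_log_fd = -1;             /* on-disk pg_log, opened once */
    static long active_offset = 0;          /* offset of the active pages  */

    /* Run by the postmaster every 30-60 seconds. */
    static void pg_log_flush_pass(void)
    {
        char saved[ACTIVE_PAGES * PAGE_SIZE];

        /* 1. Snapshot the active pages (locking elided). */
        memcpy(saved, pg_log_shared, sizeof(saved));

        /* 2. Force dirty data pages out; the commits in the snapshot
         * mean nothing until their data is on disk. */
        sync();

        /* 3. Only now publish the commits in the real pg_log, so after
         * a crash a transaction shows as committed only if its data
         * reached disk in step 2. */
        lseek(pg_log_fd, active_offset, SEEK_SET);
        (void) write(pg_log_fd, saved, sizeof(saved));
        fsync(pg_log_fd);
    }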

> A last point: the raw-disk, implement-our-own-filesystem architecture used
> by some systems is much more compelling if the filesystems are slow and
> inflexible, and the filesystem caching is ineffective. These things were
> more true back in the early 80's when these systems were being designed.
> Things are not as bad now; in particular, ext2 has quite good performance.
>
> Sorry for the ramble...

No ramble at all.  It is not every day we get someone with real-world
experience in changing from a filesystem to a raw partition database
storage manager.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
Here is an archive of the pg_log discussion.

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711170542.AAA24561@candle.pha.pa.us>
Subject: [HACKERS] Buffered logging/pg_log
To: hackers@postgreSQL.org (PostgreSQL-development)
Date: Mon, 17 Nov 1997 00:42:18 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)

Here is my current idea for doing buffered logging; it sits between
the normal fsync-on-every-transaction and no-fsync options.  I believe
it will be very popular, because it mimics the Unix file system
reliability structure.

---------------------------------------------------------------------------

On startup, the postmaster makes a copy of pg_log, called pg_log_live.
Each postgres backend mmap()s this new file into its address space.  A
lock is taken to make changes to the file.  All backends use pg_log_live
rather than pg_log.  Only the postmaster writes to pg_log.  (I will
someday remove the exec() from the postmaster, so backends will get this
address space automatically.)

The first 512 bytes of pg_log and pg_log_live are used for log management
information.  We add a new field to pg_log_live called min_xid_commit,
which records the lowest transaction id that any backend has committed
since the start of the last sync pass of the postmaster.  We also add
fields to record the current pg_variable oid and xid at the same time.  (xid
may have to be moved into pg_variable so backends can fsync it (see
below).)

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
the minimum xid set at the start of pg_log, and resets its value.  It
records the current oid and xid from pg_variable.  It then releases the
lock, and starts reading from the minimum recorded changed xid to the
end of pg_log_live, copying it into allocated memory.  It then does a
sync (twice?), waits for completion, and then writes the pg_log_live
partial copy it made to pg_log.  We update the copies of oid and xid we
saved before the sync to the bottom of pg_log_live.

We can change the 60-90 seconds to be longer, but the system does a sync
every 30 seconds anyway.

When the postmaster stops, it does this same operation before shutting
down, and pg_log_live is removed.

We keep a copy of the current xid and oid in the front of pg_log_live,
so that if the postmaster starts up and pg_log_live already exists (a
crash), the postmaster adds 10,000 to the xid and oid of pg_variable, so
no previously used but unsynced values are reused.

We know that the current values of pg_variable could not have exceeded
the saved values by more than 10,000, because each backend consults the
pg_log copies of these variables to make sure it does not go more than
10,000 past the values from before the last sync.  Backends exceed those
values only by fsync'ing every 10,000 increments.

Said another way, if a postgres backend exceeds the last xid or oid
recorded in pg_log, or any 10,000 multiple beyond it, it must fsync the
change to pg_variable.  This way, a crash skips over any unsynced
oid/xids used, and this is done without having to keep fsync'ing
pg_variable.  In most cases, the 10,000 will never be exceeded by a
backend before the postmaster does a sync and increases the last
xid/oid again.
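
A sketch of that check, with all names invented for illustration, as it
might sit in an xid allocator:

    typedef unsigned long TransactionId;

    #define XID_FSYNC_INTERVAL 10000

    static TransactionId next_xid;        /* counter kept in pg_variable */
    static TransactionId last_saved_xid;  /* ceiling recorded in pg_log  */

    static void fsync_pg_variable(void)
    {
        /* write pg_variable and fsync() it (elided) */
    }

    static TransactionId GetNewTransactionId(void)
    {
        TransactionId xid = ++next_xid;

        /* Never hand out an xid more than 10,000 past what is on disk;
         * crash recovery adds 10,000, so it always lands above every
         * xid that was ever given out. */
        if (xid >= last_saved_xid + XID_FSYNC_INTERVAL)
        {
            fsync_pg_variable();
            last_saved_xid = xid;
        }
        return xid;
    }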

I think this is a very clean way to give us no-fsync performance with
full-rollback buffered logging.  The specification is clean and almost
complete enough for coding.

I think this gives us what we need, by having an mmap'ed pg_log_live,
which backends can use, and a postmaster-controlled pg_log, which is
used on startup, with xid/oid controls in a crash situation to skip over
partially committed transactions.

Comments?

--
Bruce Momjian
maillist@candle.pha.pa.us


---------------------------------------------------------------------------

Message-ID: <346FF895.167EB0E7@sable.krasnoyarsk.su>
Date: Mon, 17 Nov 1997 14:56:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: PostgreSQL-development <hackers@postgreSQL.org>,
        "Vadim B. Mikheev" <vadim@post.krasnet.ru>
Subject: Re: Buffered logging/pg_log
References: <199711170542.AAA24561@candle.pha.pa.us>

Bruce Momjian wrote:
>
> On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> Each postgres backend mmap()s this new file into its address space.  A
> lock is taken to make changes to the file.  All backends use pg_log_live
> rather than pg_log.  Only the postmaster writes to pg_log.  (I will
> someday remove the exec() from the postmaster, so backends will get this
> address space automatically.)

What are the advantages of mmap'ing the entire pg_log over keeping just
the "online" pg_log pages?
pg_log may be very big (tens of MB) - why should we have to spend
process address space on tens of MB of mostly unused data?
Also, do all systems have mmap?

>
> Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> the minimum xid set at the start of pg_log, and resets its value.  It
> records the current oid and xid from pg_variable.  It then releases the
> lock, and starts reading from the minimum recorded changed xid to the
> end of pg_log_live, copying it into allocated memory.  It then does a
> sync (twice?), waits for completion, and then writes the pg_log_live
        ^^^^^
man sync:

     The sync() function forces a write of dirty (modified) buffers in the
                         ^^^^^^
     block buffer cache out to disk...
...

BUGS
     Sync() may return before the buffers are completely flushed.

Vadim

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711171346.IAA01964@candle.pha.pa.us>
Subject: [HACKERS] Re: Buffered logging/pg_log
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Mon, 17 Nov 1997 08:46:29 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
In-Reply-To: <346FF895.167EB0E7@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 17, 97 02:56:05 pm

>
> Bruce Momjian wrote:
> >
> > On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> > Each postgres backend mmap()s this new file into its address space.  A
> > lock is taken to make changes to the file.  All backends use pg_log_live
> > rather than pg_log.  Only the postmaster writes to pg_log.  (I will
> > someday remove the exec() from the postmaster, so backends will get this
> > address space automatically.)
>
> What are the advantages of mmap'ing the entire pg_log over keeping just
> the "online" pg_log pages?
> pg_log may be very big (tens of MB) - why should we have to spend
> process address space on tens of MB of mostly unused data?
> Also, do all systems have mmap?

I believe you are correct that it would be better to keep the last few
pages of pg_log in shared memory rather than using mmap().

I think the important new idea is keeping track of the oid/xid before
the sync, so we can safely add 10,000 after a crash.

I am a little foggy on the race conditions of growing the pg_log region
while other backends are running, and of modifying non-shared memory pages,
but you seem to have a handle on it.

We don't need pg_log_live if only the postmaster writes those last two
pages to pg_log, and if we keep track of a crash status somewhere else,
perhaps at the start of pg_log.

>
> >
> > Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> > the minimum xid set at the start of pg_log, and resets its value.  It
> > records the current oid and xid from pg_variable.  It then releases the
> > lock, and starts reading from the minimum recorded changed xid to the
> > end of pg_log_live, copying it into allocated memory.  It then does a
> > sync (twice?), waits for completion, and then writes the pg_log_live
>         ^^^^^
> man sync:
>
>      The sync() function forces a write of dirty (modified) buffers in the
>                          ^^^^^^
>      block buffer cache out to disk...
> ...
>
> BUGS
>      Sync() may return before the buffers are completely flushed.
>
> Vadim
>

My BSD/OS man pages don't mention this, but syncing twice is a good idea.



--
Bruce Momjian
maillist@candle.pha.pa.us




--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
ocie@paracel.com
Date:
Bruce Momjian wrote:
>
> > I have had the pleasure of working on the guts of one of the major databases'
> > raw partition storage managers over the last ten years (hint: not my
> > current domain), and guess what? It implements a file system. And not a
> > particularly good filesystem at that. Think about something like "FAT",
> > but not quite that nice. It was also a major source of pain in that it
> > was complex and heavily concurrent, and any errors showed up as massive data
> > loss or corruption. Be careful what you wish for.
>
> Interesting.

Perhaps we could:

a) Incorporate an existing filesystem into the code (ext2?).  By
incorporate, I mean that we would just take the latest version of the
code and link it into the executable, or into a library, and make calls
to some of the lower-level access and allocation routines.

b) Suggest that for higher performance, the user should format the
disk partition with ext2 (or whatever), turn off caching, and set the
block size to the maximum possible.

I know for a fact that ext2 lets the user select the block size, and
it looks like Linux at least supports a sync mount option which makes
all I/O to this FS synchronous (which I assume would turn off write
caching at least).  If caching could be disabled, then option b would
seem to provide performance equivalent to a.

Ocie Mitchell

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
>
> Bruce Momjian wrote:
> >
> > > I have had the pleasure of working on the guts of one of the major databases'
> > > raw partition storage managers over the last ten years (hint: not my
> > > current domain), and guess what? It implements a file system. And not a
> > > particularly good filesystem at that. Think about something like "FAT",
> > > but not quite that nice. It was also a major source of pain in that it
> > > was complex and heavily concurrent, and any errors showed up as massive data
> > > loss or corruption. Be careful what you wish for.
> >
> > Interesting.
>
> Perhaps we could:
>
> a) Incorporate an existing filesystem into the code (ext2?).  By
> incorporate, I mean that we would just take the latest version of the
> code and link it into the executable, or into a library, and make calls
> to some of the lower-level access and allocation routines.
>
> b) Suggest that for higher performance, the user should format the
> disk partition with ext2 (or whatever), turn off caching, and set the
> block size to the maximum possible.
>
> I know for a fact that ext2 lets the user select the block size, and
> it looks like Linux at least supports a sync mount option which makes
> all I/O to this FS synchronous (which I assume would turn off write
> caching at least).  If caching could be disabled, then option b would
> seem to provide performance equivalent to a.

I checked, and under BSD/OS, the readahead call for ufs looks like:

            error = breadn(vp, lbn, size,
                (daddr_t *)&nextlbn, &nextsize, 1, NOCRED, &bp);

The '1' is requesting a read-ahead of one block past the requested
block.  Clearly this is not tunable, though a read-ahead of one is not a
significant performance problem.  In most cases, the block was already
read as part of the disk scan, but this gives us the next block in cases
where we are reading sequentially.

The sync option is not really desired, because we do our own syncs on
transaction completion; we don't want a sync on every write.  I don't
think you can disable caching.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
dg@illustra.com (David Gould)
Date:
Bruce Momjian writes:
> Yes, I have heard that the standard file system read-ahead is often
> useless for a database. On a raw partition you know the next block
> that is going to be requested, so you can prefetch there rather than
> having the file system prefetch the next sequential block.

At least on the systems I am intimately familiar with, the prefetch that the
OS does (assuming a modern OS like Linux) is pretty hard to beat. If you have
a table that was bulk loaded in key order, a sequential scan is going to
result in a sequential access pattern on the underlying file, and the OS
prefetch does the right thing. If you have an unindexed table with rows
inserted at the end, the OS prefetch still works. If you are using a secondary
index on some sort of chopped-up table with rows inserted willy-nilly,
then it may be worth doing async reads in a burst and letting the disk
request sort make the best of it.

As far as I am aware, Postgres does not do async I/O. Perhaps it should.
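
For illustration, here is roughly what such a burst could look like using
POSIX aio (which, as noted later in this thread, was not available
everywhere at the time). The file name and block numbers are made up, and
some systems need -lrt to link.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    #define BLCKSZ 8192
    #define NBURST 8

    int main(void)
    {
        /* Blocks an index scan says it will need soon, in no nice order. */
        long blocks[NBURST] = {17, 4, 90, 23, 41, 8, 66, 12};
        static char bufs[NBURST][BLCKSZ];
        struct aiocb cbs[NBURST];
        const struct aiocb *list[NBURST];
        int fd = open("table.data", O_RDONLY);

        /* Fire off all the reads at once; the kernel and the disk's
         * request sorting can reorder them into something close to
         * sequential. */
        for (int i = 0; i < NBURST; i++)
        {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = bufs[i];
            cbs[i].aio_nbytes = BLCKSZ;
            cbs[i].aio_offset = blocks[i] * BLCKSZ;
            (void) aio_read(&cbs[i]);
            list[i] = &cbs[i];
        }

        /* Real code would do useful work here; this just waits until
         * every read has completed. */
        for (int done = 0; done < NBURST; )
        {
            (void) aio_suspend(list, NBURST, NULL);
            done = 0;
            for (int i = 0; i < NBURST; i++)
                if (aio_error(&cbs[i]) != EINPROGRESS)
                    done++;
        }
        return 0;
    }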

> It is also nice to be able to control what gets written to disk/fsync'ed
> and what doesn't get fsync'ed.

This is really the big win.

> Our idea is to control when pg_log gets written to disk.  We keep active
> pg_log pages in shared memory, and every 30-60 seconds, we make a memory
> copy of the current pg_log active pages, do a system sync() (which
> happens anyway at that interval), update the pg_log file with the saved
> changes, and fsync() the pg_log pages to disk.  That way, after a crash,
> the current database only shows transactions as committed where we are
> sure all the data has made it to disk.

OK as far as it goes, but probably bad for concurrency if I have understood
you.

> I have a more detailed posting if you are interested.

Thanks, I will read it. Probably should hold more comments until after that ;-)

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - I realize now that irony has no place in business communications.


Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
> At least on the systems I am intimately familiar with, the prefetch that the
> OS does (assuming a modern OS like Linux) is pretty hard to beat. If you have
> a table that was bulk loaded in key order, a sequential scan is going to
> result in a sequential access pattern on the underlying file, and the OS
> prefetch does the right thing. If you have an unindexed table with rows
> inserted at the end, the OS prefetch still works. If you are using a secondary
> index on some sort of chopped-up table with rows inserted willy-nilly,
> then it may be worth doing async reads in a burst and letting the disk
> request sort make the best of it.
>
> As far as I am aware, Postgres does not do async I/O. Perhaps it should.

I am adding this to the TODO list:

    * Do async I/O to do better read-ahead of data

Because we are not threaded, we really can't do anything else while we
are waiting for I/O, but we can pre-request data we know we will need.

>
> > It is also nice to be able to control what gets written to disk/fsync'ed
> > and what doesn't get fsync'ed.
>
> This is really the big win.

Yep, and this is what we are trying to work around in our buffered
pg_log change.  Because we have the transaction ids all compact in one
place, this seems like a workable solution to our lack of write-to-disk
control.  We just control the pg_log writes.

>
> > Our idea is to control when pg_log gets written to disk.  We keep active
> > pg_log pages in shared memory, and every 30-60 seconds, we make a memory
> > copy of the current pg_log active pages, do a system sync() (which
> > happens anyway at that interval), update the pg_log file with the saved
> > changes, and fsync() the pg_log pages to disk.  That way, after a crash,
> > the current database only shows transactions as committed where we are
> > sure all the data has made it to disk.
>
> OK as far as it goes, but probably bad for concurrency if I have understood
> you.

Interested in hearing your comments.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
dg@illustra.com (David Gould)
Date:
Bruce Momjian writes:
> I am adding this to the TODO list:
>
>     * Do async I/O to do better read-ahead of data

Good.

> Because we are not threaded, we really can't do anything else while we
> are waiting for I/O, but we can pre-request data we know we will need.

Threading is a bit like raw devices. It sounds like a really good idea,
particularly with M$ banging the "NT, now with threads" drum, but in real
life there are some very good reasons not to thread, particularly with an
extensible product like Postgres, where J-Random routine gets loaded at
runtime. In a threaded system, J-Random routine needs to be pretty well
perfect or the whole system comes down. In a process-based system, unless
it trashes something in the shared memory, only the one connection instance
needs to come down. My experience with Illustra says that this is fairly
important.

The other big problem with threading is that the stacks and all dynamically
allocated data are now in the shared memory and are not easily extendable.
So, if some recursive procedure (like in the rewriter) uses a bit of extra
stack, some other thread gets its stack trashed. This is never pretty.

Or if some user function loads a giant large object (an mpeg, say), that
memory has to come out of the shared memory, and if they leak that memory,
it is gone for good. In a per-process system, it just ends up using a little
more swap space.

The other thing threading does is introduce new synchronization requirements
into things that never needed them before. For example, Postgres is pretty
free with memory allocation and deallocation (think Nodes!). With threading,
each palloc() and pfree() is going to have to take a round trip through a
mutex. This makes an expensive operation even more costly. By and large,
the DBMSs that are threaded have had pretty static (i.e. pre-allocate
everything in arrays at boot time) memory models. Postgres does not fit this
picture very well.
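
What that round trip would look like, as a hypothetical sketch (not the
real palloc(); the allocator internals are reduced to malloc()/free()):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    void *palloc(size_t size)
    {
        void *p;

        pthread_mutex_lock(&alloc_lock);   /* every Node allocation ... */
        p = malloc(size);                  /* ... serializes here       */
        pthread_mutex_unlock(&alloc_lock);
        return p;
    }

    void pfree(void *p)
    {
        pthread_mutex_lock(&alloc_lock);
        free(p);
        pthread_mutex_unlock(&alloc_lock);
    }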

Ultimately, threading may buy some performance, but not very much compared
to how hard it is to get right and how brittle it tends to make the system.

Unless I have misunderstood the state of Postgres, there is a vast amount
of performance improvement to be had without even thinking about threading.
If it were me, I would pick up the easy stuff, then the middle-hard stuff
with the really big wins like a proper transaction log, and leave the very
hard stuff like threading until last.

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
ocie@paracel.com
Date:
David Gould wrote:
>
> Bruce Momjian writes:
> > I am adding this to the TODO list:
> >
> >     * Do async I/O to do better read-ahead of data
>
> Good.
>
> > Because we are not threaded, we really can't do anything else while we
> > are waiting for I/O, but we can pre-request data we know we will need.
>
> Threading is a bit like raw devices. It sounds like a really good idea,
> particularly with M$ banging the "NT, now with threads" drum, but in real
> life there are some very good reasons not to thread, particularly with an
> extensible product like Postgres, where J-Random routine gets loaded at
> runtime. In a threaded system, J-Random routine needs to be pretty well
> perfect or the whole system comes down. In a process-based system, unless
> it trashes something in the shared memory, only the one connection instance
> needs to come down. My experience with Illustra says that this is fairly
> important.

I tend to agree.  I think threads are more useful in application
software, where the programmer has at least some idea of the use of
the program.

On a more practical/implementation note, it might be nice to have one
process that can perform prefetches for the others, i.e. if I know
that I will need page X, I queue a request to this process (in shared
memory?) and it will read the page in for us (or even write a page to
disk for us).
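
A minimal sketch of that request queue, assuming a fixed-size ring in
shared memory (names, sizes, and the elided locking are all hypothetical):

    #include <unistd.h>

    #define QUEUE_LEN 64
    #define BLCKSZ    8192

    typedef struct { int fd; long blockno; } PrefetchRequest;

    typedef struct
    {
        int head, tail;                 /* guarded by a lock (elided) */
        PrefetchRequest req[QUEUE_LEN];
    } PrefetchQueue;

    /* Backend side: drop a request in the ring and keep working. */
    static void prefetch_enqueue(PrefetchQueue *q, int fd, long blockno)
    {
        int next = (q->head + 1) % QUEUE_LEN;

        if (next == q->tail)
            return;             /* ring full; prefetch is only a hint */
        q->req[q->head].fd      = fd;
        q->req[q->head].blockno = blockno;
        q->head = next;
    }

    /* Prefetch process side: read each block and throw it away; the
     * point is to warm the OS cache so the backend's own read of the
     * same block finds it there. */
    static void prefetch_service(PrefetchQueue *q)
    {
        char buf[BLCKSZ];

        while (q->tail != q->head)
        {
            PrefetchRequest *r = &q->req[q->tail];

            lseek(r->fd, r->blockno * (long) BLCKSZ, SEEK_SET);
            (void) read(r->fd, buf, BLCKSZ);
            q->tail = (q->tail + 1) % QUEUE_LEN;
        }
    }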

Asynchronous I/O is interesting, but I don't know how widely it has
been implemented (Linux has none, unless the development version is
working on it, and Solaris has aio, but not the POSIX flavor yet).


Ocie Mitchell

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
> Threading is a bit like raw devices. It sounds like a really good idea,
> particularly with M$ banging the "NT, now with threads" drum, but in real
> life there are some very good reasons not to thread, particularly with an
> extensible product like Postgres, where J-Random routine gets loaded at
> runtime. In a threaded system, J-Random routine needs to be pretty well
> perfect or the whole system comes down. In a process-based system, unless
> it trashes something in the shared memory, only the one connection instance
> needs to come down. My experience with Illustra says that this is fairly
> important.

Yes, the threading topic has come up before, and I have never considered
it a big win.  We want to remove the exec() from the startup, so we just
do a fork.  That will save 0.001 seconds of startup.

That is a very easy win for us.  I hadn't considered the synchronization
problems with palloc/pfree, and that could be a real problem.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Michal Mosiewicz
Date:
Bruce Momjian wrote:

> Yes, the threading topic has come up before, and I have never considered
> it a big win.  We want to remove the exec() from the startup, so we just
> do a fork.  That will save 0.001 seconds of startup.

As I have always been biased toward threading, I would note that in many
cases it is a big win. First of all, today it's the IO which is usually the
slowest part of the database. Most good IO designs have
optimisations like tagged queuing, etc. But to make use of all this stuff
you have to generate your IO requests as soon as possible. If you
process your query in one thread, most of the time your processor waits
for the IO, but sometimes your IO waits for the processor. If you
populate your IO system with a bunch of IO requests in parallel,
controllers may try to requeue them to optimise head movements, etc.
However, if you do some IO, then some processing, then some IO... you
lose the ability to optimise your requests. (Unless your system is
heavily loaded; in that case it actually doesn't matter how you
schedule IO requests.) But if your system is not loaded too heavily, it's
good to parallelize IO tasks. And the easiest way to accomplish that is to
use threads for parallel execution of tasks.

But I notice that many people still think of threads as a replacement
for fork. Of course, in that case threading is pretty useless, since fork
is fast enough. But the key to success is to parallelize single queries, not
only to leverage processor usage, but also to push IO to its maximum.

> That is a very easy win for us.  I hadn't considered the synchronization
> problems with palloc/pfree, and that could be a real problem.

A few months ago I was thinking about it. Actually, I don't see many
problems with things like palloc/pfree. I don't see any problems with
any heap data that is used locally. But it is a big problem when you
take a look at global variables and global data that is accessed and
modified in many places. This is a potential source of troubles.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
dg@illustra.com (David Gould)
Date:
Michal Mosiewicz writes:
> As I have always been biased toward threading, I would note that in many
> cases it is a big win. First of all, today it's the IO which is usually the
> slowest part of the database. ...
...
> However, if you do some IO, then some processing, then some IO... you
> lose the ability to optimise your requests. ...
> ... But if your system is not loaded too heavily, it's
> good to parallelize IO tasks. And the easiest way to accomplish that is to
> use threads for parallel execution of tasks.

Agreed, but what you are talking about here is decomposing a query into
its parallel components and executing them in parallel. This is a win
of course, but the optimizer and executor have to support it. Also, you
start to want things like table fragmentation across devices to make this
work. A big job. As a shortcut, you can just do some lookahead on index scans
and do prefetch. That doesn't buy as much, but could probably be done very
quickly.

> But I notice that many people still think of threads as a replacement
> for fork. Of course, in that case threading is pretty useless, since fork
> is fast enough. But the key to success is to parallelize single queries, not
> only to leverage processor usage, but also to push IO to its maximum.

This is indeed what I was thinking about. The process-per-connection scheme
of Postgres is often criticised versus a thread-per-connection scheme, as in
Sybase for example. I was responding to that criticism.

> > That is a very easy win for us.  I hadn't considered the synchronization
> > problems with palloc/pfree, and that could be a real problem.
>
> A few months ago I was thinking about it. Actually, I don't see many
> problems with things like palloc/pfree. I don't see any problems with

If you have multiple threads each allocating memory at the same time, the
allocator data structures have to be protected.

> any heap data that is used locally. But it is a big problem when you
> take a look at global variables and global data that is accessed and
> modified in many places. This is a potential source of troubles.

Too right.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - I realize now that irony has no place in business communications.


Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
Bruce Momjian
Date:
> Agreed, but what you are talking about here is decomposing a query into
> its parallel components and executing them in parallel. This is a win
> of course, but the optimizer and executor have to support it. Also, you
> start to want things like table fragmentation across devices to make this
> work. A big job. As a shortcut, you can just do some lookahead on index scans
> and do prefetch. That doesn't buy as much, but could probably be done very
> quickly.

Yes, this is what I was thinking.  Could be done pretty easily, and
hence the TODO addition.

> If you have multiple threads each allocating memory at the same time, the
> allocator data structures have to be protected.
>
> > any heap data that is used locally. But it is a big problem when you
> > take a look at global variables and global data that is accessed and
> > modified in many places. This is a potential source of troubles.
>
> Too right.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From
"Vadim B. Mikheev"
Date:
Chris Albertson wrote:
>
> I know you must have looked at far more PostgreSQL code than I have, but
> I was browsing the storage manager.  Apparently it is fairly easy to
> assign a class to a manager, as each class is tagged in the system catalog
> with a storage method.  What I really want is a >2GB table.  I was trying
> to see if this was supported by reading the source.  Looks like it may be.
> The note in the To Do list includes testing.  I would test it but for
> lack of disk space. (I'll have more in a while.)
  ^^^^^^^^^^^^^^^^^^

src/backend/storage/smgr/md.c:

#define RELSEG_SIZE ...

- just decrease it!  (With a smaller segment size, the multi-segment code
kicks in long before 2GB, so it can be tested without that much disk space.)

Vadim