Thread: [hackers]development suggestion needed

[hackers]development suggestion needed

From
xun@cs.ucsb.edu (Xun Cheng)
Date:
Hi, everyone, I like your work very much and hope PostgreSQL can
grow into something competitive with Oracle just like Linux vs.
Windows.

I have background in relational database management system
research and I want to try to be a developer for PostgreSQL.
Right now I only try to be familiar with your code base. I
plan to start with a specific function module in the backend.
I'm thinking of /docs/pgsql/src/backend/executor because
I want to experiment with some new fast join algorithms.
My long term objective is to introduce materialized view
subsystem into PostgreSQL. Could anyone tell me if
the directory /docs/pgsql/src/backend/executor is the 
right place to start or just give me some general suggestions
which are not in the FAQs? Oh one more thing I want to
mention is that those join algorithms I want to experiment
with may have some special data access paths similar to an index.

Further if it doesn't bother you much, could someone
answer the following question(s) for me? (Sorry if
some are already in the docs)
1. Does postgresql do raw storage device management or it relies  on file system? My impression is no raw device. If
no, is it difficult to add it and possibly how?
 
2. Do you have standard benchmark results for postgresql?  I guess not since it only implements a subset of SQL'92.
Whatabout subset of a benchmark or something repeatable?
 
3. Suppose I have added a new two rel. join algorithm, how  would I proceed to compare the performance of it with   the
exisitingtwo relation join algorithms under  different senarios? Are there any existing facilities  in the current code
basefor this purpose? Am I right  that the available join algos implemented are nested loop  join (including
index-based),hash join (which one? hybrid),  sort-merge join?
 
4. Usually a single sequential pass of a large joining relation  is preferred to random access in large join operation.
It's mostly because of the current disk access characteristics.  Is it possible for me to do some benchmarking about
this using postgresql? What I'm actually asking are the issues about   how to control the flow of data form disk to
buffers, how to stop file system interference and how to arrange  actual data placement on the disk.
 

Sorry again if I'm not clear with my questions. I'd like
to further explain them if necessary.

thanks for any help
xun
   



Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> I have background in relational database management system
> research and I want to try to be a developer for PostgreSQL.
> Right now I only try to be familiar with your code base. I
> plan to start with a specific function module in the backend.
> I'm thinking of /docs/pgsql/src/backend/executor because
> I want to experiment with some new fast join algorithms.
> My long term objective is to introduce materialized view
> subsystem into PostgreSQL. Could anyone tell me if
> the directory /docs/pgsql/src/backend/executor is the 
> right place to start or just give me some general suggestions
> which are not in the FAQs? Oh one more thing I want to
> mention is that those join algorithms I want to experiment
> with may have some special data access paths similar to an index.

Good.

> 
> Further if it doesn't bother you much, could someone
> answer the following question(s) for me? (Sorry if
> some are already in the docs)
> 1. Does postgresql do raw storage device management or it relies
>    on file system? My impression is no raw device. If no,
>    is it difficult to add it and possibly how?

No, only file system.  We don't see much advantage to raw i/o.

> 2. Do you have standard benchmark results for postgresql?
>    I guess not since it only implements a subset of SQL'92.
>    What about subset of a benchmark or something repeatable?

We do the Wisconsin.  I think it is in the source tree.

> 3. Suppose I have added a new two rel. join algorithm, how
>    would I proceed to compare the performance of it with 
>    the exisiting two relation join algorithms under
>    different senarios? Are there any existing facilities
>    in the current code base for this purpose? Am I right
>    that the available join algos implemented are nested loop
>    join (including index-based), hash join (which one? hybrid),
>    sort-merge join?

You can control the join types used with flags to postgres.  Very easy.

> 4. Usually a single sequential pass of a large joining relation
>    is preferred to random access in large join operation.
>    It's mostly because of the current disk access characteristics.
>    Is it possible for me to do some benchmarking about this
>    using postgresql? What I'm actually asking are the issues about 
>    how to control the flow of data form disk to buffers,
>    how to stop file system interference and how to arrange
>    actual data placement on the disk.

Good idea.  We deal with this regularly in deciding to use an index in
the optimizer or a sequential scan.  Our optimizer is quite good.



--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Tom Lane
Date:
xun@cs.ucsb.edu (Xun Cheng) writes:
> I want to experiment with some new fast join algorithms.

Cool.  Welcome aboard!

> Could anyone tell me if
> the directory /docs/pgsql/src/backend/executor is the 
> right place to start

The executor is only half the problem: you must also teach the
planner/optimizer how and when to use the new join type.

Hiroshi Inoue has recently been down this path (adding support
for TID-based scans), and might be able to give you more specific
advice.

> 1. Does postgresql do raw storage device management or it relies
>    on file system? My impression is no raw device. If no,
>    is it difficult to add it and possibly how?

Postgres uses Unix files.  We have avoided raw-device access mostly on
grounds of portability.  To persuade people that such a change should go
into the distribution, you'd need to prove that *significantly* better
performance is obtained with raw access.  I for one don't think it's a
foregone conclusion; Postgres gets considerable benefit from sitting
atop Unix kernel device schedulers and disk buffer caches.

As far as the actual implementation goes, the low level access methods
go through a "storage manager" switch that was intended to allow for
the addition of a new storage manager, such as a raw-device manager.
So you could get a good deal of stuff working by implementing code that
parallels md.c/fd.c.  The main problem at this point is that there is a
fair amount of utility code that goes out and does its own manipulation
of the database file structure.  You'd need to clean that up by pushing
it all down below the storage manager switch (inventing new storage
manager calls as needed).

>    that the available join algos implemented are nested loop
>    join (including index-based), hash join (which one? hybrid),
>    sort-merge join?

Right.  The hash join uses batching if it estimates that the relation
is too large to fit in memory; is that what you call "hybrid"?

> 4. Usually a single sequential pass of a large joining relation
>    is preferred to random access in large join operation.
>    It's mostly because of the current disk access characteristics.
>    Is it possible for me to do some benchmarking about this
>    using postgresql? What I'm actually asking are the issues about 
>    how to control the flow of data form disk to buffers,
>    how to stop file system interference and how to arrange
>    actual data placement on the disk.

You don't get to do either of the latter two unless you write a
raw-device storage manager --- which'd be a fair amount of work
for what might be little return.  Are you convinced that that ought
to be the first thing you work on?  I'd be inclined to think about
join algorithms in the abstract, without trying to control physical
disk placement of the data...
        regards, tom lane


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 08:23 PM 1/13/00 -0500, Tom Lane wrote:

>As far as the actual implementation goes, the low level access methods
>go through a "storage manager" switch that was intended to allow for
>the addition of a new storage manager, such as a raw-device manager.
>So you could get a good deal of stuff working by implementing code that
>parallels md.c/fd.c.  The main problem at this point is that there is a
>fair amount of utility code that goes out and does its own manipulation
>of the database file structure.  You'd need to clean that up by pushing
>it all down below the storage manager switch (inventing new storage
>manager calls as needed).

This would need to be done to implement some sort of tablespace-style
facility, too, right?  I'm off Xun's thread in asking but I've been
wondering.  DBs like Oracle allow you to place tables and indices
whereever you like in the filesystem.  This is normally done to
distribute things across different spindles, and in large, busy
databases makes a significant difference.  I've done some experimenting
moving index files to a different spindle (using "ln" to fool 
postgres, of course) and insertions go measurably faster.  Spindles
are so cheap nowadays :)

I know there's been discussion of letting folks specify where the
WAL will be placed when it's implemented, for safety's sake - it 
will also improve performance.

>You don't get to do either of the latter two unless you write a
>raw-device storage manager

Not within a single filesystem, but scattering things across spindles
could be done without a raw-device storage manager :)

(not what he's talking about, but heck, thought I'd raise it)



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Tom Lane
Date:
Don Baccus <dhogaza@pacifier.com> writes:
> This would need to be done to implement some sort of tablespace-style
> facility, too, right?  I'm off Xun's thread in asking but I've been
> wondering.  DBs like Oracle allow you to place tables and indices
> whereever you like in the filesystem.  This is normally done to
> distribute things across different spindles, and in large, busy
> databases makes a significant difference.  I've done some experimenting
> moving index files to a different spindle (using "ln" to fool 
> postgres, of course) and insertions go measurably faster.  Spindles
> are so cheap nowadays :)

As you say, you can fake it manually with symbolic links, but that's
a kluge.

The "database location" stuff that Peter and Thomas have been arguing
about is intended to allow a single postmaster to control databases that
are in multiple physical locations --- but there seems to be some debate
as to whether it works ;-).  (I never tried it.)  In any case, we don't
currently have any official provision for controlling location at finer
than database level.  It'd be nice to be able to push individual tables
around, I suppose.

This wouldn't require a new storage manager, since presumably you'd
still be using the Unix-filesystem storage manager.  The trick would be
to allow a path rather than just a base file name to be specified
per-relation.  I'm not sure if it'd be hard or not.  Probably, all the
system tables would have to stay in the database's default directory,
but maybe user tables could be given path names without too much
trouble...
        regards, tom lane


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 09:01 PM 1/13/00 -0500, Tom Lane wrote:

>As you say, you can fake it manually with symbolic links, but that's
>a kluge.

Yes, it is.  Or worse :)
>
>The "database location" stuff that Peter and Thomas have been arguing
>about is intended to allow a single postmaster to control databases that
>are in multiple physical locations --- but there seems to be some debate
>as to whether it works ;-).  (I never tried it.)  In any case, we don't
>currently have any official provision for controlling location at finer
>than database level.  It'd be nice to be able to push individual tables
>around, I suppose.

Putting indices on different spindles than the tables is known to
significantly speed up the Ars Digita Community system under load
with Oracle.  Systems like this, used to back busy web sites, stuff
things into tables many times a second.  As I mentioned, I've played
around a bit with postgres using "ln" and it does indeed help boost
the number of inserts my (paltry, two-spindle) system could sustain.

The selects that such sites spew forth are handled wonderfully
by Postgres now, with MVCC and the change that stops the update
of pg_log after read-only selects.

My site's still in the experimental stage, being used by a couple
dozen folks to record bird distribution data in the Pacific NW, so
I don't personally have real-world data to get a feeling for how
important this might become.  Still, Oracle DBA docs talk a lot
about it so in some real-world scenarios being able to distribute
tables and indices on different spindles must pay off.

>
>This wouldn't require a new storage manager, since presumably you'd
>still be using the Unix-filesystem storage manager.  The trick would be
>to allow a path rather than just a base file name to be specified
>per-relation.  I'm not sure if it'd be hard or not.  Probably, all the
>system tables would have to stay in the database's default directory,
>but maybe user tables could be given path names without too much
>trouble...

I've looked into it, actually, and have reached the same conclusion.
Including the bit about keeping system tables in the database's default
directory.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> As you say, you can fake it manually with symbolic links, but that's
> a kluge.
> 
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-).  (I never tried it.)  In any case, we don't
> currently have any official provision for controlling location at finer
> than database level.  It'd be nice to be able to push individual tables
> around, I suppose.
> 
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager.  The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation.  I'm not sure if it'd be hard or not.  Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

Or we could continue to use symlinks, and just create them ourselves in
the backend.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> This would need to be done to implement some sort of tablespace-style
> facility, too, right?  I'm off Xun's thread in asking but I've been
> wondering.  DBs like Oracle allow you to place tables and indices
> whereever you like in the filesystem.  This is normally done to
> distribute things across different spindles, and in large, busy
> databases makes a significant difference.  I've done some experimenting
> moving index files to a different spindle (using "ln" to fool 
> postgres, of course) and insertions go measurably faster.  Spindles
> are so cheap nowadays :)

WAL will add the oid to the base file name.  That may make tablespaces
easier.


--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Tatsuo Ishii
Date:
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-).  (I never tried it.)  In any case, we don't
> currently have any official provision for controlling location at finer
> than database level.  It'd be nice to be able to push individual tables
> around, I suppose.
> 
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager.  The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation.  I'm not sure if it'd be hard or not.  Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

This is possible since PostgreSQL was born unless I misunderstand what
you are saying...

test=> create table "/tmp/t1" (i int);
CREATE
bash$ ls -l /tmp/t1
-rw-------   1 postgres postgres        0 Jan 14 11:19 /tmp/t1

Even,

test=> create table "../test2/pg_proc" (i int);
ERROR:  cannot create ../test2/pg_proc

This is not good. Maybe we should prevent to make this kind of table
names.

BTW, it would be nice to add a "table space" concept to the create
table statement.

-- reserve a table space named 'foo' which is physically located under
-- /pg/myspace. Only PostgreSQL super user can execute this command
-- to avoid security risks.
create table space foo as '/pg/myspace';

-- create table t1 under /pg/myspace
create table t1 (i int) with table space 'foo';
--
Tatsuo Ishii


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 11:42 AM 1/14/00 +0900, Tatsuo Ishii wrote:

>This is possible since PostgreSQL was born unless I misunderstand what
>you are saying...
>
>test=> create table "/tmp/t1" (i int);

Clever...

Of course, when I'm porting over thousands of lines of an Oracle
data model and tens of thousands of lines of scripting code that
refers to these tables via queries this is a very inconvenient
way to locate a particular table in a particular place.  It involves
changing a lot of code...

Besides being somewhat ... baroque? :)

>BTW, it would be nice to add a "table space" concept to the create
>table statement.

I figured you felt that way!

>
>-- reserve a table space named 'foo' which is physically located under
>-- /pg/myspace. Only PostgreSQL super user can execute this command
>-- to avoid security risks.
>create table space foo as '/pg/myspace';
>
>-- create table t1 under /pg/myspace
>create table t1 (i int) with table space 'foo';

Yes, that's the Oracle-ish style of it.  Of course, Oracle allows
all sorts of anal retentive features like allowing a DBA to
restrict the size of the tablespace, etc that I personally
don't care about...

Though I understand why they're important to some.

Oracle tables and indices within a single tablespace all live in
one file (if you're using filesystem rather than raw I/O), so 
they also provide features which allow you to specify how big
a chunk to allocate per extent (Oracle pre-allocates to avoid
running out of disk space while you're running except in ways
that you control, and in hopes of getting contiguous chunks of
disk storage because they hope you're using previously empty
disks used only for Oracle).

Features like this don't fit well with each table/index residing
in its own file.  Personally I don't have any need for them, either,
but as Postgres gets more popular (as it will as it continues to
improve) it may attract the attention of folks with traditional
DBA requirements like this.

Of course, that would require a new storage manager, one similar
in concept to what would be needed to implement raw I/O.




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> Oracle tables and indices within a single tablespace all live in
> one file (if you're using filesystem rather than raw I/O), so 
> they also provide features which allow you to specify how big
> a chunk to allocate per extent (Oracle pre-allocates to avoid
> running out of disk space while you're running except in ways
> that you control, and in hopes of getting contiguous chunks of
> disk storage because they hope you're using previously empty
> disks used only for Oracle).

And with data and index in the same file, you can't split them across
spindles.


--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Mike Mascari
Date:
Bruce Momjian wrote:
> 
> > Oracle tables and indices within a single tablespace all live in
> > one file (if you're using filesystem rather than raw I/O), so
> > they also provide features which allow you to specify how big
> > a chunk to allocate per extent (Oracle pre-allocates to avoid
> > running out of disk space while you're running except in ways
> > that you control, and in hopes of getting contiguous chunks of
> > disk storage because they hope you're using previously empty
> > disks used only for Oracle).
> 
> And with data and index in the same file, you can't split them across
> spindles.
> 

But you can certainly do that in ORACLE, if you wish. In fact,
ORACLE recommends it:

Place Data Files for Maximum Performance 

Tablespace location is determined by the physical location of the
data files that constitute that tablespace. Use the hardware
resources of your computer appropriately.

For example, if several disk drives are available to store the
database, it might be helpful to store table data in a tablespace
on one disk drive, and index data in a tablespace on another disk
drive. This way, when users query table information, both disk
drives can work simultaneously, retrieving table and index data
at the same time.

Mike Mascari


Re: [HACKERS] [hackers]development suggestion needed

From
The Hermit Hacker
Date:
On Thu, 13 Jan 2000, Tom Lane wrote:

> Don Baccus <dhogaza@pacifier.com> writes:
> > This would need to be done to implement some sort of tablespace-style
> > facility, too, right?  I'm off Xun's thread in asking but I've been
> > wondering.  DBs like Oracle allow you to place tables and indices
> > whereever you like in the filesystem.  This is normally done to
> > distribute things across different spindles, and in large, busy
> > databases makes a significant difference.  I've done some experimenting
> > moving index files to a different spindle (using "ln" to fool 
> > postgres, of course) and insertions go measurably faster.  Spindles
> > are so cheap nowadays :)
> 
> As you say, you can fake it manually with symbolic links, but that's
> a kluge.
> 
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-).  (I never tried it.)  In any case, we don't
> currently have any official provision for controlling location at finer
> than database level.  It'd be nice to be able to push individual tables
> around, I suppose.
> 
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager.  The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation.  I'm not sure if it'd be hard or not.  Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

Okay, I've been thinking about this recently with the whole Udmsearch of
PostgreSQL.  We just put a 9gig drive online to handle this, as well as
other database related projects, since I wanted alot of room to grow
(PostgreSQL itself indexed out to something like 1gig, and the lists are
growing) ...

All the major OSs out there have "disk management tools" that allow you to
build "large file systems" out of smaller ones... Solaris has DiskSuite,
FreeBSD has vinum, Linux has ??... why are we looking/investigating adding
a level of complexity to PostgreSQL to handle something that, as far as I
know, each of the OSs out there already has a way of dealing with?

Some aren't necessarily mature yet...Solaris's is the only one that I'm
aware of that has a *beautiful* growfs program that allows you to add a
new drive to an existing "pack" and grow the file system into that new
drive while the system is live...but the utilities are there...

I think the major problem that I'm worried about isn't spreading tables
across drives, but its when that *one* table grows to the point that its
about to overflow my drive...I'd rather add a 9gig drive on, make it an
18gig file system, and let it continue to grow...

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



RE: [HACKERS] [hackers]development suggestion needed

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tatsuo Ishii
> 
> 
> BTW, it would be nice to add a "table space" concept to the create
> table statement.
> 
> -- reserve a table space named 'foo' which is physically located under
> -- /pg/myspace. Only PostgreSQL super user can execute this command
> -- to avoid security risks.
> create table space foo as '/pg/myspace';
> 
> -- create table t1 under /pg/myspace
> create table t1 (i int) with table space 'foo';
> --

I agree with Tatsuo though I prefercreate table t1 (i int) tablespace foo;
.
Isn't it preferable to encapsulate the table location and storage type ?

At first,a tablespace would only correspond to a directory and it won't
be so difficult to implment. But we would gain a lot with the feature.

In the future,the tablespace may be changed to mean real(??)
tablespace.  

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )

From
The Hermit Hacker
Date:
On Thu, 13 Jan 2000, Don Baccus wrote:

> My site's still in the experimental stage, being used by a couple
> dozen folks to record bird distribution data in the Pacific NW, so
> I don't personally have real-world data to get a feeling for how
> important this might become.  Still, Oracle DBA docs talk a lot
> about it so in some real-world scenarios being able to distribute
> tables and indices on different spindles must pay off.

What would it take to break the data/base/<database> directory down?  To
something like, maybe:

data/base/<database>/pg_*                   /tables/*                   /indices/*

Then, one could easily mount a drive as /tables and another one as
/indices ...

We know the difference between a table and an index, so I wouldn't think
it would be *too* hard add /tables/ internally to the existing
path...would it?

You'd basically have somethign like:

sprintf("%s/data/base/%s/tables/%s", data_dir, database, tablename);

Instead of:

sprintf("%s/data/base/%s/%s", data_dir, database, tablename);

I know, I'm being simplistic here, but...

Or, a different way:

if(table) sprintf("%s/data/base/table/%s/%s", data_dir,database,tablename);
else if(index) sprintf("%s/data/base/index/%s/%s", data_dir,database,tablename);
else sprintf("%s/data/base/sys/%s/%s", data_dir,database,sysfile);

This would give you the ability to put all table from all databass onto
one file system, and all indexes onto another, and all system files onto a
third...

I don't know, I'm oversimplying and spewing thoughts out
again...but...*shrug*

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



> On Thu, 13 Jan 2000, Don Baccus wrote:
> 
> > My site's still in the experimental stage, being used by a couple
> > dozen folks to record bird distribution data in the Pacific NW, so
> > I don't personally have real-world data to get a feeling for how
> > important this might become.  Still, Oracle DBA docs talk a lot
> > about it so in some real-world scenarios being able to distribute
> > tables and indices on different spindles must pay off.
> 
> What would it take to break the data/base/<database> directory down?  To
> something like, maybe:
> 
> data/base/<database>/pg_*
>                     /tables/*
>                     /indices/*

And put sort and large objects somewhere separate too.
--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Tatsuo Ishii
Date:
>I agree with Tatsuo though I prefer
>    create table t1 (i int) tablespace foo;
>.
>Isn't it preferable to encapsulate the table location and storage type ?

Agreed.

>At first,a tablespace would only correspond to a directory and it won't
>be so difficult to implment. But we would gain a lot with the feature.
>
>In the future,the tablespace may be changed to mean real(??)
>tablespace.  

Good point.

> I think the major problem that I'm worried about isn't spreading tables
> across drives, but its when that *one* table grows to the point that its
> about to overflow my drive...I'd rather add a 9gig drive on, make it an
> 18gig file system, and let it continue to grow...

We could extend the create tablespace command something like:

create tablespace foo as '/pg/myspace1 /pg/myspace2 ';

to spread a table (space) among different disk drives.  Moreover we
could define the "policy" to use the tablespace:

create tablespace foo as '/pg/myspace1 /pg/myspace2 ' policy roundrobin;

in above case, if the table hits the 1GB limit in /pg/myspace1 then
new segment will be created in /pg/myspace2.

Just an idea...
--
Tatsuo Ishii



Re: Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )

From
The Hermit Hacker
Date:
On Thu, 13 Jan 2000, Bruce Momjian wrote:

> > On Thu, 13 Jan 2000, Don Baccus wrote:
> > 
> > > My site's still in the experimental stage, being used by a couple
> > > dozen folks to record bird distribution data in the Pacific NW, so
> > > I don't personally have real-world data to get a feeling for how
> > > important this might become.  Still, Oracle DBA docs talk a lot
> > > about it so in some real-world scenarios being able to distribute
> > > tables and indices on different spindles must pay off.
> > 
> > What would it take to break the data/base/<database> directory down?  To
> > something like, maybe:
> > 
> > data/base/<database>/pg_*
> >                     /tables/*
> >                     /indices/*
> 
> And put sort and large objects somewhere separate too.

why not?  by default, one drive, it would make no difference except for
file layout, but it would *really* give room to expand...

Right now, the udmsearch database contains (approx):

tables:10528 dict10 5088 dict11 2608 dict12 3232 dict1664336 dict247960 dict3 3096 dict3265952 dict442944 dict536384
dict634792dict721008 dict814120 dict931912 url
 

indexs: 5216 url_id10 2704 url_id11 1408 url_id12 1648 url_id1636440 url_id227128 url_id3 1032 url_id3237416
url_id422600url_id519096 url_id618248 url_id710880 url_id8 6920 url_id9 6464 word10 3256 word11 1672 word12 2280
word1626344word221200 word3 2704 word3228720 word421880 word519240 word618464 word711952 word8 8864 word9
 

if tables/indexs were in different subdirectories, it would be too easy
for me, at some point in the future, to take just the tables directory and
put them on their own dedicated drive, halving the space used on either
drive...

I don't know...IMHO, it sounds like the simplist solution that provides
the multi-spindle benefits ppl are suggesting...



At 12:31 AM 1/14/00 -0400, The Hermit Hacker wrote:

>This would give you the ability to put all table from all databass onto
>one file system, and all indexes onto another, and all system files onto a
>third...
>
>I don't know, I'm oversimplying and spewing thoughts out
>again...but...*shrug*

This kind of hack would certainly be doable, but I guess the question
arises once again - is PostgreSQL shooting to be the Big Time or not?

I mean, the mere use of words like "Bronze Support" and "Silver Support"
remind one of Oracle :)

This particular issue of the placement of tables and indices pales
in importance compared to outer joins, for instance.  But once all
the big things are implemented, folks doing BIG JOBS will look at
Postgres as being a viable alternative.  And they'll then be disappointed
if they don't have the kind of control over files that they do with
Oracle or other big-time commercial DBs...the folks talking about
replication are coming from the same space, though much more ambitiously
(since a simple tablespace feature could be simple, while I can't
think of any simple replication hacks).

I'd like to see Postgres succeed in a big way.  I don't see it toppling
Oracle, but heck I can't see why Interbase can't be ground into dust.
Open Source, great functionality, maybe B+ on scalability etc (thus not
toppling Oracle but equal to most others) ... that's not too ambitious
a goal, is it?



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


RE: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 01:34 PM 1/14/00 +0900, Hiroshi Inoue wrote:

>I agree with Tatsuo though I prefer
>    create table t1 (i int) tablespace foo;
>.
>Isn't it preferable to encapsulate the table location and storage type ?

Encapsulation is almost always preferable, but I don't think Tatsuo was
saying otherwise, merely pointing out a clever trick that I hadn't
though of (I didn't realize that "/foo" would be rooted rather than
just put under PGDATA, in fact I stay away from quoted non-SQL non-"normal
language" identifiers altogether, being somewhat interested in portability
of my application code).

And it is a clever trick...obvious to an insider, but not to me.

>At first,a tablespace would only correspond to a directory and it won't
>be so difficult to implment. But we would gain a lot with the feature.

>In the future,the tablespace may be changed to mean real(??)
>tablespace.  

I think the decision to keep the mechanism for providing separate
storage mangers was a good one (if I understand correctly that there
was once consideration of removing it).  Even if it is never used in
mainstream Postgres, some specialty application may use it, one of
the potentials that comes from open source software.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 10:18 PM 1/13/00 -0500, Bruce Momjian wrote:
>> Oracle tables and indices within a single tablespace all live in
>> one file (if you're using filesystem rather than raw I/O), so 
>> they also provide features which allow you to specify how big
>> a chunk to allocate per extent (Oracle pre-allocates to avoid
>> running out of disk space while you're running except in ways
>> that you control, and in hopes of getting contiguous chunks of
>> disk storage because they hope you're using previously empty
>> disks used only for Oracle).
>
>And with data and index in the same file, you can't split them across
>spindles.

Which is why folks define more than one tablespace, eh?  Something
you can't do in PG...

Perhaps I didn't make it clear that you can define as many
tablespaces as you want?

And freely assign any table or index to any tablespace you
want?

Given my statement above, it is clear you can:

1. Coalesce indices and tables into a single tablespace (the  default) if you only define one (again, more or less how
Oracledefaults, though I sorta forget because I'm not an  Oracle stud and no one in their right mind allows Oracle to
setdefaults, because they're always wrong)
 

-or-

2. At the other extreme, you can define as many tablespaces as   you have tables and indices, and each can live in
theirown  separate tablespace (i.e. spindle, if that is what you want  to do).
 

-or-

3. Set yourself up at any point between either extreme, according  to your own needs.

I don't think it's that difficult to understand, is it?



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 12:21 AM 1/14/00 -0400, The Hermit Hacker wrote:

>All the major OSs out there have "disk management tools" that allow you to
>build "large file systems" out of smaller ones... Solaris has DiskSuite,
>FreeBSD has vinum, Linux has ??... why are we looking/investigating adding
>a level of complexity to PostgreSQL to handle something that, as far as I
>know, each of the OSs out there already has a way of dealing with?

If the resident OS can handle the issue, sure, it is worth investigating.
Linux today does not (at least, the one I'm running).

One godliness-over-utility issue is the fact that doing such things in
the operating system ("ln" also works...) kinda violates the RDBMS ideal
of having everything about a database, including metadata, stored in
the database.  

In the case of "CREATE TABLESPACE" having Postgres handle placement
places the burden of operating system specifics where it belongs - on the 
implementation.  This is why we say things like "integer" or "numeric",
come to think of it...

The word "portability" comes to mind, though of course things like
spindle numbers and the like are extremely variable.  

>Some aren't necessarily mature yet...Solaris's is the only one that I'm
>aware of that has a *beautiful* growfs program that allows you to add a
>new drive to an existing "pack" and grow the file system into that new
>drive while the system is live...but the utilities are there...
>
>I think the major problem that I'm worried about isn't spreading tables
>across drives, but its when that *one* table grows to the point that its
>about to overflow my drive...I'd rather add a 9gig drive on, make it an
>18gig file system, and let it continue to grow...

These aren't mutally exclusive problems, which is one reason why Oracle
allows you to control things so minutely.  I think they let you spill
a table onto multiple drives, though I'd have to look at one of my
manuals hidden in my piles of more interesting things (I'm no Oracle
expert, I just read manuals :)

There is definitely a sort of tension between the operating systems,
which continue to grow in capability such as you're pointing out,
and commercial systems like Oracle that have to work TODAY regardless
of where operating systems sit on the capability yardstick.

Thus Oracle includes built-in mirroring, while today under Linux
you might as well do software RAID 1, or you can buy a hardware
device that does RAID 1 behind your back and looking like a single
drive no matter how many platter you stuff it with, etc etc.

So...you'll never hear me argue that Postgres should include
a mirroring storage manager.  There is no longer the need for an 
application to do it on its own in the supported OS space (hmmm...I'm
assuming FreeBSD is there, too, right?)

So maybe the notion of application placement of files on particular
spindles is becoming obsolete, too.  It isn't today on Linux...



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


At 11:41 PM 1/13/00 -0500, Bruce Momjian wrote:

>And put sort and large objects somewhere separate too.

Yeah, very very good point!  Though toasted or roasted or toasty or
whatever large objects (I unsubscribed for a few crucial days when
I was back east travelling around with my girlfriend) might make
the large-object issue less important?

Also, for sorting, many sites will just load down with RAM, and this
will increase rapidly over the next few years (despite current
SDRAM high prices and the whole RDRAM fiasco).  But really really
big sites might appreciate such a feature...




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


At 12:59 AM 1/14/00 -0400, The Hermit Hacker wrote:

>if tables/indexs were in different subdirectories, it would be too easy
>for me, at some point in the future, to take just the tables directory and
>put them on their own dedicated drive, halving the space used on either
>drive...
>
>I don't know...IMHO, it sounds like the simplist solution that provides
>the multi-spindle benefits ppl are suggesting...

Splitting tables/indexes seems to be the first-order optimization, from
my talking to folks who are far more experienced with databases than
I (I did mention I wrote my first query less than a year ago, didn't
I?)

Still...encapsulation within the RDBMS itself seems to be in the
spirit of what RDBMS's are all about...such encapsulation could
be expressed in very simple external form and still be useful, but
I think it should be encapsulated...

Among other things, if CREATE TABLESPACE were dumped by pg_dump,
I could move from V7.0 to V8.0 and beyond without having to 
rebuild my distribution structure by hand :)



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


RE: [HACKERS] [hackers]development suggestion needed

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> 
> xun@cs.ucsb.edu (Xun Cheng) writes:
> > I want to experiment with some new fast join algorithms.
> 
> Cool.  Welcome aboard!
> 
> > Could anyone tell me if
> > the directory /docs/pgsql/src/backend/executor is the 
> > right place to start
> 
> The executor is only half the problem: you must also teach the
> planner/optimizer how and when to use the new join type.
> 
> Hiroshi Inoue has recently been down this path (adding support
> for TID-based scans), and might be able to give you more specific
> advice.
>

Hmm,un(??)fortunately,I didn't have to understand details about join
to implement Tidscan. It's a basic scan and is used to scan one relation.

I don't know about document well either,sorry.  AFAIK,Tom is
decidedly superior to me in understanding planner/optimizer. 

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 10:43 PM 1/13/00 -0500, Mike Mascari wrote:

>For example, if several disk drives are available to store the
>database, it might be helpful to store table data in a tablespace
>on one disk drive, and index data in a tablespace on another disk
>drive.

My gosh, imagine - I didn't just make this up!  The amazement :)

> This way, when users query table information, both disk
>drives can work simultaneously, retrieving table and index data
>at the same time.

Overlapped Disk I/O - I remember that term from way back in my
(gulp) PDP-8 days, working as a system hacker...

Look - in case there's any doubt, I'm not trying to toast Postgres,
I'm a fan, interested in getting more involved in the development
scenario.  I raised this issue because Xun raised some "really big
database issues which I as a database theorist have an interest in".
My biggest sin if any is to try to paint the horizon, at this point.
Philip Greenspun still says that those of us (including employee
#3 or so of his company, Ben) who are interested in Postgres are "losers"
by definition (Ben no longer works there).  Meanwhile, folks like Ben
and I keep telling folks that Postgres is growing into an ideal 
RDBMS for database-backed websites (you know, that place where all
money is flowing and will continue to do so tomorrow, though don't
ask me about next week? :)  And Philip says you're a loser if you
won't pay Oracle's license fee.  He speaks as a dude badly bitten
by Illustra, based long ago on a long-dead version of Postgres but
the pain not yet forgotten...

Things like the Oracle documentation cited above fall into the class
of advice to folks running really big - and REALLY BUSY - database
servers.

Sure, hardware (cycles, RAM) fallsin price and as time goes on we
can perhaps forget some of the more horrific optimization stuff that
was invented to deal with small computer systems of one decade ago.

As a compiler writer, trust me - I'm familiar with the concept.  And
with changing pardigms as designs flow from CISC to RISC (oh gosh,
not a theoretical advantage but you mean just a cost/performance point
on the transistor-per-chip curve?  Damn, I should've patented my 
cynicism 10 years ago!) and back to post-RISC CISC, I'm not about
to claim theoretical long-term advantages for any point-of-view.

I won't suggest that all of the big-time hacks employed to make old
commercial DBs like Oracle are necessary in today's hardware/OS climate
(raw I/O in particular falls into that category, IMH-tentative-O)

But, still...as long as we've had movable head disk drives (and my
first disk drive was head-per-track, believe it or not) minimizing
head movement has been an optimization goal.  I've written complex
scheduling disk drivers in the past, and they can be good.  Still,
nothing beats coalescing one spindle's I/O into a narrow number of
tracks, minimizing head movement.  That's a constant that hasn't
changed for 30 years, and won't change next week.

Heck, it even lowers your downtime due to blown drives.

I might also add that the kind of systems Oracle doc writers were
thinking of 10 years ago just aren't in the Postgres universe of
possible DB clients...

But, it is changing.  One impact - like it or not - of the good work
you folks have done over the past couple of (3 or 4 or I'm not personally
sure how much) years and the fact that you continue to push the db into
being more and more capable, more and more stable, more and more 
feature-filled with SQL-92 stuff is that folks being asked to pay
$25,000 for a fully-paid up license on a PIII500 X86 system (<$2000
in hardware without even shopping, a greater than 10-1 software to
hardware ratio) are going to be looking for a cheaper alternative. 

Of which you folks are one.

So, what's the deal, here...is the goal the Big Time or not?




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> ...
> Or we could continue to use symlinks, and just create them ourselves in
> the backend.

But you'd still need some built-in understanding about where the table
is Really Supposed To Be, because you'd need to be able to create and
delete the symlinks on the fly when the table grows past a 1-Gb segment
boundary (or is shrunk back again by vacuum!).

AFAICS, a reasonable solution still requires storing a location path
for each table --- so you might as well just use that path directly.
        regards, tom lane


Don Baccus wrote:

> I'd like to see Postgres succeed in a big way.  I don't see it toppling
> Oracle, but heck I can't see why Interbase can't be ground into dust.
> Open Source, great functionality, maybe B+ on scalability etc (thus not
> toppling Oracle but equal to most others) ... that's not too ambitious
> a goal, is it?
> 

Shoot for the sky you hit the eagle, shoot for the eagle you hit
the ground....

Mike Mascari


Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> Look - in case there's any doubt, I'm not trying to toast Postgres,
> I'm a fan, interested in getting more involved in the development
> scenario.  I raised this issue because Xun raised some "really big
> database issues which I as a database theorist have an interest in".
> My biggest sin if any is to try to paint the horizon, at this point.
> Philip Greenspun still says that those of us (including employee
> #3 or so of his company, Ben) who are interested in Postgres are "losers"
> by definition (Ben no longer works there).  Meanwhile, folks like Ben

I thought Phil was a big fan of ours.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > ...
> > Or we could continue to use symlinks, and just create them ourselves in
> > the backend.
> 
> But you'd still need some built-in understanding about where the table
> is Really Supposed To Be, because you'd need to be able to create and
> delete the symlinks on the fly when the table grows past a 1-Gb segment
> boundary (or is shrunk back again by vacuum!).
> 
> AFAICS, a reasonable solution still requires storing a location path
> for each table --- so you might as well just use that path directly.

Makes sense.  The only advantage to symlinks is that you could use that
information in places you need it, and for normal access use the
symlinks.  You wouldn't have to carry around that info as much.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


> I'd like to see Postgres succeed in a big way.  I don't see it toppling
> Oracle, but heck I can't see why Interbase can't be ground into dust.
> Open Source, great functionality, maybe B+ on scalability etc (thus not
> toppling Oracle but equal to most others) ... that's not too ambitious
> a goal, is it?
> 

We don't block people for working on ietms.  However, we do try and set
priorities based on the open items we have.  I hope we don't appear
heavy-handed in this regard.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [hackers]development suggestion needed

From
Adriaan Joubert
Date:
>You don't get to do either of the latter two unless you write a

> >raw-device storage manager
>
> Not within a single filesystem, but scattering things across spindles
> could be done without a raw-device storage manager :)

Yes, but seen how cheap RAID arrays have become? I know disks are getting
bigger as well, and many people will opt for a single disk, but there may
be more urgent things to fix than something for which a hardware solution
already exists. And lets face it: a database ought to be on RAID
anyway,unless somebody wants to write Tandem-style mirrored disks.... ;-)

Adriaan







Re: [HACKERS] [hackers]development suggestion needed

From
Oleg Bartunov
Date:
On Thu, 13 Jan 2000, Don Baccus wrote:

> Date: Thu, 13 Jan 2000 18:19:33 -0800
> From: Don Baccus <dhogaza@pacifier.com>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Xun Cheng <xun@cs.ucsb.edu>, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] [hackers]development suggestion needed 
> 
...skipped...

> The selects that such sites spew forth are handled wonderfully
> by Postgres now, with MVCC and the change that stops the update
> of pg_log after read-only selects. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
Just curious,
Does plain 6.5.3 handle  read-only selects in this way ?
  Regards,    Oleg

> 
> - Don Baccus, Portland OR <dhogaza@pacifier.com>
>   Nature photos, on-line guides, Pacific Northwest
>   Rare Bird Alert Service and other goodies at
>   http://donb.photo.net.
> 
> ************
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] [hackers]development suggestion needed

From
Malcolm Beattie
Date:
Don Baccus writes:
> At 12:21 AM 1/14/00 -0400, The Hermit Hacker wrote:
> 
> >All the major OSs out there have "disk management tools" that allow you to
> >build "large file systems" out of smaller ones... Solaris has DiskSuite,
> >FreeBSD has vinum, Linux has ??... why are we looking/investigating adding
> >a level of complexity to PostgreSQL to handle something that, as far as I
> >know, each of the OSs out there already has a way of dealing with?
> 
> If the resident OS can handle the issue, sure, it is worth investigating.
> Linux today does not (at least, the one I'm running).

Linux has software raid (often called "md") with most of the usual
bells and whistles (RAID0, RAID1, RAID5, RAID0+1, hot spares,
background reconstruction). You can patch in LVM (logical volume
management) although the distributions of which I am aware don't ship
it ready-patched. That's the equivalent of Digi^H^H^H^HTru64 UNIX LSM
and AIX and so on have similar things. Basically, join together
physical disk units into logical block devices with additions being
possible on the fly. If you put an ext2 filesystem on one of those,
then you can dynamically resize it with e2resize, although that is not
completely production quality yet and last I heard you could currently
only increase the filesystem size on the fly and not decrease it. ISTR
the competition tend only to allow increase and not shrink but the
ext2 one is designed to allow shrink too. The complexity isn't so much
in the basics ("simply" throw in more block groups and be carefuly
about atomicity if the system is live); it's in stuff like making sure
that the system is robust against fragmentation, goal allocation needs
tweaking (I think) and how you gather together admin information about
where all the bits are. If you break apart the separate disks of a
live filesystem, it's nice to know where all the bits go.

Even with all that underlying stuff, it's *still* important for higher
level configuration at the database level to be possible. Even if from
the theoretical point of view it's all one big page space, it matters
a great deal in practice to be able to spread different bits out over
different table spaces/volumes/files/block devices/whatever.

I think that means I'm in violent agreement with you on the db side
but this reply does give me a chance to point out that Linux isn't
missing the functionality you haven't noticed in it :-).

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services


Re: [HACKERS] [hackers]development suggestion needed

From
The Hermit Hacker
Date:
On Thu, 13 Jan 2000, Don Baccus wrote:

> So, what's the deal, here...is the goal the Big Time or not?

If it means adopting one vendors concept of what the world should look
like...not.

I *hate* the way Oracle sets up tablespaces ... where i have to pre-guess
the size of my data and allocate space accordingly...what if my table
never does reach that critical mass?  I've just wasted X meg of space ...

I hate the way that Oracle starts up something like 4 processes for every
database, when the server is started...

I think the route we've taken since day one, and hope that we continue
along it...we've created "hacks" along the way to appease users of the
different vendors for SQL relateed issues, but our underlying structure
has stayed, I think, pretty unique...

We haven't been designing a FreeOracle...we've been working on and
designing a Free, full featured RDBMS that follows the specifications laid
down ... with our own PostgreSQLism's thrown in here and there ...

If that happens to follow what one vendor happens to have done as far as
their implementation, great...but there has been no conscious effort to do
so that I'm aware of...

Just look at the whole OUTER JOIN issue ... *shrug*

I *like* the fact that we come up with original ideas/solutions to
problems, that make use of *existing* technologies ...

I liked the thread about moving indexes and tables to seperate file
systems, and hope we can implement something that will make it something
that does't require 'ln's, but I definitely don't like Oracle's way of
doing things ...


Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] [hackers]development suggestion needed

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
>> The selects that such sites spew forth are handled wonderfully
>> by Postgres now, with MVCC and the change that stops the update
>> of pg_log after read-only selects.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
> Does plain 6.5.3 handle  read-only selects in this way ?

AFAIR that logic is in 6.5.*.  (Vadim would remember better, since he
put it in.)  But keep in mind that a SELECT is read-only just to the
extent that it is hitting previously committed tuples.  The first visit
to a newly committed-good or newly committed-dead tuple will cause an
update and write-back of the tuple's status flags --- whether that visit
happens in SELECT or anything else.

It occurs to me that the no-log-update logic could probably be improved
on.  The test to see whether a log update is needed looks at whether any
buffers have been written.  A SELECT that marks someone else's tuples as
known-committed will look like it needs to be committed in pg_log
... but it doesn't really need it.  Perhaps Vadim is planning to fix
this in the WAL rewrite.
        regards, tom lane


Re: [HACKERS] [hackers]development suggestion needed

From
Thomas Lockhart
Date:
> I thought Phil was a big fan of ours.

He wants to be. But ihho we are not yet worthy :))
                 - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 12:04 PM 1/14/00 +0300, Oleg Bartunov wrote:

>> The selects that such sites spew forth are handled wonderfully
>> by Postgres now, with MVCC and the change that stops the update
>> of pg_log after read-only selects.
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
>Just curious,
>Does plain 6.5.3 handle  read-only selects in this way ?

Yes.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 10:35 AM 1/14/00 +0000, Malcolm Beattie wrote:

>Linux has software raid (often called "md") with most of the usual
>bells and whistles (RAID0, RAID1, RAID5, RAID0+1, hot spares,
>background reconstruction).

Yes, I know.

> You can patch in LVM (logical volume
>management) although the distributions of which I am aware don't ship
>it ready-patched.

Right.

>I think that means I'm in violent agreement with you on the db side
>but this reply does give me a chance to point out that Linux isn't
>missing the functionality you haven't noticed in it :-).

Linux also has a journaling filesystem available, if you're brave and
wanting to be at the bleeding edge.  That's sort of like using next
week's Postgres development sources in yesterday's production environment,
though - a bit risky :)

I should've made it clear that when I was talking about the current,
widespread releases that I'm familiar with.  RedHat, in particular.

The RAID stuff is out-of-the-box supported today, fancier stuff
like the journaling file-system will be out-of-the-box supported before
too much longer.  It's exciting to see Linux improve steadily just as it's
exciting to see Postgres improve steadily.




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 10:11 AM 1/14/00 -0500, Tom Lane wrote:
>Oleg Bartunov <oleg@sai.msu.su> writes:
>>> The selects that such sites spew forth are handled wonderfully
>>> by Postgres now, with MVCC and the change that stops the update
>>> of pg_log after read-only selects.
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
>> Does plain 6.5.3 handle  read-only selects in this way ?
>
>AFAIR that logic is in 6.5.*.  (Vadim would remember better, since he
>put it in.)

It is.  I'd notice right away if it wasn't, the decibel level on my
little database server would go 'way up because it went 'way down when
I applied the patch to my 6.5 beta.  It sits six inches from me so
I'd know for sure!

>It occurs to me that the no-log-update logic could probably be improved
>on.  The test to see whether a log update is needed looks at whether any
>buffers have been written.  A SELECT that marks someone else's tuples as
>known-committed will look like it needs to be committed in pg_log
>... but it doesn't really need it.  Perhaps Vadim is planning to fix
>this in the WAL rewrite.

No idea if he is or isn't, but the patch is very simple and is based
on whether or not buffers got dirty, not whether or not the select 
itself changed anything, IIRC.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 07:58 AM 1/14/00 +0000, Adriaan Joubert wrote:
>>You don't get to do either of the latter two unless you write a
>
>> >raw-device storage manager
>>
>> Not within a single filesystem, but scattering things across spindles
>> could be done without a raw-device storage manager :)
>
>Yes, but seen how cheap RAID arrays have become? I know disks are getting
>bigger as well, and many people will opt for a single disk, but there may
>be more urgent things to fix than something for which a hardware solution
>already exists. And lets face it: a database ought to be on RAID
>anyway,unless somebody wants to write Tandem-style mirrored disks.... ;-)

Don't need to write Tandem-style mirrored disks, the Linux kernal 
implements mirrored file systems for me.  I can mirror UW2 disks in
software for $189/spindle (current cost of an IBM Deskstar UW2 7200 RPM
4.5 GB spindle here in Oregon), the fancier RAID arrays still aren't that
cheap.  

The cheapest RAID interfaces just hide the mirroring from you.  There's
a tier up that take a cluster of mirrored (or RAID 5'd) platters and
present them to you as a single large disk - these remove a lot of
one's control over spindle placement, sure.  My guess is that some 
folks don't view this as a plus...



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


At 01:16 AM 1/14/00 -0500, Bruce Momjian wrote:

>We don't block people for working on ietms.  However, we do try and set
>priorities based on the open items we have.  I hope we don't appear
>heavy-handed in this regard.

Not at all.  And though I've triggered this dicussion, I'd be the
first to agree it is minor in importance compared to things like
outer joins, the WAL implementation, and Jan's large object work.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 11:05 AM 1/14/00 -0400, The Hermit Hacker wrote:
>On Thu, 13 Jan 2000, Don Baccus wrote:
>
>> So, what's the deal, here...is the goal the Big Time or not?
>
>If it means adopting one vendors concept of what the world should look
>like...not.

>I *hate* the way Oracle sets up tablespaces ... where i have to pre-guess
>the size of my data and allocate space accordingly...what if my table
>never does reach that critical mass?  I've just wasted X meg of space ...

And I'm not suggesting that anything like this level of (as I described
it in my previous note) anal retentive control be implemented.  Anal
retentive IT managers won't be happy unless they're paying Oracle
$25/power unit anyway.

But being able to spread tables and indices around several spindles
would improve scalability.  I think the very simple approach that's
been kicked around would work for anyone we care about (me!:)

I mentioned the Oracle details in part because it's not clear to me
how much folks here know about Oracle.  I don't know all that much,
only enough to know that any database that initializes its defaults
to useless values is a pain in the ass in ways that customers shouldn't
put up with.  I don't understand Oracle's approach, there, seducing you
into letting it build a default installation which is then virtually
useless.

>We haven't been designing a FreeOracle...

I'm certainly not arguing for this...remember, I did argue against
"(+)" in favor of SQL 92 outer joins :)

>I liked the thread about moving indexes and tables to seperate file
>systems, and hope we can implement something that will make it something
>that does't require 'ln's, but I definitely don't like Oracle's way of
>doing things ...

I agree...I was simply providing a datapoint, not suggesting it was one
Postgres should emulate.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Don Baccus
Date:
At 01:22 AM 1/14/00 -0500, Bruce Momjian wrote:
>> Look - in case there's any doubt, I'm not trying to toast Postgres,
>> I'm a fan, interested in getting more involved in the development
>> scenario.  I raised this issue because Xun raised some "really big
>> database issues which I as a database theorist have an interest in".
>> My biggest sin if any is to try to paint the horizon, at this point.
>> Philip Greenspun still says that those of us (including employee
>> #3 or so of his company, Ben) who are interested in Postgres are "losers"
>> by definition (Ben no longer works there).  Meanwhile, folks like Ben
>
>I thought Phil was a big fan of ours.

He's moderated his opinion considerably over the past several
months.  To some extent you might say he's had his opinion
moderated for him.  Feel free to extrapolate :)

(not just me, or even primarily me, but folks like Ben Adida who'veworked with Philip at MIT and then Ars Digita are
deeplyinterestedin seeing a successful version of the Ars Digita toolkit based onPostgres, Ben also coordinates
AOLserverreleases for Ars Digita/MIT)
 

Still, as recently as six months ago Philip flamed some English gent
in public for suggesting an ACS port to Postgres or another free
or cheap RDBMS, and went on the e-mail the guy nasty notes in private.
Or so the gent sez.

I know Philip was surprised and impressed by the great leap forward
embodied by 6.5.

As was I - I'd given up on 6.4.

But Philip is mostly concerned with the clients that feed his very
rapidly growing company, and while he'll release his toolkit sources
still tells folks you really need Oracle.  His most recent criticism
of Postgres shrunk to two items (not referential integrity, no outer
joins), one of which has disappeared in current sources.  

Apparently he doesn't know how weak large object support is, I won't
tell him, either...

Anyway, this isn't about Philip's opinions so much as the fact that
Postgres has had a very spotty reputation in the past, but is improving
so quickly and predictably that its reputation is also steadily
improving.  He serves as an example of someone who's convinced that
Postgres has greatly improved but remains skeptical that it's improved
enough to do serious work with.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


Re: [HACKERS] [hackers]development suggestion needed

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> xun@cs.ucsb.edu (Xun Cheng) writes:
> > I want to experiment with some new fast join algorithms.
> 
> Cool.  Welcome aboard!
> 
> > Could anyone tell me if
> > the directory /docs/pgsql/src/backend/executor is the
> > right place to start
> 
> The executor is only half the problem: you must also teach the
> planner/optimizer how and when to use the new join type.
> 
> Hiroshi Inoue has recently been down this path (adding support
> for TID-based scans), and might be able to give you more specific
> advice.
> 
> > 1. Does postgresql do raw storage device management or it relies
> >    on file system? My impression is no raw device. If no,
> >    is it difficult to add it and possibly how?
> 
> Postgres uses Unix files.  We have avoided raw-device access mostly on
> grounds of portability.  To persuade people that such a change should go
> into the distribution, you'd need to prove that *significantly* better
> performance is obtained with raw access.  I for one don't think it's a
> foregone conclusion; Postgres gets considerable benefit from sitting
> atop Unix kernel device schedulers and disk buffer caches.
> 
> As far as the actual implementation goes, the low level access methods
> go through a "storage manager" switch that was intended to allow for
> the addition of a new storage manager, such as a raw-device manager.
> So you could get a good deal of stuff working by implementing code that
> parallels md.c/fd.c.  The main problem at this point is that there is a
> fair amount of utility code that goes out and does its own manipulation
> of the database file structure.  You'd need to clean that up by pushing
> it all down below the storage manager switch (inventing new storage
> manager calls as needed).
> 
> >    that the available join algos implemented are nested loop
> >    join (including index-based), hash join (which one? hybrid),
> >    sort-merge join?
> 
> Right.  The hash join uses batching if it estimates that the relation
> is too large to fit in memory; is that what you call "hybrid"?

I've heard the word "hybrid" being used of a scheme where you hash each 
key of a multi-key index separately and then concatenate the hashes for 
the index. That way you can use the index for accessing also subsets of 
keys by examining only the buxkets with matching hash sections.

Does postgres do it even when generating the keys ?

I'd guess it does, as each hashable type has a hashing function.

OTOH pg probably does not use it for finding by the 3rd field of index ?

--------
Hannu


Re: [HACKERS] hybrid hash, cont. of development suggestion needed

From
Xun Cheng
Date:
First I want to thank you for all help I got to my
earlier posting.


>> Right.  The hash join uses batching if it estimates that the relation
>> is too large to fit in memory; is that what you call "hybrid"?
>
>I've heard the word "hybrid" being used of a scheme where you hash each 
>key of a multi-key index separately and then concatenate the hashes for 
>the index. That way you can use the index for accessing also subsets of 
>keys by examining only the buxkets with matching hash sections.

In research, there are traditionally three kinds of hash joins:
simple hash, grace hash and hybrid hash. Hybrid is generally considered
to be having a better performance since it is designed to combine
the best behavior of simple hash and grace hash.
It has two phases. In the first the relations are read, hashed into
buckets, and written out, as in grace hash. However, during this phase
a portion of the memory is reserved for an in-memory hash bucket for R (
R is joining with S and R is smaller). This bucket of R will never
be written to disk.

xun



Re: [HACKERS] hybrid hash, cont. of development suggestion needed

From
Tom Lane
Date:
Xun Cheng <xun@cs.ucsb.edu> writes:
> In research, there are traditionally three kinds of hash joins:
> simple hash, grace hash and hybrid hash. Hybrid is generally considered
> to be having a better performance since it is designed to combine
> the best behavior of simple hash and grace hash.
> It has two phases. In the first the relations are read, hashed into
> buckets, and written out, as in grace hash. However, during this phase
> a portion of the memory is reserved for an in-memory hash bucket for R (
> R is joining with S and R is smaller). This bucket of R will never
> be written to disk.

Yes, that's how nodeHash.c does it...
        regards, tom lane