Thread: [hackers] development suggestion needed
Hi, everyone. I like your work very much and hope PostgreSQL can grow into something competitive with Oracle, just like Linux vs. Windows. I have a background in relational database management system research and I want to try to be a developer for PostgreSQL. Right now I am only trying to become familiar with your code base. I plan to start with a specific function module in the backend. I'm thinking of /docs/pgsql/src/backend/executor because I want to experiment with some new fast join algorithms. My long-term objective is to introduce a materialized view subsystem into PostgreSQL. Could anyone tell me if the directory /docs/pgsql/src/backend/executor is the right place to start, or just give me some general suggestions which are not in the FAQs? One more thing I want to mention: the join algorithms I want to experiment with may have some special data access paths, similar to an index.

Further, if it doesn't bother you much, could someone answer the following question(s) for me? (Sorry if some are already in the docs.)

1. Does PostgreSQL do raw storage device management, or does it rely on the file system? My impression is no raw device. If no, is it difficult to add, and possibly how?

2. Do you have standard benchmark results for PostgreSQL? I guess not, since it only implements a subset of SQL'92. What about a subset of a benchmark, or something repeatable?

3. Suppose I have added a new two-relation join algorithm. How would I proceed to compare its performance with the existing two-relation join algorithms under different scenarios? Are there any existing facilities in the current code base for this purpose? Am I right that the available join algorithms implemented are nested loop join (including index-based), hash join (which one? hybrid), and sort-merge join?

4. Usually a single sequential pass over a large joining relation is preferred to random access in a large join operation, mostly because of current disk access characteristics. Is it possible for me to do some benchmarking about this using PostgreSQL? What I'm actually asking about are the issues of how to control the flow of data from disk to buffers, how to stop file system interference, and how to arrange actual data placement on the disk.

Sorry again if I'm not clear with my questions. I'd like to explain them further if necessary.

Thanks for any help,
xun
> I have a background in relational database management system research
> and I want to try to be a developer for PostgreSQL. Right now I am only
> trying to become familiar with your code base. I plan to start with a
> specific function module in the backend. I'm thinking of
> /docs/pgsql/src/backend/executor because I want to experiment with some
> new fast join algorithms. My long-term objective is to introduce a
> materialized view subsystem into PostgreSQL. Could anyone tell me if the
> directory /docs/pgsql/src/backend/executor is the right place to start,
> or just give me some general suggestions which are not in the FAQs? One
> more thing I want to mention: the join algorithms I want to experiment
> with may have some special data access paths, similar to an index.

Good.

> Further, if it doesn't bother you much, could someone answer the
> following question(s) for me? (Sorry if some are already in the docs.)
>
> 1. Does PostgreSQL do raw storage device management, or does it rely on
> the file system? My impression is no raw device. If no, is it difficult
> to add, and possibly how?

No, only file system. We don't see much advantage to raw I/O.

> 2. Do you have standard benchmark results for PostgreSQL? I guess not,
> since it only implements a subset of SQL'92. What about a subset of a
> benchmark, or something repeatable?

We do the Wisconsin. I think it is in the source tree.

> 3. Suppose I have added a new two-relation join algorithm. How would I
> proceed to compare its performance with the existing two-relation join
> algorithms under different scenarios? Are there any existing facilities
> in the current code base for this purpose? Am I right that the available
> join algorithms implemented are nested loop join (including
> index-based), hash join (which one? hybrid), and sort-merge join?

You can control the join types used with flags to postgres. Very easy.
> 4. Usually a single sequential pass over a large joining relation is
> preferred to random access in a large join operation, mostly because of
> current disk access characteristics. Is it possible for me to do some
> benchmarking about this using PostgreSQL? What I'm actually asking about
> are the issues of how to control the flow of data from disk to buffers,
> how to stop file system interference, and how to arrange actual data
> placement on the disk.

Good idea. We deal with this regularly in the optimizer when deciding between an index scan and a sequential scan. Our optimizer is quite good.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
xun@cs.ucsb.edu (Xun Cheng) writes:
> I want to experiment with some new fast join algorithms.

Cool. Welcome aboard!

> Could anyone tell me if the directory /docs/pgsql/src/backend/executor
> is the right place to start

The executor is only half the problem: you must also teach the planner/optimizer how and when to use the new join type. Hiroshi Inoue has recently been down this path (adding support for TID-based scans), and might be able to give you more specific advice.

> 1. Does PostgreSQL do raw storage device management, or does it rely on
> the file system? My impression is no raw device. If no, is it difficult
> to add, and possibly how?

Postgres uses Unix files. We have avoided raw-device access mostly on grounds of portability. To persuade people that such a change should go into the distribution, you'd need to prove that *significantly* better performance is obtained with raw access. I for one don't think it's a foregone conclusion; Postgres gets considerable benefit from sitting atop Unix kernel device schedulers and disk buffer caches.

As far as the actual implementation goes, the low-level access methods go through a "storage manager" switch that was intended to allow for the addition of a new storage manager, such as a raw-device manager. So you could get a good deal of stuff working by implementing code that parallels md.c/fd.c. The main problem at this point is that there is a fair amount of utility code that goes out and does its own manipulation of the database file structure. You'd need to clean that up by pushing it all down below the storage manager switch (inventing new storage manager calls as needed).

> Am I right that the available join algorithms implemented are nested
> loop join (including index-based), hash join (which one? hybrid), and
> sort-merge join?

Right. The hash join uses batching if it estimates that the relation is too large to fit in memory; is that what you call "hybrid"?
> 4. Usually a single sequential pass over a large joining relation is
> preferred to random access in a large join operation, mostly because of
> current disk access characteristics. Is it possible for me to do some
> benchmarking about this using PostgreSQL? What I'm actually asking about
> are the issues of how to control the flow of data from disk to buffers,
> how to stop file system interference, and how to arrange actual data
> placement on the disk.

You don't get to do either of the latter two unless you write a raw-device storage manager --- which'd be a fair amount of work for what might be little return. Are you convinced that that ought to be the first thing you work on? I'd be inclined to think about join algorithms in the abstract, without trying to control physical disk placement of the data...

regards, tom lane
At 08:23 PM 1/13/00 -0500, Tom Lane wrote:
>As far as the actual implementation goes, the low level access methods
>go through a "storage manager" switch that was intended to allow for
>the addition of a new storage manager, such as a raw-device manager.
>So you could get a good deal of stuff working by implementing code that
>parallels md.c/fd.c. The main problem at this point is that there is a
>fair amount of utility code that goes out and does its own manipulation
>of the database file structure. You'd need to clean that up by pushing
>it all down below the storage manager switch (inventing new storage
>manager calls as needed).

This would need to be done to implement some sort of tablespace-style facility, too, right? I'm off Xun's thread in asking, but I've been wondering. DBs like Oracle allow you to place tables and indices wherever you like in the filesystem. This is normally done to distribute things across different spindles, and in large, busy databases it makes a significant difference. I've done some experimenting moving index files to a different spindle (using "ln" to fool postgres, of course), and insertions go measurably faster. Spindles are so cheap nowadays :)

I know there's been discussion of letting folks specify where the WAL will be placed when it's implemented, for safety's sake - it will also improve performance.

>You don't get to do either of the latter two unless you write a
>raw-device storage manager

Not within a single filesystem, but scattering things across spindles could be done without a raw-device storage manager :) (Not what he's talking about, but heck, thought I'd raise it.)

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
Don Baccus <dhogaza@pacifier.com> writes:
> This would need to be done to implement some sort of tablespace-style
> facility, too, right? I'm off Xun's thread in asking but I've been
> wondering. DBs like Oracle allow you to place tables and indices
> wherever you like in the filesystem. This is normally done to
> distribute things across different spindles, and in large, busy
> databases makes a significant difference. I've done some experimenting
> moving index files to a different spindle (using "ln" to fool
> postgres, of course) and insertions go measurably faster. Spindles
> are so cheap nowadays :)

As you say, you can fake it manually with symbolic links, but that's a kluge.

The "database location" stuff that Peter and Thomas have been arguing about is intended to allow a single postmaster to control databases that are in multiple physical locations --- but there seems to be some debate as to whether it works ;-). (I never tried it.) In any case, we don't currently have any official provision for controlling location at finer than database level. It'd be nice to be able to push individual tables around, I suppose.

This wouldn't require a new storage manager, since presumably you'd still be using the Unix-filesystem storage manager. The trick would be to allow a path rather than just a base file name to be specified per-relation. I'm not sure if it'd be hard or not. Probably, all the system tables would have to stay in the database's default directory, but maybe user tables could be given path names without too much trouble...

regards, tom lane
At 09:01 PM 1/13/00 -0500, Tom Lane wrote:
>As you say, you can fake it manually with symbolic links, but that's
>a kluge.

Yes, it is. Or worse :)

>The "database location" stuff that Peter and Thomas have been arguing
>about is intended to allow a single postmaster to control databases that
>are in multiple physical locations --- but there seems to be some debate
>as to whether it works ;-). (I never tried it.) In any case, we don't
>currently have any official provision for controlling location at finer
>than database level. It'd be nice to be able to push individual tables
>around, I suppose.

Putting indices on different spindles than the tables is known to significantly speed up the Ars Digita Community system under load with Oracle. Systems like this, used to back busy web sites, stuff things into tables many times a second. As I mentioned, I've played around a bit with postgres using "ln", and it does indeed help boost the number of inserts my (paltry, two-spindle) system could sustain. The selects that such sites spew forth are handled wonderfully by Postgres now, with MVCC and the change that stops the update of pg_log after read-only selects.

My site's still in the experimental stage, being used by a couple dozen folks to record bird distribution data in the Pacific NW, so I don't personally have real-world data to get a feeling for how important this might become. Still, Oracle DBA docs talk a lot about it, so in some real-world scenarios being able to distribute tables and indices on different spindles must pay off.

>This wouldn't require a new storage manager, since presumably you'd
>still be using the Unix-filesystem storage manager. The trick would be
>to allow a path rather than just a base file name to be specified
>per-relation. I'm not sure if it'd be hard or not. Probably, all the
>system tables would have to stay in the database's default directory,
>but maybe user tables could be given path names without too much
>trouble...

I've looked into it, actually, and have reached the same conclusion. Including the bit about keeping system tables in the database's default directory.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
> As you say, you can fake it manually with symbolic links, but that's
> a kluge.
>
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-). (I never tried it.) In any case, we don't
> currently have any official provision for controlling location at finer
> than database level. It'd be nice to be able to push individual tables
> around, I suppose.
>
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager. The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation. I'm not sure if it'd be hard or not. Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

Or we could continue to use symlinks, and just create them ourselves in the backend.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> This would need to be done to implement some sort of tablespace-style
> facility, too, right? I'm off Xun's thread in asking but I've been
> wondering. DBs like Oracle allow you to place tables and indices
> wherever you like in the filesystem. This is normally done to
> distribute things across different spindles, and in large, busy
> databases makes a significant difference. I've done some experimenting
> moving index files to a different spindle (using "ln" to fool
> postgres, of course) and insertions go measurably faster. Spindles
> are so cheap nowadays :)

WAL will add the oid to the base file name. That may make tablespaces easier.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-). (I never tried it.) In any case, we don't
> currently have any official provision for controlling location at finer
> than database level. It'd be nice to be able to push individual tables
> around, I suppose.
>
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager. The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation. I'm not sure if it'd be hard or not. Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

This has been possible since PostgreSQL was born, unless I misunderstand what you are saying...

test=> create table "/tmp/t1" (i int);
CREATE

bash$ ls -l /tmp/t1
-rw-------   1 postgres  postgres  0 Jan 14 11:19 /tmp/t1

Even:

test=> create table "../test2/pg_proc" (i int);
ERROR:  cannot create ../test2/pg_proc

This is not good. Maybe we should forbid this kind of table name.

BTW, it would be nice to add a "table space" concept to the create table statement:

-- reserve a table space named 'foo' which is physically located under
-- /pg/myspace. Only the PostgreSQL super user can execute this command,
-- to avoid security risks.
create table space foo as '/pg/myspace';

-- create table t1 under /pg/myspace
create table t1 (i int) with table space 'foo';

--
Tatsuo Ishii
At 11:42 AM 1/14/00 +0900, Tatsuo Ishii wrote:
>This is possible since PostgreSQL was born unless I misunderstand what
>you are saying...
>
>test=> create table "/tmp/t1" (i int);

Clever... Of course, when I'm porting over thousands of lines of an Oracle data model and tens of thousands of lines of scripting code that refers to these tables via queries, this is a very inconvenient way to locate a particular table in a particular place. It involves changing a lot of code... besides being somewhat ... baroque? :)

>BTW, it would be nice to add a "table space" concept to the create
>table statement.

I figured you felt that way!

>-- reserve a table space named 'foo' which is physically located under
>-- /pg/myspace. Only PostgreSQL super user can execute this command
>-- to avoid security risks.
>create table space foo as '/pg/myspace';
>
>-- create table t1 under /pg/myspace
>create table t1 (i int) with table space 'foo';

Yes, that's the Oracle-ish style of it. Of course, Oracle allows all sorts of anal retentive features, like allowing a DBA to restrict the size of the tablespace, etc., that I personally don't care about... though I understand why they're important to some.

Oracle tables and indices within a single tablespace all live in one file (if you're using the filesystem rather than raw I/O), so they also provide features which allow you to specify how big a chunk to allocate per extent. (Oracle pre-allocates to avoid running out of disk space while you're running except in ways that you control, and in hopes of getting contiguous chunks of disk storage, because they hope you're using previously empty disks used only for Oracle.) Features like this don't fit well with each table/index residing in its own file. Personally I don't have any need for them, either, but as Postgres gets more popular (as it will as it continues to improve) it may attract the attention of folks with traditional DBA requirements like this.

Of course, that would require a new storage manager, one similar in concept to what would be needed to implement raw I/O.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
> Oracle tables and indices within a single tablespace all live in
> one file (if you're using filesystem rather than raw I/O), so
> they also provide features which allow you to specify how big
> a chunk to allocate per extent (Oracle pre-allocates to avoid
> running out of disk space while you're running except in ways
> that you control, and in hopes of getting contiguous chunks of
> disk storage because they hope you're using previously empty
> disks used only for Oracle).

And with data and index in the same file, you can't split them across spindles.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> > Oracle tables and indices within a single tablespace all live in
> > one file (if you're using filesystem rather than raw I/O), so
> > they also provide features which allow you to specify how big
> > a chunk to allocate per extent (Oracle pre-allocates to avoid
> > running out of disk space while you're running except in ways
> > that you control, and in hopes of getting contiguous chunks of
> > disk storage because they hope you're using previously empty
> > disks used only for Oracle).
>
> And with data and index in the same file, you can't split them across
> spindles.

But you can certainly do that in ORACLE, if you wish. In fact, ORACLE recommends it:

    Place Data Files for Maximum Performance

    Tablespace location is determined by the physical location of the
    data files that constitute that tablespace. Use the hardware
    resources of your computer appropriately. For example, if several
    disk drives are available to store the database, it might be
    helpful to store table data in a tablespace on one disk drive, and
    index data in a tablespace on another disk drive. This way, when
    users query table information, both disk drives can work
    simultaneously, retrieving table and index data at the same time.

Mike Mascari
On Thu, 13 Jan 2000, Tom Lane wrote:

> Don Baccus <dhogaza@pacifier.com> writes:
> > This would need to be done to implement some sort of tablespace-style
> > facility, too, right? I'm off Xun's thread in asking but I've been
> > wondering. DBs like Oracle allow you to place tables and indices
> > wherever you like in the filesystem. This is normally done to
> > distribute things across different spindles, and in large, busy
> > databases makes a significant difference. I've done some experimenting
> > moving index files to a different spindle (using "ln" to fool
> > postgres, of course) and insertions go measurably faster. Spindles
> > are so cheap nowadays :)
>
> As you say, you can fake it manually with symbolic links, but that's
> a kluge.
>
> The "database location" stuff that Peter and Thomas have been arguing
> about is intended to allow a single postmaster to control databases that
> are in multiple physical locations --- but there seems to be some debate
> as to whether it works ;-). (I never tried it.) In any case, we don't
> currently have any official provision for controlling location at finer
> than database level. It'd be nice to be able to push individual tables
> around, I suppose.
>
> This wouldn't require a new storage manager, since presumably you'd
> still be using the Unix-filesystem storage manager. The trick would be
> to allow a path rather than just a base file name to be specified
> per-relation. I'm not sure if it'd be hard or not. Probably, all the
> system tables would have to stay in the database's default directory,
> but maybe user tables could be given path names without too much
> trouble...

Okay, I've been thinking about this recently with the whole Udmsearch of PostgreSQL. We just put a 9gig drive online to handle this, as well as other database-related projects, since I wanted a lot of room to grow (PostgreSQL itself indexed out to something like 1gig, and the lists are growing)...

All the major OSs out there have "disk management tools" that allow you to build "large file systems" out of smaller ones... Solaris has DiskSuite, FreeBSD has vinum, Linux has ??... why are we looking at/investigating adding a level of complexity to PostgreSQL to handle something that, as far as I know, each of the OSs out there already has a way of dealing with?

Some aren't necessarily mature yet... Solaris's is the only one that I'm aware of that has a *beautiful* growfs program that allows you to add a new drive to an existing "pack" and grow the file system into that new drive while the system is live... but the utilities are there...

I think the major problem that I'm worried about isn't spreading tables across drives, but when that *one* table grows to the point that it's about to overflow my drive... I'd rather add a 9gig drive on, make it an 18gig file system, and let it continue to grow...

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tatsuo Ishii
>
> BTW, it would be nice to add a "table space" concept to the create
> table statement.
>
> -- reserve a table space named 'foo' which is physically located under
> -- /pg/myspace. Only PostgreSQL super user can execute this command
> -- to avoid security risks.
> create table space foo as '/pg/myspace';
>
> -- create table t1 under /pg/myspace
> create table t1 (i int) with table space 'foo';

I agree with Tatsuo, though I prefer

    create table t1 (i int) tablespace foo;

Isn't it preferable to encapsulate the table location and storage type? At first, a tablespace would only correspond to a directory, and it won't be so difficult to implement. But we would gain a lot with the feature. In the future, the tablespace may be changed to mean a real(??) tablespace.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Multiple Spindles (Was: Re: [HACKERS] [hackers] development suggestion needed)
From: The Hermit Hacker
On Thu, 13 Jan 2000, Don Baccus wrote:

> My site's still in the experimental stage, being used by a couple
> dozen folks to record bird distribution data in the Pacific NW, so
> I don't personally have real-world data to get a feeling for how
> important this might become. Still, Oracle DBA docs talk a lot
> about it so in some real-world scenarios being able to distribute
> tables and indices on different spindles must pay off.

What would it take to break the data/base/<database> directory down? To something like, maybe:

    data/base/<database>/pg_*
                        /tables/*
                        /indices/*

Then one could easily mount a drive as /tables and another one as /indices... We know the difference between a table and an index, so I wouldn't think it would be *too* hard to add /tables/ internally to the existing path... would it? You'd basically have something like:

    sprintf("%s/data/base/%s/tables/%s", data_dir, database, tablename);

Instead of:

    sprintf("%s/data/base/%s/%s", data_dir, database, tablename);

I know, I'm being simplistic here, but... Or, a different way:

    if (table)
        sprintf("%s/data/base/table/%s/%s", data_dir, database, tablename);
    else if (index)
        sprintf("%s/data/base/index/%s/%s", data_dir, database, tablename);
    else
        sprintf("%s/data/base/sys/%s/%s", data_dir, database, sysfile);

This would give you the ability to put all tables from all databases onto one file system, all indexes onto another, and all system files onto a third...

I don't know, I'm oversimplifying and spewing thoughts out again... but... *shrug*

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
Re: Multiple Spindles (Was: Re: [HACKERS] [hackers] development suggestion needed)
From: Bruce Momjian
> On Thu, 13 Jan 2000, Don Baccus wrote:
>
> > My site's still in the experimental stage, being used by a couple
> > dozen folks to record bird distribution data in the Pacific NW, so
> > I don't personally have real-world data to get a feeling for how
> > important this might become. Still, Oracle DBA docs talk a lot
> > about it so in some real-world scenarios being able to distribute
> > tables and indices on different spindles must pay off.
>
> What would it take to break the data/base/<database> directory down? To
> something like, maybe:
>
>     data/base/<database>/pg_*
>                         /tables/*
>                         /indices/*

And put sort and large objects somewhere separate too.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>I agree with Tatsuo though I prefer
>    create table t1 (i int) tablespace foo;
>
>Isn't it preferable to encapsulate the table location and storage type ?

Agreed.

>At first,a tablespace would only correspond to a directory and it won't
>be so difficult to implement. But we would gain a lot with the feature.
>
>In the future,the tablespace may be changed to mean real(??)
>tablespace.

Good point.

> I think the major problem that I'm worried about isn't spreading tables
> across drives, but its when that *one* table grows to the point that its
> about to overflow my drive...I'd rather add a 9gig drive on, make it an
> 18gig file system, and let it continue to grow...

We could extend the create tablespace command, something like:

    create tablespace foo as '/pg/myspace1 /pg/myspace2';

to spread a table (space) among different disk drives. Moreover, we could define the "policy" used for the tablespace:

    create tablespace foo as '/pg/myspace1 /pg/myspace2' policy roundrobin;

In the above case, if the table hits the 1GB limit in /pg/myspace1, then the new segment will be created in /pg/myspace2. Just an idea...

--
Tatsuo Ishii
Re: Multiple Spindles (Was: Re: [HACKERS] [hackers] development suggestion needed)
From: The Hermit Hacker
On Thu, 13 Jan 2000, Bruce Momjian wrote:

> > On Thu, 13 Jan 2000, Don Baccus wrote:
> >
> > > My site's still in the experimental stage, being used by a couple
> > > dozen folks to record bird distribution data in the Pacific NW, so
> > > I don't personally have real-world data to get a feeling for how
> > > important this might become. Still, Oracle DBA docs talk a lot
> > > about it so in some real-world scenarios being able to distribute
> > > tables and indices on different spindles must pay off.
> >
> > What would it take to break the data/base/<database> directory down? To
> > something like, maybe:
> >
> >     data/base/<database>/pg_*
> >                         /tables/*
> >                         /indices/*
>
> And put sort and large objects somewhere separate too.

Why not? By default, on one drive, it would make no difference except for file layout, but it would *really* give room to expand...

Right now, the udmsearch database contains (approx):

tables:
    10528 dict10     5088 dict11     2608 dict12     3232 dict16
    64336 dict2     47960 dict3      3096 dict32    65952 dict4
    42944 dict5     36384 dict6     34792 dict7     21008 dict8
    14120 dict9     31912 url

indexes:
     5216 url_id10   2704 url_id11   1408 url_id12   1648 url_id16
    36440 url_id2   27128 url_id3    1032 url_id32  37416 url_id4
    22600 url_id5   19096 url_id6   18248 url_id7   10880 url_id8
     6920 url_id9    6464 word10     3256 word11     1672 word12
     2280 word16    26344 word2     21200 word3      2704 word32
    28720 word4     21880 word5     19240 word6     18464 word7
    11952 word8      8864 word9

If tables/indexes were in different subdirectories, it would be easy for me, at some point in the future, to take just the tables directory and put it on its own dedicated drive, halving the space used on either drive...

I don't know... IMHO, it sounds like the simplest solution that provides the multi-spindle benefits ppl are suggesting...
Re: Multiple Spindles (Was: Re: [HACKERS] [hackers] development suggestion needed)
From: Don Baccus
At 12:31 AM 1/14/00 -0400, The Hermit Hacker wrote:
>This would give you the ability to put all tables from all databases onto
>one file system, and all indexes onto another, and all system files onto a
>third...
>
>I don't know, I'm oversimplifying and spewing thoughts out
>again...but...*shrug*

This kind of hack would certainly be doable, but I guess the question arises once again - is PostgreSQL shooting to be the Big Time or not? I mean, the mere use of words like "Bronze Support" and "Silver Support" reminds one of Oracle :)

This particular issue of the placement of tables and indices pales in importance compared to outer joins, for instance. But once all the big things are implemented, folks doing BIG JOBS will look at Postgres as a viable alternative. And they'll then be disappointed if they don't have the kind of control over files that they do with Oracle or other big-time commercial DBs... The folks talking about replication are coming from the same space, though much more ambitiously (since a simple tablespace feature could be simple, while I can't think of any simple replication hacks).

I'd like to see Postgres succeed in a big way. I don't see it toppling Oracle, but heck, I can't see why Interbase can't be ground into dust. Open Source, great functionality, maybe B+ on scalability etc. (thus not toppling Oracle but equal to most others)... that's not too ambitious a goal, is it?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 01:34 PM 1/14/00 +0900, Hiroshi Inoue wrote:
>I agree with Tatsuo though I prefer
> create table t1 (i int) tablespace foo;
>.
>Isn't it preferable to encapsulate the table location and storage type ?

Encapsulation is almost always preferable, but I don't think Tatsuo was saying otherwise, merely pointing out a clever trick that I hadn't thought of (I didn't realize that "/foo" would be rooted rather than just put under PGDATA, in fact I stay away from quoted non-SQL non-"normal language" identifiers altogether, being somewhat interested in portability of my application code). And it is a clever trick... obvious to an insider, but not to me.

>At first, a tablespace would only correspond to a directory and it won't
>be so difficult to implement. But we would gain a lot with the feature.
>In the future, the tablespace may be changed to mean real(??)
>tablespace.

I think the decision to keep the mechanism for providing separate storage managers was a good one (if I understand correctly that there was once consideration of removing it). Even if it is never used in mainstream Postgres, some specialty application may use it, one of the potentials that comes from open source software.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 10:18 PM 1/13/00 -0500, Bruce Momjian wrote:
>> Oracle tables and indices within a single tablespace all live in
>> one file (if you're using filesystem rather than raw I/O), so
>> they also provide features which allow you to specify how big
>> a chunk to allocate per extent (Oracle pre-allocates to avoid
>> running out of disk space while you're running except in ways
>> that you control, and in hopes of getting contiguous chunks of
>> disk storage because they hope you're using previously empty
>> disks used only for Oracle).
>
>And with data and index in the same file, you can't split them across
>spindles.

Which is why folks define more than one tablespace, eh? Something you can't do in PG... Perhaps I didn't make it clear that you can define as many tablespaces as you want? And freely assign any table or index to any tablespace you want? Given my statement above, it is clear you can:

1. Coalesce indices and tables into a single tablespace (the default) if you only define one (again, more or less how Oracle defaults, though I sorta forget because I'm not an Oracle stud and no one in their right mind allows Oracle to set defaults, because they're always wrong)

-or-

2. At the other extreme, you can define as many tablespaces as you have tables and indices, and each can live in their own separate tablespace (i.e. spindle, if that is what you want to do).

-or-

3. Set yourself up at any point between either extreme, according to your own needs.

I don't think it's that difficult to understand, is it?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 12:21 AM 1/14/00 -0400, The Hermit Hacker wrote:
>All the major OSs out there have "disk management tools" that allow you to
>build "large file systems" out of smaller ones... Solaris has DiskSuite,
>FreeBSD has vinum, Linux has ??... why are we looking/investigating adding
>a level of complexity to PostgreSQL to handle something that, as far as I
>know, each of the OSs out there already has a way of dealing with?

If the resident OS can handle the issue, sure, it is worth investigating. Linux today does not (at least, the one I'm running).

One godliness-over-utility issue is the fact that doing such things in the operating system ("ln" also works...) kinda violates the RDBMS ideal of having everything about a database, including metadata, stored in the database. In the case of "CREATE TABLESPACE" having Postgres handle placement puts the burden of operating system specifics where it belongs - on the implementation. This is why we say things like "integer" or "numeric", come to think of it... The word "portability" comes to mind, though of course things like spindle numbers and the like are extremely variable.

>Some aren't necessarily mature yet...Solaris's is the only one that I'm
>aware of that has a *beautiful* growfs program that allows you to add a
>new drive to an existing "pack" and grow the file system into that new
>drive while the system is live...but the utilities are there...
>
>I think the major problem that I'm worried about isn't spreading tables
>across drives, but its when that *one* table grows to the point that its
>about to overflow my drive...I'd rather add a 9gig drive on, make it an
>18gig file system, and let it continue to grow...

These aren't mutually exclusive problems, which is one reason why Oracle allows you to control things so minutely.
I think they let you spill a table onto multiple drives, though I'd have to look at one of my manuals hidden in my piles of more interesting things (I'm no Oracle expert, I just read manuals :)

There is definitely a sort of tension between the operating systems, which continue to grow in capability such as you're pointing out, and commercial systems like Oracle that have to work TODAY regardless of where operating systems sit on the capability yardstick. Thus Oracle includes built-in mirroring, while today under Linux you might as well do software RAID 1, or you can buy a hardware device that does RAID 1 behind your back and looks like a single drive no matter how many platters you stuff it with, etc etc.

So...you'll never hear me argue that Postgres should include a mirroring storage manager. There is no longer the need for an application to do it on its own in the supported OS space (hmmm...I'm assuming FreeBSD is there, too, right?) So maybe the notion of application placement of files on particular spindles is becoming obsolete, too. It isn't today on Linux...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )
From
Don Baccus
Date:
At 11:41 PM 1/13/00 -0500, Bruce Momjian wrote:
>And put sort and large objects somewhere separate too.

Yeah, very very good point! Though toasted or roasted or toasty or whatever large objects (I unsubscribed for a few crucial days when I was back east travelling around with my girlfriend) might make the large-object issue less important?

Also, for sorting, many sites will just load down with RAM, and this will increase rapidly over the next few years (despite current SDRAM high prices and the whole RDRAM fiasco). But really really big sites might appreciate such a feature...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )
From
Don Baccus
Date:
At 12:59 AM 1/14/00 -0400, The Hermit Hacker wrote:
>if tables/indexes were in different subdirectories, it would be too easy
>for me, at some point in the future, to take just the tables directory and
>put them on their own dedicated drive, halving the space used on either
>drive...
>
>I don't know...IMHO, it sounds like the simplest solution that provides
>the multi-spindle benefits ppl are suggesting...

Splitting tables/indexes seems to be the first-order optimization, from my talking to folks who are far more experienced with databases than I (I did mention I wrote my first query less than a year ago, didn't I?)

Still...encapsulation within the RDBMS itself seems to be in the spirit of what RDBMS's are all about...such encapsulation could be expressed in very simple external form and still be useful, but I think it should be encapsulated...

Among other things, if CREATE TABLESPACE were dumped by pg_dump, I could move from V7.0 to V8.0 and beyond without having to rebuild my distribution structure by hand :)

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
>
> xun@cs.ucsb.edu (Xun Cheng) writes:
> > I want to experiment with some new fast join algorithms.
>
> Cool. Welcome aboard!
>
> > Could anyone tell me if
> > the directory /docs/pgsql/src/backend/executor is the
> > right place to start
>
> The executor is only half the problem: you must also teach the
> planner/optimizer how and when to use the new join type.
>
> Hiroshi Inoue has recently been down this path (adding support
> for TID-based scans), and might be able to give you more specific
> advice.

Hmm, un(??)fortunately, I didn't have to understand details about join to implement Tidscan. It's a basic scan and is used to scan one relation. I don't know the documentation well either, sorry. AFAIK, Tom is decidedly superior to me in understanding the planner/optimizer.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
At 10:43 PM 1/13/00 -0500, Mike Mascari wrote:
>For example, if several disk drives are available to store the
>database, it might be helpful to store table data in a tablespace
>on one disk drive, and index data in a tablespace on another disk
>drive.

My gosh, imagine - I didn't just make this up! The amazement :)

> This way, when users query table information, both disk
>drives can work simultaneously, retrieving table and index data
>at the same time.

Overlapped Disk I/O - I remember that term from way back in my (gulp) PDP-8 days, working as a system hacker...

Look - in case there's any doubt, I'm not trying to toast Postgres, I'm a fan, interested in getting more involved in the development scenario. I raised this issue because Xun raised some "really big database issues which I as a database theorist have an interest in". My biggest sin if any is to try to paint the horizon, at this point.

Philip Greenspun still says that those of us (including employee #3 or so of his company, Ben) who are interested in Postgres are "losers" by definition (Ben no longer works there). Meanwhile, folks like Ben and I keep telling folks that Postgres is growing into an ideal RDBMS for database-backed websites (you know, that place where all money is flowing and will continue to do so tomorrow, though don't ask me about next week? :) And Philip says you're a loser if you won't pay Oracle's license fee. He speaks as a dude badly bitten by Illustra, based long ago on a long-dead version of Postgres but the pain not yet forgotten...

Things like the Oracle documentation cited above fall into the class of advice to folks running really big - and REALLY BUSY - database servers. Sure, hardware (cycles, RAM) falls in price and as time goes on we can perhaps forget some of the more horrific optimization stuff that was invented to deal with small computer systems of one decade ago. As a compiler writer, trust me - I'm familiar with the concept.
And with changing paradigms as designs flow from CISC to RISC (oh gosh, not a theoretical advantage but you mean just a cost/performance point on the transistor-per-chip curve? Damn, I should've patented my cynicism 10 years ago!) and back to post-RISC CISC, I'm not about to claim theoretical long-term advantages for any point-of-view. I won't suggest that all of the big-time hacks employed by old commercial DBs like Oracle are necessary in today's hardware/OS climate (raw I/O in particular falls into that category, IMH-tentative-O)

But, still...as long as we've had movable head disk drives (and my first disk drive was head-per-track, believe it or not) minimizing head movement has been an optimization goal. I've written complex scheduling disk drivers in the past, and they can be good. Still, nothing beats coalescing one spindle's I/O into a narrow number of tracks, minimizing head movement. That's a constant that hasn't changed for 30 years, and won't change next week. Heck, it even lowers your downtime due to blown drives.

I might also add that the kind of systems Oracle doc writers were thinking of 10 years ago just aren't in the Postgres universe of possible DB clients... But, it is changing. One impact - like it or not - of the good work you folks have done over the past couple of (3 or 4 or I'm not personally sure how much) years and the fact that you continue to push the db into being more and more capable, more and more stable, more and more feature-filled with SQL-92 stuff is that folks being asked to pay $25,000 for a fully-paid up license on a PIII500 X86 system (<$2000 in hardware without even shopping, a greater than 10-1 software to hardware ratio) are going to be looking for a cheaper alternative. Of which you folks are one.

So, what's the deal, here...is the goal the Big Time or not?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
Bruce Momjian <pgman@candle.pha.pa.us> writes: > ... > Or we could continue to use symlinks, and just create them ourselves in > the backend. But you'd still need some built-in understanding about where the table is Really Supposed To Be, because you'd need to be able to create and delete the symlinks on the fly when the table grows past a 1-Gb segment boundary (or is shrunk back again by vacuum!). AFAICS, a reasonable solution still requires storing a location path for each table --- so you might as well just use that path directly. regards, tom lane
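Tom's point about segment boundaries can be made concrete. A toy sketch, with invented names (`make_segment_symlink` is not a real backend function): the backend would have to repeat this dance every time a relation crosses a 1-GB segment boundary, which is exactly why a stored location path is needed whether or not symlinks are used on top of it.

```c
/* Illustrative sketch only -- not PostgreSQL source. Creates segment
 * number `segno` of a relation at its real location (the "tablespace"
 * directory) and symlinks it from the default data directory, which is
 * the bookkeeping a symlink-based scheme would need to do on the fly. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

static int
make_segment_symlink(const char *datadir, const char *tablespace,
                     const char *relname, int segno)
{
    char real[1024];
    char lnk[1024];
    FILE *f;

    if (segno == 0) {
        snprintf(real, sizeof(real), "%s/%s", tablespace, relname);
        snprintf(lnk, sizeof(lnk), "%s/%s", datadir, relname);
    } else {
        /* segment files are named <relname>.1, <relname>.2, ... */
        snprintf(real, sizeof(real), "%s/%s.%d", tablespace, relname, segno);
        snprintf(lnk, sizeof(lnk), "%s/%s.%d", datadir, relname, segno);
    }

    f = fopen(real, "w");          /* the segment file at its real home */
    if (f == NULL)
        return -1;
    fclose(f);
    return symlink(real, lnk);     /* the pointer the backend must maintain */
}
```

And the inverse (unlink on vacuum shrink) would be needed too, which is the half of the problem that makes "just use ln by hand" fragile.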
Re: Multiple Spindles ( Was: Re: [HACKERS][hackers]development suggestion needed )
From
Mike Mascari
Date:
Don Baccus wrote: > I'd like to see Postgres succeed in a big way. I don't see it toppling > Oracle, but heck I can't see why Interbase can't be ground into dust. > Open Source, great functionality, maybe B+ on scalability etc (thus not > toppling Oracle but equal to most others) ... that's not too ambitious > a goal, is it? > Shoot for the sky you hit the eagle, shoot for the eagle you hit the ground.... Mike Mascari
> Look - in case there's any doubt, I'm not trying to toast Postgres,
> I'm a fan, interested in getting more involved in the development
> scenario. I raised this issue because Xun raised some "really big
> database issues which I as a database theorist have an interest in".
> My biggest sin if any is to try to paint the horizon, at this point.
> Philip Greenspun still says that those of us (including employee
> #3 or so of his company, Ben) who are interested in Postgres are "losers"
> by definition (Ben no longer works there). Meanwhile, folks like Ben

I thought Phil was a big fan of ours.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > ...
> > Or we could continue to use symlinks, and just create them ourselves in
> > the backend.
>
> But you'd still need some built-in understanding about where the table
> is Really Supposed To Be, because you'd need to be able to create and
> delete the symlinks on the fly when the table grows past a 1-Gb segment
> boundary (or is shrunk back again by vacuum!).
>
> AFAICS, a reasonable solution still requires storing a location path
> for each table --- so you might as well just use that path directly.

Makes sense. The only advantage to symlinks is that you could use that information in places you need it, and for normal access use the symlinks. You wouldn't have to carry around that info as much.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Re: Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )
From
Bruce Momjian
Date:
> I'd like to see Postgres succeed in a big way. I don't see it toppling
> Oracle, but heck I can't see why Interbase can't be ground into dust.
> Open Source, great functionality, maybe B+ on scalability etc (thus not
> toppling Oracle but equal to most others) ... that's not too ambitious
> a goal, is it?

We don't block people from working on items. However, we do try and set priorities based on the open items we have. I hope we don't appear heavy-handed in this regard.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> >You don't get to do either of the latter two unless you write a
> >raw-device storage manager
>
> Not within a single filesystem, but scattering things across spindles
> could be done without a raw-device storage manager :)

Yes, but have you seen how cheap RAID arrays have become? I know disks are getting bigger as well, and many people will opt for a single disk, but there may be more urgent things to fix than something for which a hardware solution already exists. And let's face it: a database ought to be on RAID anyway, unless somebody wants to write Tandem-style mirrored disks.... ;-)

Adriaan
On Thu, 13 Jan 2000, Don Baccus wrote: > Date: Thu, 13 Jan 2000 18:19:33 -0800 > From: Don Baccus <dhogaza@pacifier.com> > To: Tom Lane <tgl@sss.pgh.pa.us> > Cc: Xun Cheng <xun@cs.ucsb.edu>, pgsql-hackers@postgreSQL.org > Subject: Re: [HACKERS] [hackers]development suggestion needed > ...skipped... > The selects that such sites spew forth are handled wonderfully > by Postgres now, with MVCC and the change that stops the update > of pg_log after read-only selects. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Just curious, Does plain 6.5.3 handle read-only selects in this way ? Regards, Oleg > > - Don Baccus, Portland OR <dhogaza@pacifier.com> > Nature photos, on-line guides, Pacific Northwest > Rare Bird Alert Service and other goodies at > http://donb.photo.net. > > ************ > _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Don Baccus writes:
> At 12:21 AM 1/14/00 -0400, The Hermit Hacker wrote:
> >All the major OSs out there have "disk management tools" that allow you to
> >build "large file systems" out of smaller ones... Solaris has DiskSuite,
> >FreeBSD has vinum, Linux has ??... why are we looking/investigating adding
> >a level of complexity to PostgreSQL to handle something that, as far as I
> >know, each of the OSs out there already has a way of dealing with?
>
> If the resident OS can handle the issue, sure, it is worth investigating.
> Linux today does not (at least, the one I'm running).

Linux has software raid (often called "md") with most of the usual bells and whistles (RAID0, RAID1, RAID5, RAID0+1, hot spares, background reconstruction). You can patch in LVM (logical volume management) although the distributions of which I am aware don't ship it ready-patched. That's the equivalent of Digi^H^H^H^HTru64 UNIX LSM, and AIX and so on have similar things. Basically, join together physical disk units into logical block devices with additions being possible on the fly.

If you put an ext2 filesystem on one of those, then you can dynamically resize it with e2resize, although that is not completely production quality yet and last I heard you could currently only increase the filesystem size on the fly and not decrease it. ISTR the competition tend only to allow increase and not shrink but the ext2 one is designed to allow shrink too. The complexity isn't so much in the basics ("simply" throw in more block groups and be careful about atomicity if the system is live); it's in stuff like making sure that the system is robust against fragmentation, goal allocation needs tweaking (I think) and how you gather together admin information about where all the bits are. If you break apart the separate disks of a live filesystem, it's nice to know where all the bits go.
Even with all that underlying stuff, it's *still* important for higher level configuration at the database level to be possible. Even if from the theoretical point of view it's all one big page space, it matters a great deal in practice to be able to spread different bits out over different table spaces/volumes/files/block devices/whatever. I think that means I'm in violent agreement with you on the db side but this reply does give me a chance to point out that Linux isn't missing the functionality you haven't noticed in it :-). --Malcolm -- Malcolm Beattie <mbeattie@sable.ox.ac.uk> Unix Systems Programmer Oxford University Computing Services
On Thu, 13 Jan 2000, Don Baccus wrote:
> So, what's the deal, here...is the goal the Big Time or not?

If it means adopting one vendor's concept of what the world should look like...not.

I *hate* the way Oracle sets up tablespaces ... where i have to pre-guess the size of my data and allocate space accordingly...what if my table never does reach that critical mass? I've just wasted X meg of space ... I hate the way that Oracle starts up something like 4 processes for every database, when the server is started...

I think the route we've taken since day one, and hope that we continue along it...we've created "hacks" along the way to appease users of the different vendors for SQL related issues, but our underlying structure has stayed, I think, pretty unique... We haven't been designing a FreeOracle...we've been working on and designing a Free, full featured RDBMS that follows the specifications laid down ... with our own PostgreSQLism's thrown in here and there ... If that happens to follow what one vendor happens to have done as far as their implementation, great...but there has been no conscious effort to do so that I'm aware of... Just look at the whole OUTER JOIN issue ...

*shrug* I *like* the fact that we come up with original ideas/solutions to problems, that make use of *existing* technologies ... I liked the thread about moving indexes and tables to separate file systems, and hope we can implement something that will make it something that doesn't require 'ln's, but I definitely don't like Oracle's way of doing things ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Oleg Bartunov <oleg@sai.msu.su> writes: >> The selects that such sites spew forth are handled wonderfully >> by Postgres now, with MVCC and the change that stops the update >> of pg_log after read-only selects. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Does plain 6.5.3 handle read-only selects in this way ? AFAIR that logic is in 6.5.*. (Vadim would remember better, since he put it in.) But keep in mind that a SELECT is read-only just to the extent that it is hitting previously committed tuples. The first visit to a newly committed-good or newly committed-dead tuple will cause an update and write-back of the tuple's status flags --- whether that visit happens in SELECT or anything else. It occurs to me that the no-log-update logic could probably be improved on. The test to see whether a log update is needed looks at whether any buffers have been written. A SELECT that marks someone else's tuples as known-committed will look like it needs to be committed in pg_log ... but it doesn't really need it. Perhaps Vadim is planning to fix this in the WAL rewrite. regards, tom lane
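Tom's observation boils down to distinguishing two reasons a buffer can be dirty. A toy model with invented names, not backend code, just to make the distinction explicit:

```c
/* Toy illustration of the pg_log commit-record question -- invented
 * names, not PostgreSQL source. A transaction's buffers can be dirty
 * because it wrote tuples, or merely because it flipped commit-status
 * hint flags on *other* transactions' tuples while reading. The
 * "any buffer dirtied" test forces a commit record in both cases;
 * the refinement Tom suggests would skip the hint-bit-only case. */
#include <stdbool.h>
#include <assert.h>

typedef struct XactBufferState {
    bool wrote_data;   /* this xact inserted/updated/deleted tuples */
    bool set_hints;    /* only marked others' tuples known-committed */
} XactBufferState;

/* what the described 6.5-era test effectively does */
static bool
needs_commit_record_old(const XactBufferState *s)
{
    return s->wrote_data || s->set_hints;   /* "any buffer dirtied" */
}

/* the suggested improvement: ignore hint-bit-only dirtying */
static bool
needs_commit_record_new(const XactBufferState *s)
{
    return s->wrote_data;
}
```

For a pure SELECT that merely set hint bits, the old test says a pg_log update is needed and the refined test says it isn't, which is the whole improvement being discussed.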
> I thought Phil was a big fan of ours.

He wants to be. But IMHO we are not yet worthy :))

- Thomas

-- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
At 12:04 PM 1/14/00 +0300, Oleg Bartunov wrote: >> The selects that such sites spew forth are handled wonderfully >> by Postgres now, with MVCC and the change that stops the update >> of pg_log after read-only selects. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >Just curious, >Does plain 6.5.3 handle read-only selects in this way ? Yes. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
At 10:35 AM 1/14/00 +0000, Malcolm Beattie wrote:
>Linux has software raid (often called "md") with most of the usual
>bells and whistles (RAID0, RAID1, RAID5, RAID0+1, hot spares,
>background reconstruction).

Yes, I know.

> You can patch in LVM (logical volume
>management) although the distributions of which I am aware don't ship
>it ready-patched.

Right.

>I think that means I'm in violent agreement with you on the db side
>but this reply does give me a chance to point out that Linux isn't
>missing the functionality you haven't noticed in it :-).

Linux also has a journaling filesystem available, if you're brave and wanting to be at the bleeding edge. That's sort of like using next week's Postgres development sources in yesterday's production environment, though - a bit risky :)

I should've made it clear that I was talking about the current, widespread releases that I'm familiar with. RedHat, in particular. The RAID stuff is out-of-the-box supported today, fancier stuff like the journaling file-system will be out-of-the-box supported before too much longer. It's exciting to see Linux improve steadily just as it's exciting to see Postgres improve steadily.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 10:11 AM 1/14/00 -0500, Tom Lane wrote:
>Oleg Bartunov <oleg@sai.msu.su> writes:
>>> The selects that such sites spew forth are handled wonderfully
>>> by Postgres now, with MVCC and the change that stops the update
>>> of pg_log after read-only selects.
>> Does plain 6.5.3 handle read-only selects in this way ?
>
>AFAIR that logic is in 6.5.*. (Vadim would remember better, since he
>put it in.)

It is. I'd notice right away if it wasn't, the decibel level on my little database server would go 'way up because it went 'way down when I applied the patch to my 6.5 beta. It sits six inches from me so I'd know for sure!

>It occurs to me that the no-log-update logic could probably be improved
>on. The test to see whether a log update is needed looks at whether any
>buffers have been written. A SELECT that marks someone else's tuples as
>known-committed will look like it needs to be committed in pg_log
>... but it doesn't really need it. Perhaps Vadim is planning to fix
>this in the WAL rewrite.

No idea if he is or isn't, but the patch is very simple and is based on whether or not buffers got dirty, not whether or not the select itself changed anything, IIRC.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 07:58 AM 1/14/00 +0000, Adriaan Joubert wrote:
>>> You don't get to do either of the latter two unless you write a
>>> raw-device storage manager
>>
>> Not within a single filesystem, but scattering things across spindles
>> could be done without a raw-device storage manager :)
>
>Yes, but have you seen how cheap RAID arrays have become? I know disks are
>getting bigger as well, and many people will opt for a single disk, but
>there may be more urgent things to fix than something for which a hardware
>solution already exists. And let's face it: a database ought to be on RAID
>anyway, unless somebody wants to write Tandem-style mirrored disks.... ;-)

Don't need to write Tandem-style mirrored disks, the Linux kernel implements mirrored file systems for me. I can mirror UW2 disks in software for $189/spindle (current cost of an IBM Deskstar UW2 7200 RPM 4.5 GB spindle here in Oregon), the fancier RAID arrays still aren't that cheap.

The cheapest RAID interfaces just hide the mirroring from you. There's a tier up that take a cluster of mirrored (or RAID 5'd) platters and present them to you as a single large disk - these remove a lot of one's control over spindle placement, sure. My guess is that some folks don't view this as a plus...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: Multiple Spindles ( Was: Re: [HACKERS] [hackers]development suggestion needed )
From
Don Baccus
Date:
At 01:16 AM 1/14/00 -0500, Bruce Momjian wrote:
>We don't block people for working on items. However, we do try and set
>priorities based on the open items we have. I hope we don't appear
>heavy-handed in this regard.

Not at all. And though I've triggered this discussion, I'd be the first to agree it is minor in importance compared to things like outer joins, the WAL implementation, and Jan's large object work.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 11:05 AM 1/14/00 -0400, The Hermit Hacker wrote:
>On Thu, 13 Jan 2000, Don Baccus wrote:
>
>> So, what's the deal, here...is the goal the Big Time or not?
>
>If it means adopting one vendor's concept of what the world should look
>like...not.
>I *hate* the way Oracle sets up tablespaces ... where i have to pre-guess
>the size of my data and allocate space accordingly...what if my table
>never does reach that critical mass? I've just wasted X meg of space ...

And I'm not suggesting that anything like this level of (as I described it in my previous note) anal retentive control be implemented. Anal retentive IT managers won't be happy unless they're paying Oracle $25/power unit anyway. But being able to spread tables and indices around several spindles would improve scalability. I think the very simple approach that's been kicked around would work for anyone we care about (me! :)

I mentioned the Oracle details in part because it's not clear to me how much folks here know about Oracle. I don't know all that much, only enough to know that any database that initializes its defaults to useless values is a pain in the ass in ways that customers shouldn't put up with. I don't understand Oracle's approach, there, seducing you into letting it build a default installation which is then virtually useless.

>We haven't been designing a FreeOracle...

I'm certainly not arguing for this...remember, I did argue against "(+)" in favor of SQL 92 outer joins :)

>I liked the thread about moving indexes and tables to seperate file
>systems, and hope we can implement something that will make it something
>that does't require 'ln's, but I definitely don't like Oracle's way of
>doing things ...

I agree...I was simply providing a datapoint, not suggesting it was one Postgres should emulate.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
At 01:22 AM 1/14/00 -0500, Bruce Momjian wrote:
>> Look - in case there's any doubt, I'm not trying to toast Postgres,
>> I'm a fan, interested in getting more involved in the development
>> scenario. I raised this issue because Xun raised some "really big
>> database issues which I as a database theorist have an interest in".
>> My biggest sin if any is to try to paint the horizon, at this point.
>> Philip Greenspun still says that those of us (including employee
>> #3 or so of his company, Ben) who are interested in Postgres are "losers"
>> by definition (Ben no longer works there). Meanwhile, folks like Ben
>
>I thought Phil was a big fan of ours.

He's moderated his opinion considerably over the past several months. To some extent you might say he's had his opinion moderated for him. Feel free to extrapolate :)

(not just me, or even primarily me, but folks like Ben Adida who've worked with Philip at MIT and then Ars Digita are deeply interested in seeing a successful version of the Ars Digita toolkit based on Postgres, Ben also coordinates AOLserver releases for Ars Digita/MIT)

Still, as recently as six months ago Philip flamed some English gent in public for suggesting an ACS port to Postgres or another free or cheap RDBMS, and went on to e-mail the guy nasty notes in private. Or so the gent sez.

I know Philip was surprised and impressed by the great leap forward embodied by 6.5. As was I - I'd given up on 6.4. But Philip is mostly concerned with the clients that feed his very rapidly growing company, and while he'll release his toolkit sources still tells folks you really need Oracle. His most recent criticism of Postgres shrunk to two items (no referential integrity, no outer joins), one of which has disappeared in current sources. Apparently he doesn't know how weak large object support is, I won't tell him, either...
Anyway, this isn't about Philip's opinions so much as the fact that
Postgres has had a very spotty reputation in the past, but is improving
so quickly and predictably that its reputation is also steadily
improving. He serves as an example of someone who's convinced that
Postgres has greatly improved but remains skeptical that it's improved
enough to do serious work with.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
Tom Lane wrote:
>
> xun@cs.ucsb.edu (Xun Cheng) writes:
> > I want to experiment with some new fast join algorithms.
>
> Cool. Welcome aboard!
>
> > Could anyone tell me if
> > the directory /docs/pgsql/src/backend/executor is the
> > right place to start
>
> The executor is only half the problem: you must also teach the
> planner/optimizer how and when to use the new join type.
>
> Hiroshi Inoue has recently been down this path (adding support
> for TID-based scans), and might be able to give you more specific
> advice.
>
> > 1. Does postgresql do raw storage device management or it relies
> > on file system? My impression is no raw device. If no,
> > is it difficult to add it and possibly how?
>
> Postgres uses Unix files. We have avoided raw-device access mostly on
> grounds of portability. To persuade people that such a change should go
> into the distribution, you'd need to prove that *significantly* better
> performance is obtained with raw access. I for one don't think it's a
> foregone conclusion; Postgres gets considerable benefit from sitting
> atop Unix kernel device schedulers and disk buffer caches.
>
> As far as the actual implementation goes, the low level access methods
> go through a "storage manager" switch that was intended to allow for
> the addition of a new storage manager, such as a raw-device manager.
> So you could get a good deal of stuff working by implementing code that
> parallels md.c/fd.c. The main problem at this point is that there is a
> fair amount of utility code that goes out and does its own manipulation
> of the database file structure. You'd need to clean that up by pushing
> it all down below the storage manager switch (inventing new storage
> manager calls as needed).
>
> > that the available join algos implemented are nested loop
> > join (including index-based), hash join (which one? hybrid),
> > sort-merge join?
>
> Right.
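Two of the join strategies Tom confirms (nested loop and sort-merge) can be illustrated with a toy sketch. This is illustrative Python, not the PostgreSQL executor code; the relation layout as lists of (key, payload) tuples is my own assumption.

```python
# Toy sketch of two join strategies over lists of (key, payload) tuples.
# Not PostgreSQL code - just the textbook algorithms Tom names.

def nested_loop_join(R, S):
    """O(|R| * |S|): compare every pair of tuples."""
    return [(rk, rv, sv) for rk, rv in R for sk, sv in S if rk == sk]

def sort_merge_join(R, S):
    """Sort both inputs on the join key, then merge with a cursor on S."""
    R, S = sorted(R), sorted(S)
    out, i = [], 0
    for rk, rv in R:
        # Advance the S cursor past keys smaller than the current R key.
        while i < len(S) and S[i][0] < rk:
            i += 1
        # Emit all S tuples matching this key (handles duplicate keys in S).
        j = i
        while j < len(S) and S[j][0] == rk:
            out.append((rk, rv, S[j][1]))
            j += 1
    return out

R = [(1, 'a'), (2, 'b'), (2, 'c')]
S = [(2, 'x'), (3, 'y'), (2, 'z')]
assert sorted(nested_loop_join(R, S)) == sorted(sort_merge_join(R, S))
```

The index-based variant of the nested loop that Tom mentions replaces the inner scan of S with an index probe; the outer-loop structure stays the same.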
> The hash join uses batching if it estimates that the relation
> is too large to fit in memory; is that what you call "hybrid"?

I've heard the word "hybrid" used for a scheme where you hash each key
of a multi-key index separately and then concatenate the hashes for the
index. That way you can use the index for accessing subsets of the keys
as well, by examining only the buckets with matching hash sections.

Does postgres do it even when generating the keys? I'd guess it does,
as each hashable type has a hashing function. OTOH pg probably does not
use it for finding by the 3rd field of an index?

--------
Hannu
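The concatenated-hash scheme Hannu describes can be sketched as follows. This is a hypothetical illustration, not anything in the PostgreSQL source; the section width and hash function are my own choices.

```python
# Toy sketch of a concatenated multi-key hash: each key column is hashed
# into a fixed-width section, and the sections are concatenated. A probe
# on a subset of the key columns compares only the matching sections.
import zlib

SECTION_BITS = 8  # width of each per-key hash section (assumption)
MASK = (1 << SECTION_BITS) - 1

def section_hash(value):
    # Deterministic stand-in for the type-specific hash functions
    # Hannu mentions.
    return zlib.crc32(repr(value).encode()) & MASK

def composite_hash(keys):
    """Concatenate one hash section per key column."""
    h = 0
    for k in keys:
        h = (h << SECTION_BITS) | section_hash(k)
    return h

def matches(bucket_hash, nkeys, query):
    """query: dict {key position -> value}; compare only those sections."""
    for pos, value in query.items():
        shift = SECTION_BITS * (nkeys - 1 - pos)
        if (bucket_hash >> shift) & MASK != section_hash(value):
            return False
    return True

h = composite_hash([1, 2, 3])
assert matches(h, 3, {0: 1})          # probe on a key prefix
assert matches(h, 3, {1: 2, 2: 3})    # probe on a key subset
```

As the sketch shows, any subset of key positions can be checked, which is what distinguishes this from hashing the whole concatenated key at once.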
First I want to thank you for all the help I got on my earlier posting.

>> Right. The hash join uses batching if it estimates that the relation
>> is too large to fit in memory; is that what you call "hybrid"?
>
>I've heard the word "hybrid" being used of a scheme where you hash each
>key of a multi-key index separately and then concatenate the hashes for
>the index. That way you can use the index for accessing also subsets of
>keys by examining only the buckets with matching hash sections.

In research, there are traditionally three kinds of hash joins: simple
hash, grace hash, and hybrid hash. Hybrid hash is generally considered
to have the best performance, since it is designed to combine the best
behavior of simple hash and grace hash. It has two phases. In the
first, the relations are read, hashed into buckets, and written out, as
in grace hash. However, during this phase a portion of the memory is
reserved for an in-memory hash bucket for R (R is joining with S, and R
is the smaller relation). This bucket of R is never written to disk.

xun
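Xun's description of hybrid hash can be sketched in a few lines. This is a toy illustration under my own assumptions (plain lists stand in for temp files, Python's built-in hash partitions the keys), not the code in nodeHash.c.

```python
# Toy sketch of a hybrid hash join: partition both relations by hash,
# but keep batch 0 of the smaller relation R resident in memory, so its
# tuples are joined on the fly and never written out.

def hybrid_hash_join(R, S, nbatches=3):
    # Phase 1a: partition R; batch 0 stays in memory as a hash table.
    memory = {}                        # in-memory bucket for batch 0 of R
    r_batches = [[] for _ in range(nbatches)]
    for key, val in R:
        b = hash(key) % nbatches
        if b == 0:
            memory.setdefault(key, []).append(val)
        else:
            r_batches[b].append((key, val))   # "written to disk"

    out = []
    s_batches = [[] for _ in range(nbatches)]
    # Phase 1b: partition S; batch-0 tuples probe the memory table now,
    # which is the part grace hash would defer to a second pass.
    for key, val in S:
        b = hash(key) % nbatches
        if b == 0:
            out.extend((key, rv, val) for rv in memory.get(key, []))
        else:
            s_batches[b].append((key, val))

    # Phase 2: join the spilled batches pairwise, as in grace hash.
    for b in range(1, nbatches):
        table = {}
        for key, val in r_batches[b]:
            table.setdefault(key, []).append(val)
        for key, val in s_batches[b]:
            out.extend((key, rv, val) for rv in table.get(key, []))
    return out

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(1, 'x'), (3, 'y'), (4, 'z')]
assert sorted(hybrid_hash_join(R, S)) == [(1, 'a', 'x'), (3, 'c', 'y')]
```

Simple hash corresponds to keeping everything in memory; grace hash corresponds to spilling every batch; hybrid keeps exactly as many R tuples in memory as the workspace allows, which is the combination Xun describes.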
Xun Cheng <xun@cs.ucsb.edu> writes:
> In research, there are traditionally three kinds of hash joins:
> simple hash, grace hash and hybrid hash. Hybrid is generally considered
> to be having a better performance since it is designed to combine
> the best behavior of simple hash and grace hash.
> It has two phases. In the first the relations are read, hashed into
> buckets, and written out, as in grace hash. However, during this phase
> a portion of the memory is reserved for an in-memory hash bucket for R (
> R is joining with S and R is smaller). This bucket of R will never
> be written to disk.

Yes, that's how nodeHash.c does it...

			regards, tom lane