Thread: Equivalent praxis to CLUSTERED INDEX?
Coming from the MSSQL world, I'm used to the first step in optimization to be, choose your clustered index and choose it well. I see that PG has a one-shot CLUSTER command, but doesn't support continuously-updated clustered indexes. What I infer from newsgroup browsing is, such an index is impossible, given the MVCC versioning of records (happy to learn I'm wrong). I'd be curious to know what other people, who've crossed this same bridge from MSSQL or Oracle or Sybase to PG, have devised, faced with the same kind of desired performance gain for retrieving blocks of rows with the same partial key.
> I see that PG has a one-shot CLUSTER command, but doesn't support > continuously-updated clustered indexes. > > What I infer from newsgroup browsing is, such an index is impossible, > given the MVCC versioning of records (happy to learn I'm wrong). > > I'd be curious to know what other people, who've crossed this same > bridge from MSSQL or Oracle or Sybase to PG, have devised, > faced with the same kind of desired performance gain for retrieving > blocks of rows with the same partial key. I just run clusterdb each weekend in a cron job... Chris
On Tue, 2004-08-24 at 22:28, Mischa Sandberg wrote: > I see that PG has a one-shot CLUSTER command, but doesn't support > continuously-updated clustered indexes. > > What I infer from newsgroup browsing is, such an index is impossible, > given the MVCC versioning of records (happy to learn I'm wrong). It is possible to have MVCC and ordered/indexed heaps, but it isn't something you can just tack onto the currently supported types -- I looked into this myself. It would take substantial additional code infrastructure to support it, basically an alternative heap system and adding support for tables with odd properties to many parts of the system. Pretty non-trivial. This is probably my #1 "I wish postgres had this feature" feature. It is a serious scalability enhancer for big systems and a pain to work around not having. > I'd be curious to know what other people, who've crossed this same > bridge from MSSQL or Oracle or Sybase to PG, have devised, > faced with the same kind of desired performance gain for retrieving > blocks of rows with the same partial key. The CLUSTER command is often virtually useless for precisely the kinds of tables that need to be clustered. My databases are on-line 24x7, and the tables that are ideal candidates for clustering are in the range of 50-100 million rows. I can afford to lock these tables up for no more than 5-10 minutes during off-peak in the hopes that no one notices, and CLUSTER does not work remotely in the ballpark of that fast for tables of that size. People who can run CLUSTER in a cron job must either have relatively small tables or regular large maintenance windows. My solution, which may or may not work for you, was to write a table partitioning system using the natural flexibility and programmability of postgresql (e.g. table inheritance). From this I automatically get a roughly ordered heap according to the index I would cluster on, with only slightly funky SQL access. The end result works much better with CLUSTER too, though CLUSTER is much less necessary at that point because, at least for my particular purposes, the rows are mostly ordered due to how the data was partitioned. So there are ways to work around CLUSTER, but you'll have to be clever and it will require tailoring the solution to your particular requirements. J. Andrew Rogers
How do vendors actually implement auto-clustering? I assume they move rows around during quiet periods or have lots of empty space in each value bucket. --------------------------------------------------------------------------- J. Andrew Rogers wrote: > On Tue, 2004-08-24 at 22:28, Mischa Sandberg wrote: > > I see that PG has a one-shot CLUSTER command, but doesn't support > > continuously-updated clustered indexes. > > > > What I infer from newsgroup browsing is, such an index is impossible, > > given the MVCC versioning of records (happy to learn I'm wrong). > > > It is possible to have MVCC and ordered/indexed heaps, but it isn't > something you can just tack onto the currently supported types -- I > looked into this myself. It would take substantial additional code > infrastructure to support it, basically an alternative heap system and > adding support for tables with odd properties to many parts of the > system. Pretty non-trivial. > > This is probably my #1 "I wish postgres had this feature" feature. It > is a serious scalability enhancer for big systems and a pain to work > around not having. > > > > I'd be curious to know what other people, who've crossed this same > > bridge from MSSQL or Oracle or Sybase to PG, have devised, > > faced with the same kind of desired performance gain for retrieving > > blocks of rows with the same partial key. > > > The CLUSTER command is often virtually useless for precisely the kinds > of tables that need to be clustered. My databases are on-line 24x7, and > the tables that are ideal candidates for clustering are in the range of > 50-100 million rows. I can afford to lock these tables up for no more > than 5-10 minutes during off-peak in the hopes that no one notices, and > CLUSTER does not work remotely in the ballpark of that fast for tables > of that size. People who can run CLUSTER in a cron job must either have > relatively small tables or regular large maintenance windows. > > > My solution, which may or may not work for you, was to write a table > partitioning system using the natural flexibility and programmability of > postgresql (e.g. table inheritance). From this I automatically get a > roughly ordered heap according to the index I would cluster on, with > only slightly funky SQL access. The end result works much better with > CLUSTER too, though CLUSTER is much less necessary at that point > because, at least for my particular purposes, the rows are mostly > ordered due to how the data was partitioned. > > So there are ways to work around CLUSTER, but you'll have to be clever > and it will require tailoring the solution to your particular > requirements. > > > J. Andrew Rogers > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce, > How do vendors actually implement auto-clustering? I assume they move > rows around during quiet periods or have lots of empty space in each > value bucket. That's how SQL Server does it. In old versions (6.5) you had to manually send commands to update the cluster, same as PG. Also, when you create a cluster (or an index or table for that matter) you can manually set an amount of "space" to be held open on each data page for updates. Also keep in mind that SQL Server, as a "single-user database" has a much easier time with this. They don't have to hold several versions of an index in memory and collapse it into a single version at commit time. All that being said, we could do a better job of "auto-balancing" clustered tables. I believe that someone was working on this in Hackers through what they called "B-Tree Tables". What happened to that? -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, 2004-08-26 at 11:18, Bruce Momjian wrote: > How do vendors actually implement auto-clustering? I assume they move > rows around during quiet periods or have lots of empty space in each > value bucket. As far as I know, Oracle does it by having a B-Tree organized heap (a feature introduced around v8 IIRC), basically making the primary key index and the heap the same physical structure. Any non-index columns are stored in the index along with the index columns. Implementing it is slightly weird because searching the index and selecting the rows from the heap are not separate operations. The major caveat to having tables of this type is that you can only have a primary key index. No other indexes are possible because the "heap" constantly undergoes local reorganizations if you have a lot of write traffic, the same kind of reorganization you would normally expect in a BTree index. The performance improvements come from two optimizations. First, you have to touch significantly fewer blocks to get all the rows, even compared to a CLUSTERed heap. Second, the footprint is smaller and plays nicely with the buffer cache. When I've used these types of heaps in Oracle 8 on heavily used tables with tens of millions of rows, we frequently got a 10x or better performance improvement on queries against those tables. It is only really useful for tables with vast quantities of relatively small rows, but it can be a lifesaver in those cases. J. Andrew Rogers
>> How do vendors actually implement auto-clustering? I assume >they move >> rows around during quiet periods or have lots of empty space in each >> value bucket. > > >As far as I know, Oracle does it by having a B-Tree organized heap (a >feature introduced around v8 IIRC), basically making the primary key >index and the heap the same physical structure. Any non-index columns >are stored in the index along with the index columns. Implementing it >is slightly weird because searching the index and selecting the rows >from the heap are not separate operations. Almost the same for MSSQL. The clustered index is always forced unique. If you create a non-unique clustered index, SQLServer will internally pad it with random (or is it sequential? Can't remember right now) data to make each key unique. The clustered index contains all the data fields - both the index key and the other columns from the database. It does support non-clustered indexes as well on the same table. Any "secondary index" will then contain the index key and the primary key value. This means a lookup in a non-clustered index means a two-step index lookup: First look in the non-clustered index for the clustered key. Then look in the clustered index for the rest of the data. Naturally a non-clustered index needs better selectivity before it's actually used than a clustered index does. IIRC, SQL Server always creates clustered indexes by default for primary keys. //Magnus
Magnus, > IIRC, SQL Server always creates clustered indexes by default for primary > keys. I think that's a per-database setting; certainly the ones I admin do not. However, since SQL Server orders its data pages, those data pages tend to be in the order of the primary key regardless if there is no clustered index. -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, 2004-08-26 at 12:30, Magnus Hagander wrote: > Almost the same for MSSQL. The clustered index is always forced unique. > If you create a non-unique clustered index, SQLServer will internally > pad it with random (or is it sequential? Can't remember right now) data > to make each key unique. The clustered index contains all the data > fields - both the index key and the other columns from the database. > > It does support non-clustered indexes as well on the same table. Any > "secondary index" will then contain the index key and the primary key > value. This means a lookup in a non-clustered index means a two-step > index lookup: First look in the non-clustered index for the clustered > key. Then look in the clustered index for the rest of the data. Ah, okay. I see how that would work for a secondary index, though it would make for a slow secondary index. Neat workaround. For all I know, current versions of Oracle may support secondary indexes on index-organized tables; all this Postgres usage over the last couple years has made my Oracle knowledge rusty. > IIRC, SQL Server always creates clustered indexes by default for primary > keys. That would surprise me actually. For some types of tables, e.g. ones with multiple well-used indexes or large rows, index-organizing the heap could easily give worse performance than a normal index/heap pair depending on access patterns. It also tends to be more prone to having locking contention under some access patterns. This is one of those options that needs to be used knowledgeably; it is not a general architectural improvement that you would want to apply to every table all the time. J. Andrew Rogers
Bruce Momjian wrote: > How do vendors actually implement auto-clustering? I assume they move > rows around during quiet periods or have lots of empty space in each > value bucket. > > --------------------------------------------------------------------------- IIRC informix doesn't have it, and you have to recluster periodically the table. After having clustered the table with an index in order to recluster the table with another index you have to release the previous one ( ALTER index TO NOT CLUSTER ), the CLUSTER is an index attribute and each table can have only one index with that attribute ON. Regards Gaetano Mendola
FWIW, Informix does allow the fragmentation of data over named dbspaces by round-robin and expression; this is autosupporting aslong as the dba keeps enough space available. You may also fragment the index although there are some variations dependingon type of Informix (XPS, etc.); this is available in at least 9.3 ... I have never used the index fragmentationas its own beast, but the fragmenting of data works like a charm for spreadling load over more disks. Greg Williamson DBA GlobeXplorer LLC -----Original Message----- From: Gaetano Mendola [mailto:mendola@bigfoot.com] Sent: Thursday, August 26, 2004 2:10 PM To: Bruce Momjian; pgsql-performance@postgresql.org Subject: Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX? Bruce Momjian wrote: > How do vendors actually implement auto-clustering? I assume they move > rows around during quiet periods or have lots of empty space in each > value bucket. > > --------------------------------------------------------------------------- IIRC informix doesn't have it, and you have to recluster periodically the table. After having clustered the table with an index in order to recluster the table with another index you have to release the previous one ( ALTER index TO NOT CLUSTER ), the CLUSTER is an index attribute and each table can have only one index with that attribute ON. Regards Gaetano Mendola ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Updated TODO item: o Automatically maintain clustering on a table This would require some background daemon to maintain clustering during periods of low usage. It might also require tables to be only paritally filled for easier reorganization. It also might require creating a merged heap/index data file so an index lookup would automatically access the heap data too. --------------------------------------------------------------------------- J. Andrew Rogers wrote: > On Thu, 2004-08-26 at 12:30, Magnus Hagander wrote: > > Almost the same for MSSQL. The clustered index is always forced unique. > > If you create a non-unique clustered index, SQLServer will internally > > pad it with random (or is it sequential? Can't remember right now) data > > to make each key unique. The clustered index contains all the data > > fields - both the index key and the other columns from the database. > > > > It does support non-clustered indexes as well on the same table. Any > > "secondary index" will then contain the index key and the primary key > > value. This means a lookup in a non-clustered index means a two-step > > index lookup: First look in the non-clustered index for the clustered > > key. Then look in the clustered index for the rest of the data. > > > Ah, okay. I see how that would work for a secondary index, though it > would make for a slow secondary index. Neat workaround. For all I > know, current versions of Oracle may support secondary indexes on > index-organized tables; all this Postgres usage over the last couple > years has made my Oracle knowledge rusty. > > > > IIRC, SQL Server always creates clustered indexes by default for primary > > keys. > > > That would surprise me actually. For some types of tables, e.g. ones > with multiple well-used indexes or large rows, index-organizing the heap > could easily give worse performance than a normal index/heap pair > depending on access patterns. It also tends to be more prone to having > locking contention under some access patterns. This is one of those > options that needs to be used knowledgeably; it is not a general > architectural improvement that you would want to apply to every table > all the time. > > > J. Andrew Rogers > > > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Updated TODO item: > > o Automatically maintain clustering on a table > > This would require some background daemon to maintain clustering > during periods of low usage. It might also require tables to be only > paritally filled for easier reorganization. It also might require > creating a merged heap/index data file so an index lookup would > automatically access the heap data too. Fwiw, I would say the first "would" is also a "might". None of the previous discussions here presumed a maintenance daemon. The discussions before talked about a mechanism to try to place new tuples as close as possible to the proper index position. I would also suggest making some distinction between a cluster system similar to what we have now but improved to maintain the clustering continuously, and an actual index-organized-table where the tuples are actually only stored in a btree structure. They're two different approaches to similar problems. But they might both be useful to have, and have markedly different implementation details. -- greg
OK, new wording: o Automatically maintain clustering on a table This might require some background daemon to maintain clustering during periods of low usage. It might also require tables to be only paritally filled for easier reorganization. Another idea would be to create a merged heap/index data file so an index lookup would automatically access the heap data too. --------------------------------------------------------------------------- Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Updated TODO item: > > > > o Automatically maintain clustering on a table > > > > This would require some background daemon to maintain clustering > > during periods of low usage. It might also require tables to be only > > paritally filled for easier reorganization. It also might require > > creating a merged heap/index data file so an index lookup would > > automatically access the heap data too. > > Fwiw, I would say the first "would" is also a "might". None of the previous > discussions here presumed a maintenance daemon. The discussions before talked > about a mechanism to try to place new tuples as close as possible to the > proper index position. > > I would also suggest making some distinction between a cluster system similar > to what we have now but improved to maintain the clustering continuously, and > an actual index-organized-table where the tuples are actually only stored in a > btree structure. > > They're two different approaches to similar problems. But they might both be > useful to have, and have markedly different implementation details. > > -- > greg > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Greg Stark wrote: > The discussions before talked about a mechanism to try to place new > tuples as close as possible to the proper index position. Means this that an index shall have a "fill factor" property, similar to Informix one ? From the manual: The FILLFACTOR option takes effect only when you build an index on a table that contains more than 5,000 rows and uses more than 100 table pages, when you create an index on a fragmented table, or when you create a fragmented index on a nonfragmented table. Use the FILLFACTOR option to provide for expansion of an index at a later date or to create compacted indexes. When the index is created, the database server initially fills only that percentage of the nodes specified with the FILLFACTOR value. # Providing a Low Percentage Value If you provide a low percentage value, such as 50, you allow room for growth in your index. The nodes of the index initially fill to a certain percentage and contain space for inserts. The amount of available space depends on the number of keys in each page as well as the percentage value. For example, with a 50-percent FILLFACTOR value, the page would be half full and could accommodate doubling in size. A low percentage value can result in faster inserts and can be used for indexes that you expect to grow. # Providing a High Percentage Value If you provide a high percentage value, such as 99, your indexes are compacted, and any new index inserts result in splitting nodes. The maximum density is achieved with 100 percent. With a 100-percent FILLFACTOR value, the index has no room available for growth; any additions to the index result in splitting the nodes. A 99-percent FILLFACTOR value allows room for at least one insertion per node. A high percentage value can result in faster selects and can be used for indexes that you do not expect to grow or for mostly read-only indexes. Regards Gaetano Mendola
I had FILLFACTOR in the TODO list until just a few months ago, but because no one had discussed it in 3-4 years, I removed the item. I have added mention now in the auto-cluster section because that actually seems like the only good reason for a non-100% fillfactor. I don't think our ordinary btrees have enough of a penalty for splits to make a non-full fillfactor worthwhile, but having a non-full fillfactor for autocluster controls how often items have to be shifted around. --------------------------------------------------------------------------- Gaetano Mendola wrote: > Greg Stark wrote: > > > The discussions before talked about a mechanism to try to place new > > tuples as close as possible to the proper index position. > > Means this that an index shall have a "fill factor" property, similar to > Informix one ? > > From the manual: > > > The FILLFACTOR option takes effect only when you build an index on a table > that contains more than 5,000 rows and uses more than 100 table pages, when > you create an index on a fragmented table, or when you create a fragmented > index on a nonfragmented table. > Use the FILLFACTOR option to provide for expansion of an index at a later > date or to create compacted indexes. > When the index is created, the database server initially fills only that > percentage of the nodes specified with the FILLFACTOR value. > > # Providing a Low Percentage Value > If you provide a low percentage value, such as 50, you allow room for growth > in your index. The nodes of the index initially fill to a certain percentage and > contain space for inserts. The amount of available space depends on the > number of keys in each page as well as the percentage value. > For example, with a 50-percent FILLFACTOR value, the page would be half > full and could accommodate doubling in size. A low percentage value can > result in faster inserts and can be used for indexes that you expect to grow. > > > # Providing a High Percentage Value > If you provide a high percentage value, such as 99, your indexes are > compacted, and any new index inserts result in splitting nodes. The > maximum density is achieved with 100 percent. With a 100-percent > FILLFACTOR value, the index has no room available for growth; any > additions to the index result in splitting the nodes. > A 99-percent FILLFACTOR value allows room for at least one insertion per > node. A high percentage value can result in faster selects and can be used for > indexes that you do not expect to grow or for mostly read-only indexes. > > > > > Regards > Gaetano Mendola > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce, What happened to the B-Tree Table patch discussed on Hackers ad nauseum last winter? -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > Bruce, > > What happened to the B-Tree Table patch discussed on Hackers ad nauseum last > winter? I don't remember that. The only issue I remember is sorting btree index by heap tid on creation. We eventually got that into CVS for 8.0. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Greetings, I am not sure if this applies only to clustering but for storage in general, IIRC Oracle has 2 parameters that can be set at table creation : from Oracle docs PCTFREE integer : Specify the percentage of space in each data block of the table, object table OID index, or partition reserved for future updates to the table's rows. The value of PCTFREE must be a value from 0 to 99. A value of 0 allows the entire block to be filled by inserts of new rows. The default value is 10. This value reserves 10% of each block for updates to existing rows and allows inserts of new rows to fill a maximum of 90% of each block. PCTFREE has the same function in the PARTITION description and in the statements that create and alter clusters, indexes, materialized views, and materialized view logs. The combination of PCTFREE and PCTUSED determines whether new rows will be inserted into existing data blocks or into new blocks. PCTUSED integer Specify the minimum percentage of used space that Oracle maintains for each data block of the table, object table OID index, or index-organized table overflow data segment. A block becomes a candidate for row insertion when its used space falls below PCTUSED. PCTUSED is specified as a positive integer from 0 to 99 and defaults to 40. PCTUSED has the same function in the PARTITION description and in the statements that create and alter clusters, materialized views, and materialized view logs. PCTUSED is not a valid table storage characteristic for an index-organized table (ORGANIZATION INDEX). The sum of PCTFREE and PCTUSED must be equal to or less than 100. You can use PCTFREE and PCTUSED together to utilize space within a table more efficiently. PostgreSQL could take some hints from the above. On Aug 27, 2004, at 1:26 AM, Gaetano Mendola wrote: > Greg Stark wrote: > >> The discussions before talked about a mechanism to try to place new > > tuples as close as possible to the proper index position. > > Means this that an index shall have a "fill factor" property, similar > to > Informix one ? > > From the manual: > > > The FILLFACTOR option takes effect only when you build an index on a > table > that contains more than 5,000 rows and uses more than 100 table pages, > when > you create an index on a fragmented table, or when you create a > fragmented > index on a nonfragmented table. > Use the FILLFACTOR option to provide for expansion of an index at a > later > date or to create compacted indexes. > When the index is created, the database server initially fills only > that > percentage of the nodes specified with the FILLFACTOR value. > > # Providing a Low Percentage Value > If you provide a low percentage value, such as 50, you allow room for > growth > in your index. The nodes of the index initially fill to a certain > percentage and > contain space for inserts. The amount of available space depends on the > number of keys in each page as well as the percentage value. > For example, with a 50-percent FILLFACTOR value, the page would be half > full and could accommodate doubling in size. A low percentage value can > result in faster inserts and can be used for indexes that you expect > to grow. > > > # Providing a High Percentage Value > If you provide a high percentage value, such as 99, your indexes are > compacted, and any new index inserts result in splitting nodes. The > maximum density is achieved with 100 percent. With a 100-percent > FILLFACTOR value, the index has no room available for growth; any > additions to the index result in splitting the nodes. > A 99-percent FILLFACTOR value allows room for at least one insertion > per > node. A high percentage value can result in faster selects and can be > used for > indexes that you do not expect to grow or for mostly read-only indexes. > > > > > Regards > Gaetano Mendola > > > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > > -- Adi Alurkar (DBA sf.NET) <adi@vasoftware.com> 1024D/79730470 A491 5724 74DE 956D 06CB D844 6DF1 B972 7973 0470
But what is the advantage of non-full pages in Oracle? --------------------------------------------------------------------------- Adi Alurkar wrote: > Greetings, > > I am not sure if this applies only to clustering but for storage in > general, > > IIRC Oracle has 2 parameters that can be set at table creation : > from Oracle docs > > PCTFREE integer : > Specify the percentage of space in each data block of the table, object > table OID index, or partition reserved for future updates to the > table's rows. The value of PCTFREE must be a value from 0 to 99. A > value of 0 allows the entire block to be filled by inserts of new rows. > The default value is 10. This value reserves 10% of each block for > updates to existing rows and allows inserts of new rows to fill a > maximum of 90% of each block. > PCTFREE has the same function in the PARTITION description and in the > statements that create and alter clusters, indexes, materialized views, > and materialized view logs. The combination of PCTFREE and PCTUSED > determines whether new rows will be inserted into existing data blocks > or into new blocks. > > PCTUSED integer > Specify the minimum percentage of used space that Oracle maintains for > each data block of the table, object table OID index, or > index-organized table overflow data segment. A block becomes a > candidate for row insertion when its used space falls below PCTUSED. > PCTUSED is specified as a positive integer from 0 to 99 and defaults to > 40. > PCTUSED has the same function in the PARTITION description and in the > statements that create and alter clusters, materialized views, and > materialized view logs. > PCTUSED is not a valid table storage characteristic for an > index-organized table (ORGANIZATION INDEX). > The sum of PCTFREE and PCTUSED must be equal to or less than 100. You > can use PCTFREE and PCTUSED together to utilize space within a table > more efficiently. > > PostgreSQL could take some hints from the above. > > On Aug 27, 2004, at 1:26 AM, Gaetano Mendola wrote: > > > Greg Stark wrote: > > > >> The discussions before talked about a mechanism to try to place new > > > tuples as close as possible to the proper index position. > > > > Means this that an index shall have a "fill factor" property, similar > > to > > Informix one ? > > > > From the manual: > > > > > > The FILLFACTOR option takes effect only when you build an index on a > > table > > that contains more than 5,000 rows and uses more than 100 table pages, > > when > > you create an index on a fragmented table, or when you create a > > fragmented > > index on a nonfragmented table. > > Use the FILLFACTOR option to provide for expansion of an index at a > > later > > date or to create compacted indexes. > > When the index is created, the database server initially fills only > > that > > percentage of the nodes specified with the FILLFACTOR value. > > > > # Providing a Low Percentage Value > > If you provide a low percentage value, such as 50, you allow room for > > growth > > in your index. The nodes of the index initially fill to a certain > > percentage and > > contain space for inserts. The amount of available space depends on the > > number of keys in each page as well as the percentage value. > > For example, with a 50-percent FILLFACTOR value, the page would be half > > full and could accommodate doubling in size. A low percentage value can > > result in faster inserts and can be used for indexes that you expect > > to grow. > > > > > > # Providing a High Percentage Value > > If you provide a high percentage value, such as 99, your indexes are > > compacted, and any new index inserts result in splitting nodes. The > > maximum density is achieved with 100 percent. With a 100-percent > > FILLFACTOR value, the index has no room available for growth; any > > additions to the index result in splitting the nodes. > > A 99-percent FILLFACTOR value allows room for at least one insertion > > per > > node. A high percentage value can result in faster selects and can be > > used for > > indexes that you do not expect to grow or for mostly read-only indexes. > > > > > > > > > > Regards > > Gaetano Mendola > > > > > > > > > > > > ---------------------------(end of > > broadcast)--------------------------- > > TIP 7: don't forget to increase your free space map settings > > > > > -- > Adi Alurkar (DBA sf.NET) <adi@vasoftware.com> > 1024D/79730470 A491 5724 74DE 956D 06CB D844 6DF1 B972 7973 0470 > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of > Bruce Momjian > Sent: Friday, August 27, 2004 1:27 PM > To: Adi Alurkar > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX? > > > > But what is the advantage of non-full pages in Oracle? > One advantage has to do with updates of variable-length columns, e.g. varchars. If the block is fully packed with data, an update to a varchar column that makes the column wider, causes "row-chaining". This means that a portion of the row is stored in a different data block, which may be somewhere completely different in the storage array. Retrieving that row (or even just that column from that row) as a unit may now require additional disk seek(s). Leaving some space for updates in each data block doesn't prevent this problem completely, but mitigates it to a certain extent. If for instance a row is typically inserted with a null value for a varchar column, but the application developer knows it will almost always get updated with some value later on, then leaving a certain percentage of empty space in each block allocated to that table makes sense. Conversely, if you know that your data is never going to get updated (e.g. a data warehousing application), you might specify to pack the blocks as full as possible. This makes for the most efficient data retrieval performance. - Jeremy
IIRC it it to reduce the "overflow" of data or what oracle calls chained rows. i.e if a table has variable length columns and 10 rows get inserted into a datapage, if this datapage is full and one of the variable length field gets updated the row will now "overflow" into another datapage, but if the datapage is created with an appropriate amount of free space the updated row will be stored in one single datapage. On Aug 27, 2004, at 10:27 AM, Bruce Momjian wrote: > > But what is the advantage of non-full pages in Oracle? > > ----------------------------------------------------------------------- > ---- > > Adi Alurkar wrote: >> Greetings, >> >> I am not sure if this applies only to clustering but for storage in >> general, >> >> IIRC Oracle has 2 parameters that can be set at table creation : >> from Oracle docs >> >> PCTFREE integer : >> Specify the percentage of space in each data block of the table, >> object >> table OID index, or partition reserved for future updates to the >> table's rows. The value of PCTFREE must be a value from 0 to 99. A >> value of 0 allows the entire block to be filled by inserts of new >> rows. >> The default value is 10. This value reserves 10% of each block for >> updates to existing rows and allows inserts of new rows to fill a >> maximum of 90% of each block. >> PCTFREE has the same function in the PARTITION description and in the >> statements that create and alter clusters, indexes, materialized >> views, >> and materialized view logs. The combination of PCTFREE and PCTUSED >> determines whether new rows will be inserted into existing data blocks >> or into new blocks. >> >> PCTUSED integer >> Specify the minimum percentage of used space that Oracle maintains for >> each data block of the table, object table OID index, or >> index-organized table overflow data segment. A block becomes a >> candidate for row insertion when its used space falls below PCTUSED. >> PCTUSED is specified as a positive integer from 0 to 99 and defaults >> to >> 40. >> PCTUSED has the same function in the PARTITION description and in the >> statements that create and alter clusters, materialized views, and >> materialized view logs. >> PCTUSED is not a valid table storage characteristic for an >> index-organized table (ORGANIZATION INDEX). >> The sum of PCTFREE and PCTUSED must be equal to or less than 100. You >> can use PCTFREE and PCTUSED together to utilize space within a table >> more efficiently. >> >> PostgreSQL could take some hints from the above. >> >> On Aug 27, 2004, at 1:26 AM, Gaetano Mendola wrote: >> >>> Greg Stark wrote: >>> >>>> The discussions before talked about a mechanism to try to place new >>>> tuples as close as possible to the proper index position. >>> >>> Means this that an index shall have a "fill factor" property, similar >>> to >>> Informix one ? >>> >>> From the manual: >>> >>> >>> The FILLFACTOR option takes effect only when you build an index on a >>> table >>> that contains more than 5,000 rows and uses more than 100 table >>> pages, >>> when >>> you create an index on a fragmented table, or when you create a >>> fragmented >>> index on a nonfragmented table. >>> Use the FILLFACTOR option to provide for expansion of an index at a >>> later >>> date or to create compacted indexes. >>> When the index is created, the database server initially fills only >>> that >>> percentage of the nodes specified with the FILLFACTOR value. >>> >>> # Providing a Low Percentage Value >>> If you provide a low percentage value, such as 50, you allow room for >>> growth >>> in your index. The nodes of the index initially fill to a certain >>> percentage and >>> contain space for inserts. The amount of available space depends on >>> the >>> number of keys in each page as well as the percentage value. >>> For example, with a 50-percent FILLFACTOR value, the page would be >>> half >>> full and could accommodate doubling in size. A low percentage value >>> can >>> result in faster inserts and can be used for indexes that you expect >>> to grow. >>> >>> >>> # Providing a High Percentage Value >>> If you provide a high percentage value, such as 99, your indexes are >>> compacted, and any new index inserts result in splitting nodes. The >>> maximum density is achieved with 100 percent. With a 100-percent >>> FILLFACTOR value, the index has no room available for growth; any >>> additions to the index result in splitting the nodes. >>> A 99-percent FILLFACTOR value allows room for at least one insertion >>> per >>> node. A high percentage value can result in faster selects and can be >>> used for >>> indexes that you do not expect to grow or for mostly read-only >>> indexes. >>> >>> >>> >>> >>> Regards >>> Gaetano Mendola >>> >>> >>> >>> >>> >>> ---------------------------(end of >>> broadcast)--------------------------- >>> TIP 7: don't forget to increase your free space map settings >>> >>> >> -- >> Adi Alurkar (DBA sf.NET) <adi@vasoftware.com> >> 1024D/79730470 A491 5724 74DE 956D 06CB D844 6DF1 B972 7973 0470 >> >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 6: Have you searched our list archives? >> >> http://archives.postgresql.org >> > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania > 19073 > > ---------------------------(end of > broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if > your > joining column's datatypes do not match > > -- Adi Alurkar (DBA sf.NET) <adi@vasoftware.com> 1024D/79730470 A491 5724 74DE 956D 06CB D844 6DF1 B972 7973 0470
Adi Alurkar wrote: > IIRC it it to reduce the "overflow" of data or what oracle calls > chained rows. i.e if a table has variable length columns and 10 rows > get inserted into a datapage, if this datapage is full and one of the > variable length field gets updated the row will now "overflow" into > another datapage, but if the datapage is created with an appropriate > amount of free space the updated row will be stored in one single > datapage. Agreed. What I am wondering is with our system where every update gets a new row, how would this help us? I know we try to keep an update on the same row as the original, but is there any significant performance benefit to doing that which would offset the compaction advantage? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Agreed. What I am wondering is with our system where every update gets > a new row, how would this help us? I know we try to keep an update on > the same row as the original, but is there any significant performance > benefit to doing that which would offset the compaction advantage? Because Oracle uses overwrite-in-place (undoing from an UNDO log on transaction abort), while we always write a whole new row, it would take much larger PCTFREE wastage to get a useful benefit in PG than it does in Oracle. That wastage translates directly into increased I/O costs, so I'm a bit dubious that we should assume there is a win to be had here just because Oracle offers the feature. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Agreed. What I am wondering is with our system where every update gets > a new row, how would this help us? I know we try to keep an update on > the same row as the original, but is there any significant performance > benefit to doing that which would offset the compaction advantage? Hm. Posit a system where all transactions are short updates executed in autocommit mode. In such a system as soon as a transaction commits it would take a very short time before the previous record was a dead tuple. If every backend kept a small list of tuples it had marked deleted and whenever it was idle checked to see if they were dead yet, it might avoid much of the need for vacuum. And in such a circumstance I think you wouldn't need more than a pctfree of 50% even on a busy table. Every tuple would need about one extra slot. This would only be a reasonable idea if a) if the list of potential dead tuples is short and if it overflows it just forgets them leaving them for vacuum to deal with. and b) It only checks the potentially dead tuples when the backend is otherwise idle. Even so it would be less efficient than a batch vacuum, and it would be taking up i/o bandwidth (to maintain indexes even if the heap buffer is in ram), even if that backend is idle it doesn't mean other backends couldn't have used that i/o bandwidth. But I think it would deal with a lot of the complaints about vacuum and it would make it more feasible to use a pctfree parameter to make clustering more effective. -- greg
Bruce Momjian <pgman@candle.pha.pa.us> writes: > but is there any significant performance benefit to doing that which would > offset the compaction advantage? Just as a side comment. Setting PCTFREE 0 PCTUSED 100 on tables that have no updates on them has an astonishingly big effect on speed. So the penalty for leaving some space free really is substantial. I think the other poster is right. Oracle really needs pctfree because of the way it handles updates. Postgres doesn't really need as much because it doesn't try to squeeze the new tuple in the space the old one took up. If it doesn't fit on the page the worst that happens is it has to store it on some other page, whereas oracle has to do its strange row chaining thing. -- greg
Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > but is there any significant performance benefit to doing that which would > > offset the compaction advantage? > > Just as a side comment. Setting PCTFREE 0 PCTUSED 100 on tables that have no > updates on them has an astonishingly big effect on speed. So the penalty for > leaving some space free really is substantial. > > I think the other poster is right. Oracle really needs pctfree because of the > way it handles updates. Postgres doesn't really need as much because it > doesn't try to squeeze the new tuple in the space the old one took up. If it > doesn't fit on the page the worst that happens is it has to store it on some > other page, whereas oracle has to do its strange row chaining thing. Oracle also does that chain thing so moving updates to different pages might have more of an impact than it does on PostgreSQL. We have chains too but just for locking. Not sure on Oracle. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >>Agreed. What I am wondering is with our system where every update gets >>a new row, how would this help us? I know we try to keep an update on >>the same row as the original, but is there any significant performance >>benefit to doing that which would offset the compaction advantage? > > > Because Oracle uses overwrite-in-place (undoing from an UNDO log on > transaction abort), while we always write a whole new row, it would take > much larger PCTFREE wastage to get a useful benefit in PG than it does > in Oracle. That wastage translates directly into increased I/O costs, > so I'm a bit dubious that we should assume there is a win to be had here > just because Oracle offers the feature. Mmmm. Consider this scenario: ctid datas (0,1) yyy-xxxxxxxxxxxxxxxxxxx (0,2) -------- EMPTY -------- (0,3) -------- EMPTY -------- (0,4) -------- EMPTY -------- (0,5) -------- EMPTY -------- (0,6) yyy-xxxxxxxxxxxxxxxxxxx (0,7) -------- EMPTY -------- .... -------- EMPTY -------- (0,11) yyy-xxxxxxxxxxxxxxxxxxx the row (0,2) --> (0,5) are space available for the (0,1) updates. This will help a table clustered ( for example ) to mantain his own correct cluster order. Regards Gaetano Mendola
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bruce Momjian wrote: | Gaetano Mendola wrote: | |>Tom Lane wrote: |> |> > Bruce Momjian <pgman@candle.pha.pa.us> writes: |> > |> >>Agreed. What I am wondering is with our system where every update gets |> >>a new row, how would this help us? I know we try to keep an update on |> >>the same row as the original, but is there any significant performance |> >>benefit to doing that which would offset the compaction advantage? |> > |> > |> > Because Oracle uses overwrite-in-place (undoing from an UNDO log on |> > transaction abort), while we always write a whole new row, it would take |> > much larger PCTFREE wastage to get a useful benefit in PG than it does |> > in Oracle. That wastage translates directly into increased I/O costs, |> > so I'm a bit dubious that we should assume there is a win to be had here |> > just because Oracle offers the feature. |> |>Mmmm. Consider this scenario: |> |>ctid datas |>(0,1) yyy-xxxxxxxxxxxxxxxxxxx |>(0,2) -------- EMPTY -------- |>(0,3) -------- EMPTY -------- |>(0,4) -------- EMPTY -------- |>(0,5) -------- EMPTY -------- |>(0,6) yyy-xxxxxxxxxxxxxxxxxxx |>(0,7) -------- EMPTY -------- |>.... -------- EMPTY -------- |>(0,11) yyy-xxxxxxxxxxxxxxxxxxx |> |> |>the row (0,2) --> (0,5) are space available for the (0,1) updates. |>This will help a table clustered ( for example ) to mantain his |>own correct cluster order. | | | Right. My point was that non-full fill is valuable for us only when | doing clustering, while for Oracle it is a win even in non-cluster cases | because of the way they update in place. Don't you think this will permit also to avoid extra disk seek and cache invalidation? If you are updating the row (0,1) I think is less expensive put the new version in (0,2) instead of thousand line far from that point. Regards Gaetano Mendola -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBL+kA7UpzwH2SGd4RAp6fAJ9rSs5xmTXsy4acUGcnCRTbEUCwrwCgo/o6 0JPtziuf1E/EGLaqjbPMV44= =pIgX -----END PGP SIGNATURE-----
Gaetano Mendola wrote: > | Right. My point was that non-full fill is valuable for us only when > | doing clustering, while for Oracle it is a win even in non-cluster cases > | because of the way they update in place. > > Don't you think this will permit also to avoid extra disk seek and cache > invalidation? If you are updating the row (0,1) I think is less expensive > put the new version in (0,2) instead of thousand line far from that point. It would, but does that outweigh the decreased I/O by having things more densely packed? I would think not. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Don't you think this will permit also to avoid extra disk seek and cache > > invalidation? If you are updating the row (0,1) I think is less expensive > > put the new version in (0,2) instead of thousand line far from that point. Well if the other buffer "a thousand lines far from that point" is already in ram, then no, there's no penalty at the time for storing it there. However it destroys the clustering, which was the original point. > It would, but does that outweigh the decreased I/O by having things more > densely packed? I would think not. Well the dense packing is worth something. But so is the clustering. There's definitely a trade-off. I always found my largest tables are almost always insert-only tables anyways. So in Oracle I would have pctused 100 pctfree 0 on them and get the performance gain. The tables that would benefit from this would be tables always accessed by indexes in index scans of more than one record. The better the clustering the fewer pages the index scan would have to read in. If the data took 10% more space but the index scan only needs 1/4 as many buffers it could be a big net win. -- greg
Greg Stark wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > >>>Don't you think this will permit also to avoid extra disk seek and cache >>>invalidation? If you are updating the row (0,1) I think is less expensive >>>put the new version in (0,2) instead of thousand line far from that point. > > > Well if the other buffer "a thousand lines far from that point" is already in > ram, then no, there's no penalty at the time for storing it there. I was wandering about the cache invalidation, may be the ram is big enough but I doubt about the cache, the recommendation in this case is to modify adjacent memory address instead of jumping. Regards Gaetano Mendola
Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains its space saturation as part of each update operation. High activity does indeed result in less-full pages (typically 60-80% full for tables with heavy deletions or rowsize changes). To bring the percentage back up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a normal file defragmenter -- put related disk pages together on the platter. But the performance difference is hardly as severe as I gather it can be if you neglect to vacuum. As for SQL Server being a 'single-user database' ... ummm ... no, I don't think so. I'm REALLY happy to be shut of the Microsoft world, but MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake) Larsen away from academia, and it shows, in the join and aggregate processing. I'll be a happy camper if I manage to contribute something to PG that honks the way their stuff does. Happy to discuss, too. Josh Berkus wrote: > Bruce, > > >>How do vendors actually implement auto-clustering? I assume they move >>rows around during quiet periods or have lots of empty space in each >>value bucket. > > > That's how SQL Server does it. In old versions (6.5) you had to manually > send commands to update the cluster, same as PG. Also, when you create a > cluster (or an index or table for that matter) you can manually set an amount > of "space" to be held open on each data page for updates. > > Also keep in mind that SQL Server, as a "single-user database" has a much > easier time with this. They don't have to hold several versions of an index > in memory and collapse it into a single version at commit time. > > All that being said, we could do a better job of "auto-balancing" clustered > tables. I believe that someone was working on this in Hackers through what > they called "B-Tree Tables". What happened to that? >
I think you've probably fingered the kicker of why PG doesn't have this kind of clustering already. Hence perhaps the need for other approaches to the issue (the disk-IO efficiency of reading groups of rows related by a common key) that other DB's (with in-place update) address with synchronous clustering ('heap rebalancing' ?). Bruce Momjian wrote: > Adi Alurkar wrote: > >>IIRC it it to reduce the "overflow" of data or what oracle calls >>chained rows. i.e if a table has variable length columns and 10 rows >>get inserted into a datapage, if this datapage is full and one of the >>variable length field gets updated the row will now "overflow" into >>another datapage, but if the datapage is created with an appropriate >>amount of free space the updated row will be stored in one single >>datapage. > > > Agreed. What I am wondering is with our system where every update gets > a new row, how would this help us? I know we try to keep an update on > the same row as the original, but is there any significant performance > benefit to doing that which would offset the compaction advantage? >
This discussion is starting to sound like the split in HEAP memory management evolution, into garbage-collecting (e.g. Java) and non-garbage-collecting (e.g. C++). Reclamation by GC's these days has become seriously sophisticated. CLUSTER resembles the first generation of GC's, which were single-big-pass hold-everything-else threads. Perhaps the latest in incremental GC algorithms would be worth scouting, for the next step in PG page management. Greg Stark wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >>but is there any significant performance benefit to doing that which would >>offset the compaction advantage? > > Just as a side comment. Setting PCTFREE 0 PCTUSED 100 on tables that have no > updates on them has an astonishingly big effect on speed. So the penalty for > leaving some space free really is substantial. > > I think the other poster is right. Oracle really needs pctfree because of the > way it handles updates. Postgres doesn't really need as much because it > doesn't try to squeeze the new tuple in the space the old one took up. If it > doesn't fit on the page the worst that happens is it has to store it on some > other page, whereas oracle has to do its strange row chaining thing.
Sheer nitpick here... A B-tree is where the records (data) live at all levels of the tree; B+ tree is where the records are only at the leaf level. That's what Knuth calls them, anyway. Clustered indexes for all known dbs are true B+ trees. Nonclustered indexes could be B-trees (probably aren't), since there's no big fanout penalty for storing the little (heap) row locators everywhere at all levels. J. Andrew Rogers wrote: > As far as I know, Oracle does it by having a B-Tree organized heap (a > feature introduced around v8 IIRC), basically making the primary key > index and the heap the same physical structure. ...
J. Andrew Rogers wrote: > On Thu, 2004-08-26 at 12:30, Magnus Hagander wrote: >>IIRC, SQL Server always creates clustered indexes by default for primary >>keys. > > That would surprise me actually. Yaz, it should. It doesn't ALWAYS create clustered (unique) index for primary keys, but clustered is the default if you just specify CREATE TABLE Foo (col1, ... ,PRIMARY KEY(col1, ...) ) Saying PRIMARY KEY NONCLUSTERED(...) is how you override the default. ((Weird to be discussing so much MSSQL here))
Mishca, > Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains > its space saturation as part of each update operation. High activity > does indeed result in less-full pages (typically 60-80% full for tables > with heavy deletions or rowsize changes). To bring the percentage back > up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a > normal file defragmenter -- put related disk pages together on the platter. Sure, it does now, which is a nice thing. It didn't in the first version (6.5) where this cluster maint needed to be done manually and asynchronously, as I recall. > As for SQL Server being a 'single-user database' ... ummm ... no, I > don't think so. Hmmm ... perhaps it would be better if I said "serial-only database". MSSQL (like earlier versions of Sybase) handles transactions by spooling everything out to a serial log, effectively making all transcations SERIAL isolation. This has some significant benefits in performance for OLAP and data warehousing, but really kills you on high-concurrency transaction processing. > I'm REALLY happy to be shut of the Microsoft world, but > MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious > bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake) > Larsen away from academia, and it shows, in the join and aggregate > processing. I'll be a happy camper if I manage to contribute something > to PG that honks the way their stuff does. Happy to discuss, too. Yeah, they also have very speedy cursor handling. I can do things with cursors in MSSQL that I wouldn't consider with other DBs. Not that that makes up for the lack of other functionality, but it is nice when you need it. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Thu, Aug 26, 2004 at 12:04:48PM -0700, J. Andrew Rogers wrote: > The major caveat to having tables of this type is that you can only have > a primary key index. No other indexes are possible because the "heap" > constantly undergoes local reorganizations if you have a lot of write > traffic, the same kind of reorganization you would normally expect in a > BTree index. This isn't true, at least in 9i. You can create whatever indexes you want on an index-organized table. I believe that the index stores the PK value instead of the ROWID. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Thu, Aug 26, 2004 at 11:39:42PM -0400, Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Updated TODO item: > > > > o Automatically maintain clustering on a table > > > > This would require some background daemon to maintain clustering > > during periods of low usage. It might also require tables to be only > > paritally filled for easier reorganization. It also might require > > creating a merged heap/index data file so an index lookup would > > automatically access the heap data too. > > Fwiw, I would say the first "would" is also a "might". None of the previous > discussions here presumed a maintenance daemon. The discussions before talked > about a mechanism to try to place new tuples as close as possible to the > proper index position. > > I would also suggest making some distinction between a cluster system similar > to what we have now but improved to maintain the clustering continuously, and > an actual index-organized-table where the tuples are actually only stored in a > btree structure. > > They're two different approaches to similar problems. But they might both be > useful to have, and have markedly different implementation details. There's a third approach that I think is worth considering. Half of the benefit to clustered tables is limiting the number of pages you need to access when scanning the primary key. The order of tuples in the pages themselves isn't nearly as important as ordering of the pages. This means you can get most of the benefit of an index-organized table just by being careful about what page you place a tuple on. What I'm thinking of is some means to ensure all the tuples on a page are within some PK range, but not worrying about the exact order within the page since it's relatively cheap to scan through the page in memory. Some pros: This would probably mean less change to the code that inserts tuples. No need for a background daemon. No need to create a new B-Tree table structure. Ideally, there won't be need to move tuples around, which should mean that current indexing code doesn't need to change. Cons: Need to have some way to deal with pages that fill up. To gain full benefit some means of indicating what range of PK values are on a page might be needed. It's not as beneficial as a true IOT since you don't get the benefit of storing your tuples inline with your B-Tree. I'm sure there's a ton of things I'm missing, especially since I'm not familiar with the postgresql code, but hopefully others can explore this further. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Mischa Sandberg wrote: > Coming from the MSSQL world, I'm used to the first step in optimization > to be, choose your clustered index and choose it well. > I see that PG has a one-shot CLUSTER command, but doesn't support > continuously-updated clustered indexes. > What I infer from newsgroup browsing is, such an index is impossible, > given the MVCC versioning of records (happy to learn I'm wrong). > I'd be curious to know what other people, who've crossed this same > bridge from MSSQL or Oracle or Sybase to PG, have devised, > faced with the same kind of desired performance gain for retrieving > blocks of rows with the same partial key. Just to let people know, after trying various options, this looks the most promising: - segment the original table into four tables (call them A,B,C,D) - all insertions go into A. - longterm data lives in B. - primary keys of all requests to delete rows from (B) go into D -- no actual deletions are done against B. Deletions against A happen as normal. - all queries are made against a view: a union of A and B and (not exists) D. - daily merge A,B and (where not exists...) D, into C - run cluster on C, then swap names on B and C, truncate A and D. Not rocket science, but it seems to give the payback of normal clustering without locking the table for long periods of time. It also saves on VACUUM FULL time. At present, we're only at 1M rows in B on this. More when I know it. Advance warning on any gotchas with this approach would be much appreciated. Making a complete copy of (B) is a bit of an ouch.