Thread: Key features for data warehousing
Hi, I would love to recommend PG to our management, mostly because MVCC does not have the severe problems with long transactions in high-concurrency environments that locking databases have. Here are the reasons why I can't: 1. The CLUSTER statement allows one-time physical row reordering. But a real clustered index, available in DB2, Sybase and Informix, maintains the clustering permanently and uses less space than an ordinary index (which copies the key columns). This is on the TODO list, but may be difficult to do with MVCC. However I wouldn't miss it at all if we could have MDC instead! 2. MDC (Multi-Dimensional Clustering) is great for data warehousing. For keys from relatively small finite sets, MDC allows efficient range queries and joins in multiple dimensions, maintaining the cluistering without explicit reorganisation. Key vectors are not stored repeatedly. http://www7b.software.ibm.com/dmdd/library/techarticle/0207huras/0207huras.h tml http://216.239.39.100/search?q=cache:RGwncC6_dxUC:www-8.ibm.com/software/au/ universe/download/db2/id112.pdf+db2+multidimensional+clustering+organized&hl =en&ie=UTF-8 3. I believe the row overhead is 40 bytes (recently reduced to 36 or 32 ?). This is WAY too much for fact tables with many small rows. High-end storage is pretty expensive, and less rows per page means more I/O. I believe Sybase and Informix can do with 4, and Interbase / Firebird (using MVCC) with 16 bytes per row. Could anyone kindly explain why PostgreSQL needs this much space? + We need a pointer to the row (4 bytes) - unless the row size is fix. + There is the OID - unless deactivated in the table definition. (I always wished OIDs were off by default, for better portability.) + For MVCC, each row must be associated with a transaction number number or version - unless it is not involved in any transaction, which is usually true for most rows, and after a restart for all. There must be a way to recognize obsolete versions in VACUUM or after a crash, but why isn't a tristate enough (committed/uncommitted/rolled back) ? + Why not hold the "version map" in a system table, only for versions that are valid and not involved in any transactions? + Why not hold the "version map" in RAM, swapping or explicitly paging parts to the disk when necessary? + And finally, what is all the rest for? 4. Unsexy but badly needed: Replication. I would happily drop other alternatives if Replication and MDC were implemented and the row overhead reduced to 16 bytes or less. Looking forward to your answers and replies. Best regards, Oliver
On Wed, Apr 09, 2003 at 01:38:15PM +0200, Reiter, Oliver wrote: > 4. Unsexy but badly needed: Replication. There is no synchronous replication, but async approaches are available now, in both free-license and commercial-license forms. (Can't help with your other issues, though.) A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Wed, Apr 09, 2003 at 01:38:15PM +0200, Reiter, Oliver wrote: > 3. I believe the row overhead is 40 bytes (recently reduced to 36 or 32 ?). > This is WAY too much for fact tables with many small rows. High-end > storage is pretty expensive, and less rows per page means more I/O. > I believe Sybase and Informix can do with 4, and Interbase / Firebird > (using MVCC) with 16 bytes per row. > > Could anyone kindly explain why PostgreSQL needs this much space? See http://developer.postgresql.org/docs/postgres/page.html about two-thirds down the page. If you can see a way to remove most of those values, I'm sure the developers would love to hear from you! > + We need a pointer to the row (4 bytes) - unless the row size is fix. > + There is the OID - unless deactivated in the table definition. > (I always wished OIDs were off by default, for better portability.) > + For MVCC, each row must be associated with a transaction number > number or version - unless it is not involved in any transaction, > which > is usually true for most rows, and after a restart for all. There > must be > a way to recognize obsolete versions in VACUUM or after a crash, but > why isn't a tristate enough (committed/uncommitted/rolled back) ? The transaction number defines which transactions can currently see this tuple. This is what the xmin, cmin, xmax, cmax fields are for. > + Why not hold the "version map" in a system table, only for versions > that are valid and not involved in any transactions? > + Why not hold the "version map" in RAM, swapping or explicitly > paging parts to the disk when necessary? So that when you crash you can still tell which tuples are valid. Consider: begin; update table set ...; *crash* After the update you have all the rows updated stored twice and no way to tell which one is the right one except for the xmin,xmax values and the list of commited transactions. So, these need to be stored on disk. Preferably with the tuple so you don't end up with problem of the system table getting out of sync with the data table when the disk lost one of the updates while the machine was dying. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > "the West won the world not by the superiority of its ideas or values or > religion but rather by its superiority in applying organized violence. > Westerners often forget this fact, non-Westerners never do." > - Samuel P. Huntington
Attachment
On Wed, 2003-04-09 at 19:38, Martijn van Oosterhout wrote: > On Wed, Apr 09, 2003 at 01:38:15PM +0200, Reiter, Oliver wrote: [snip] > So that when you crash you can still tell which tuples are valid. Consider: > > begin; > update table set ...; > *crash* > > After the update you have all the rows updated stored twice and no way to > tell which one is the right one except for the xmin,xmax values and the list > of commited transactions. So, these need to be stored on disk. Preferably > with the tuple so you don't end up with problem of the system table getting > out of sync with the data table when the disk lost one of the updates while > the machine was dying. And this is the "price" that the PG developers decided to pay when they decided to use MVCC instead of some other scheme... Maybe PG just isn't great at DW? -- +----------------------------------------------------------------+ | Ron Johnson, Jr. mailto:ron.l.johnson@cox.net | | Jefferson, LA USA http://members.cox.net/ron.l.johnson | | | | "A C program is like a fast dance on a newly waxed dance floor | | by people carrying razors." | | Waldi Ravens | +----------------------------------------------------------------+