Thread: Data archiving/warehousing idea
G'day hackers,

I had some hand-wavy thoughts about potential gains for postgres in the data archiving/warehousing area. I'm not able to do any work myself on this, and don't actually have a pressing need for it, so I'm not "requesting" someone do it, but I thought it might be worth discussing (if it hasn't been already - I couldn't find anything in the mail archives, but that doesn't mean it's not there...)

The main idea is that there might be space utilisation and performance advantages if postgres had "hard" read-only tables, i.e. tables which were guaranteed (by postgres) to never have their data changed (insert/update/delete).

This could potentially save disk space by allowing "book keeping" elements in the page/tuple headers to be removed, e.g. visibility information etc. Also, some indexes could potentially be packed tighter if we know the data will never change (of course this is already available using the fillfactor control).

There could be performance advantages in areas like:

* more efficient disk buffering due to the reduced disk space requirements per the above.

* no need to visit the tuple store for visibility info during an index scan.

* greatly reduced or even completely removed locking. If the table is guaranteed read-only, there's no need to lock?

* planner optimisation, e.g. changing the cost of index and sequential scans for the table due to the previous points; there might also be table stats which would be very useful to the planner but which are too expensive to maintain for changing data.

The idea would be to introduce a statement something like:

  ALTER TABLE foo SET ARCHIVE;

This would tell postgres to rewrite the on-disk table to the "read only" format, rewrite the indexes for maximum packing, collect stats for the planner, etc.

Thoughts? Brickbats?

Cheers,

Chris.
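P.S. To make the (entirely hypothetical) syntax concrete, a load-then-archive session might look something like the sketch below - the table, column and file names are invented, and SET ARCHIVE is of course not anything postgres understands today:

  -- load the data while the table is still in the normal, writable format
  CREATE TABLE measurements_2006 (tstamp timestamptz, sensor int, reading float8);
  COPY measurements_2006 FROM '/data/measurements_2006.csv' WITH CSV;

  -- hypothetical: rewrite to the packed read-only format, repack the
  -- indexes and collect planner stats; from here on any
  -- INSERT/UPDATE/DELETE on the table would simply be rejected
  ALTER TABLE measurements_2006 SET ARCHIVE;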
On Thu, 1 Feb 2007, Chris Dunlop wrote:
> G'day hackers,

G'day Chris,

> already - I couldn't find anything in the mail archives, but
> that doesn't mean it's not there...)

There has been a lot of discussion about this kind of thing over the years.

> The main idea is that there might be space utilisation and
> performance advantages if postgres had "hard" read-only tables,
> i.e. tables which were guaranteed (by postgres) to never have
> their data changed (insert/update/delete).
>
> This could potentially save disk space by allowing "book
> keeping" elements in the page/tuple headers to be removed, e.g.
> visibility information etc. Also, some indexes could
> potentially be packed tighter if we know the data will never
> change (of course this is already available using the fillfactor
> control).

Well, there is also CPU overhead in doing MVCC, but there are a few fundamental problems that must be overcome. The most significant is that no useful table is always read only, otherwise you could never load it. What do we do in the presence of a failure during the load, or a user-issued ABORT? I guess we'd truncate the table... What about replay after a crash?

Another way of looking at it is: we use the 'bookkeeping' information in the tuple header for concurrency and for handling transaction aborts.

> The idea would be to introduce a statement something like:
>
>   ALTER TABLE foo SET ARCHIVE;

I'd not thought of that approach. There are two problems: some archive tables are so large that loading them and then reprocessing them isn't appealing. Secondly, we'd be rewriting the binary structure of the table, and this does not suit the non-overwriting nature of Postgres's storage system.

A different approach discussed earlier involves greatly restricting the way in which the table is used. This table could only be written to if an exclusive lock is held; on error or ABORT, the table is truncated.

The problem is that a lot of this looks like a hack, and I haven't seen a very clean approach which has gone beyond a basic brain dump.

Thanks,

Gavin
G'day Gavin,

In maillist.postgres.dev, you wrote:
> On Thu, 1 Feb 2007, Chris Dunlop wrote:
>> The main idea is that there might be space utilisation and
>> performance advantages if postgres had "hard" read-only
>> tables, i.e. tables which were guaranteed (by postgres) to
>> never have their data changed (insert/update/delete).
>>
>> This could potentially save disk space by allowing "book
>> keeping" elements in the page/tuple headers to be removed,
>> e.g. visibility information etc. Also, some indexes could
>> potentially be packed tighter if we know the data will never
>> change (of course this is already available using the
>> fillfactor control).
>
> Well, there is also CPU overhead in doing MVCC, but there are a
> few fundamental problems that must be overcome. The most
> significant is that no useful table is always read only,
> otherwise you could never load it.

Yes, that's why I was proposing ALTER TABLE... SET ARCHIVE rather than CREATE TABLE... ARCHIVE. (Although, for consistency, perhaps the CREATE TABLE would be allowed, it's just that you couldn't load anything into it until you did an ALTER TABLE... DROP ARCHIVE.)

> What do we do in the presence of a failure during the load, or
> a user-issued ABORT? I guess we'd truncate the table...

I was thinking the load simply couldn't happen if the table were SET ARCHIVE.

> What about replay after a crash?

No replay would be required on that table, as it would *NOT* be changed once a SET ARCHIVE were done (unless a DROP ARCHIVE were done).

> Another way of looking at it is: we use the 'bookkeeping'
> information in the tuple header for concurrency and for
> handling transaction aborts.

So, unless there's something I'm missing (not completely unlikely!), as long as the table (including its on-disk representation) was never changed, the bookkeeping information wouldn't be required?

>> The idea would be to introduce a statement something like:
>>
>>   ALTER TABLE foo SET ARCHIVE;
>
> I'd not thought of that approach. There are two problems: some
> archive tables are so large that loading them and then
> reprocessing them isn't appealing.

Yes - it would only work if you were prepared to wear the cost of the SET ARCHIVE, which could certainly be considerable.

...oh, I think I see what you were getting at above: you were thinking of loading the data into the already SET ARCHIVE table to avoid the considerable cost of rewriting the disk format etc. I hadn't considered that, but yes, if you were to allow that, I suppose in the presence of load errors or ABORTs etc. the table could simply be truncated. (For whatever value of "simply" is appropriate!)

> Secondly, we'd be rewriting the binary structure of the table,
> and this does not suit the non-overwriting nature of
> Postgres's storage system.

Rather than writing in place, perhaps the SET ARCHIVE would create an on-disk copy of the table. Of course this would demand you have twice the disk space available, which may be prohibitive in a large warehouse.
On the other hand, I'm not sure you would have a single humongous table that you'd SET ARCHIVE on; you might be as likely to archive on a weekly, yearly, or whatever-is-manageable basis, along the lines of:

  begin;
  select * into foo_2006 from foo
    where date_trunc('year', timestamp) = '2006-01-01';
  delete from foo
    where date_trunc('year', timestamp) = '2006-01-01';
  alter table foo_2006 set archive;
  alter table foo_2006 inherit foo;
  commit;

> A different approach discussed earlier involves greatly
> restricting the way in which the table is used. This table
> could only be written to if an exclusive lock is held; on
> error or ABORT, the table is truncated.

You're talking about the "no-WAL" concept? Not quite the same thing I think, but perhaps complementary to the ARCHIVE idea: I wouldn't expect an ARCHIVE table to need to generate any WAL entries, as it would be read only.

Cheers,

Chris.
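P.S. With the yearly tables hanging off foo via inheritance as above, I'd also expect each archived child could carry a CHECK constraint describing its date range, so that with constraint_exclusion enabled the planner can skip archived years entirely - a sketch, reusing the hypothetical foo/foo_2006 names and the timestamp column from the example above:

  alter table foo_2006 add constraint foo_2006_year
    check (date_trunc('year', timestamp) = '2006-01-01');
  set constraint_exclusion = on;
  -- the constraint contradicts this predicate, so the planner should be
  -- able to answer without touching foo_2006 at all
  select count(*) from foo
    where date_trunc('year', timestamp) = '2007-01-01';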
On 2/1/07, Chris Dunlop wrote:
> In maillist.postgres.dev, you wrote:
>> On Thu, 1 Feb 2007, Chris Dunlop wrote:
>>> The main idea is that there might be space utilisation and
>>> performance advantages if postgres had "hard" read-only
>>> tables, i.e. tables which were guaranteed (by postgres) to
>>> never have their data changed (insert/update/delete).
>>>
>>> This could potentially save disk space by allowing "book
>>> keeping" elements in the page/tuple headers to be removed,
>>> e.g. visibility information etc. Also, some indexes could
>>> potentially be packed tighter if we know the data will never
>>> change (of course this is already available using the
>>> fillfactor control).
>>
>> Well, there is also CPU overhead in doing MVCC, but there are a
>> few fundamental problems that must be overcome. The most
>> significant is that no useful table is always read only,
>> otherwise you could never load it.
>
> Yes, that's why I was proposing ALTER TABLE... SET ARCHIVE
> rather than CREATE TABLE... ARCHIVE. (Although, for
> consistency, perhaps the CREATE TABLE would be allowed, it's
> just that you couldn't load anything into it until you did an
> ALTER TABLE... DROP ARCHIVE.)

If you want to squeeze the most out of it, CLUSTER would need to be able to set the archive option too. (Otherwise you first rewrite the table when you cluster it, and then rewrite it again when you archive it.)

> Rather than writing in place, perhaps the SET ARCHIVE would
> create an on-disk copy of the table.

Just like CLUSTER does now: create an on-disk copy first, swap the relfilenodes of the files and flush the relcache.

> Of course this would demand
> you have twice the disk space available, which may be prohibitive
> in a large warehouse. On the other hand, I'm not sure you
> would have a single humongous table that you'd SET ARCHIVE on;
> you might be as likely to archive on a weekly, yearly, or
> whatever-is-manageable basis, along the lines of:
>
>   begin;
>   select * into foo_2006 from foo
>     where date_trunc('year', timestamp) = '2006-01-01';
>   delete from foo
>     where date_trunc('year', timestamp) = '2006-01-01';
>   alter table foo_2006 set archive;
>   alter table foo_2006 inherit foo;
>   commit;

Ideally you let most of it run outside a transaction:

  create table foo_2006 (like foo);
  insert into foo_2006 select * from foo where ...;
  cluster foo_2006 on ... ARCHIVE;
  begin;
  delete from foo where PK in (select PK from foo_2006);
  alter table foo_2006 inherit foo;
  commit;

> You're talking about the "no-WAL" concept? Not quite the same
> thing I think, but perhaps complementary to the ARCHIVE idea: I
> wouldn't expect an ARCHIVE table to need to generate any WAL
> entries, as it would be read only.

The WAL gains come automatically when data isn't changed. But there are additional advantages that can be obtained with archive tables:

- no need to vacuum them, not even for XID rollover (but you do need to vacuum their entries in the catalogs)
- index entries are always valid, so there's no need to check the heap (unless the index is lossy)
- you can force the fillfactor to 100% regardless of the setting
- more aggressive compression of toast tables

Jochem
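P.S. For the fillfactor point: with 8.2 you can already force full packing by hand when you know the data is static - a sketch, using the hypothetical foo_2006 table and PK placeholder column from above:

  -- pack index leaf pages completely; safe because the data won't change
  create index foo_2006_pk on foo_2006 (PK) with (fillfactor = 100);
  -- the heap fillfactor can likewise be forced before the final load
  alter table foo_2006 set (fillfactor = 100);

An archive option would just make that the default and guarantee it stays valid.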
On Thu, 2007-02-01 at 15:03 +1100, Chris Dunlop wrote:
> > A different approach discussed earlier involves greatly
> > restricting the way in which the table is used. This table
> > could only be written to if an exclusive lock is held; on
> > error or ABORT, the table is truncated.
>
> You're talking about the "no-WAL" concept? Not quite the same
> thing I think, but perhaps complementary to the ARCHIVE idea: I
> wouldn't expect an ARCHIVE table to need to generate any WAL
> entries, as it would be read only.

Setting an option to reduce the size of the row headers needs to be done before the data is loaded, not after. If you mark a table as APPEND-only it would be possible to save 4 bytes off the row header.

Saving a further 8 bytes off the row header by dropping the xmin and command id fields isn't a very workable solution, since those fields provide visibility within a transaction and across transactions. You'd end up with a table that could only be loaded by one transaction and only touched by one command within that transaction. That would require bookkeeping on the relation itself (in pg_class) and an AccessExclusiveLock. In those circumstances you would be able to save on writing WAL as well. You'd basically be saying that the table could only be loaded by a special utility and would be read-only once loaded.

I'm not crazy about those ideas, even though I think I suggested them some time back. They're very special-case and would probably require many functions to handle multiple cases, so the additional complexity and cost would affect all users.

My main doubt relates to the data you're storing. If the events you're interested in have so little information associated with them that they make a thin table, then storing them at all is questionable. I've been involved at the design stage of a number of data warehouses, and the amount of data eventually stored is typically < 10% of what people first thought of, with the number of rows decreasing drastically and the number of columns increasing slightly as people try to gain additional value from their data storage. So row overhead should be less of a concern.

It is possible to save on WAL by doing COPY LOCK or using a table with different resilience characteristics; both ideas are already on the TODO. There are also better, less intrusive ways of reducing data volume and improving load performance.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
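P.S. For reference, the COPY LOCK idea amounts to something like the sketch below - the LOCK clause is proposed TODO-list syntax only, nothing that exists today. Because the load would hold an exclusive lock and the table can simply be truncated on crash or abort, there is nothing for the WAL to protect:

  begin;
  -- proposed syntax: lock the table exclusively and skip WAL for the load
  copy foo_2006 from '/data/foo_2006.csv' lock;
  commit;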
Ühel kenal päeval, N, 2007-02-01 kell 13:24, kirjutas Gavin Sherry:

> A different approach discussed earlier involves greatly restricting the
> way in which the table is used. This table could only be written to if an
> exclusive lock is held; on error or ABORT, the table is truncated.
>
> The problem is that a lot of this looks like a hack, and I haven't seen a
> very clean approach which has gone beyond a basic brain dump.

A more radical variation of the "restricted-use archive table" approach is storing all tuple visibility info in a separate file.

At first it seems to just add overhead, but for lots (most?) use cases the separately stored visibility info should be highly compressible, so for example for bulk-loaded tables you could end up with one bit per page saying that all tuples on this page are visible.

Also, this could be used to speed up vacuums, as only the visibility table needs to be scanned during phase 1 of vacuum, and so tables with localised/moving hotspots can be vacuumed without scanning lots of static data.

Also, storing the whole visibility info, but in a separate heap, lifts all the restrictions of the "restricted-use archive table" variant. And the compression of the visibility info (mostly replacing per-tuple info with per-page info) can be carried out by a separate vacuum-like process.

And it has many of the benefits of static/RO tables, like space saving and index-only queries. Index-only scans will of course need to get the visibility info from the visibility heap, but if it is mostly heavily compressed, that will be a lot cheaper than random access to the data heap.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
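P.S. As an illustration only (plain SQL standing in for what would really be a separate file in the storage layer, and with invented names), the compressed per-page form of the visibility info could be thought of as:

  -- illustration: one visibility summary row per heap page of some table
  create table foo_visibility (
      blkno        bigint primary key,  -- heap page number
      all_visible  boolean not null     -- the "one bit per page" case
  );
  -- pages without a summary row would fall back to per-tuple visibility
  -- info, so writable hotspots keep working as they do today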
Ühel kenal päeval, N, 2007-02-01 kell 14:38, kirjutas Hannu Krosing:
> Ühel kenal päeval, N, 2007-02-01 kell 13:24, kirjutas Gavin Sherry:
>
> > A different approach discussed earlier involves greatly restricting the
> > way in which the table is used. This table could only be written to if an
> > exclusive lock is held; on error or ABORT, the table is truncated.
> >
> > The problem is that a lot of this looks like a hack, and I haven't seen a
> > very clean approach which has gone beyond a basic brain dump.
>
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.
>
> At first it seems to just add overhead, but for lots (most?) use cases
> the separately stored visibility info should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.
>
> Also, this could be used to speed up vacuums, as only the visibility
> table needs to be scanned during phase 1 of vacuum, and so tables with
> localised/moving hotspots can be vacuumed without scanning lots of
> static data.
>
> Also, storing the whole visibility info, but in a separate heap, lifts
> all the restrictions of the "restricted-use archive table" variant.
>
> And the compression of the visibility info (mostly replacing per-tuple
> info with per-page info) can be carried out by a separate vacuum-like
> process.
>
> And it has many of the benefits of static/RO tables, like space saving
> and index-only queries. Index-only scans will of course need to get the
> visibility info from the visibility heap, but if it is mostly heavily
> compressed, that will be a lot cheaper than random access to the data heap.

For tables with fixed-width tuples it can probably be extended to support vertical fragmentation as well, to get DWH benefits similar to http://monetdb.cwi.nl/ .

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
On Thu, 2007-02-01 at 14:38 +0200, Hannu Krosing wrote:
> Ühel kenal päeval, N, 2007-02-01 kell 13:24, kirjutas Gavin Sherry:
>
> > A different approach discussed earlier involves greatly restricting the
> > way in which the table is used. This table could only be written to if an
> > exclusive lock is held; on error or ABORT, the table is truncated.
> >
> > The problem is that a lot of this looks like a hack, and I haven't seen a
> > very clean approach which has gone beyond a basic brain dump.
>
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.
>
> At first it seems to just add overhead, but for lots (most?) use cases
> the separately stored visibility info should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.
>
> Also, this could be used to speed up vacuums, as only the visibility
> table needs to be scanned during phase 1 of vacuum, and so tables with
> localised/moving hotspots can be vacuumed without scanning lots of
> static data.
>
> Also, storing the whole visibility info, but in a separate heap, lifts
> all the restrictions of the "restricted-use archive table" variant.
>
> And the compression of the visibility info (mostly replacing per-tuple
> info with per-page info) can be carried out by a separate vacuum-like
> process.
>
> And it has many of the benefits of static/RO tables, like space saving
> and index-only queries. Index-only scans will of course need to get the
> visibility info from the visibility heap, but if it is mostly heavily
> compressed, that will be a lot cheaper than random access to the data heap.

I like that idea, as a non-default option, since in-line visibility is important for OLTP applications. This idea is sufficiently flexible to allow a range of use cases without requiring a complete removal of functionality. Read-mostly is an important use case for very large databases.

This is essentially the same thing as PhantomCommandId, just for the whole tuple header. It's something that would go on pg_class easily, just as WITH/WITHOUT OIDS has done.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
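P.S. Purely invented syntax, but the sort of per-table, pg_class-backed switch I have in mind would look no more exotic than the existing OIDs toggle:

  -- existing precedent: a per-table flag recorded in pg_class
  alter table foo_2006 set without oids;
  -- hypothetical analogue for moving visibility info out of tuple headers
  alter table foo_2006 set without inline_visibility;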
Hannu Krosing <hannu@skype.net> writes:
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.

> At first it seems to just add overhead, but for lots (most?) use cases
> the separately stored visibility info should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.

The more you compress, the slower and more complicated it will be to access the information. I'd put my money on this being a net loss in the majority of scenarios.

			regards, tom lane
Hannu Krosing wrote:
> ...is storing all tuple visibility info in a separate file.
>
> At first it seems to just add overhead, but for lots (most?) use cases
> the separately stored visibility info should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.

Seems you could do "one bit per page" compression even with the visibility data stored in the pages themselves. I could imagine a table-rewriting "vacuum freeze" that finds pages with all data visible to everyone and packs them with a single bit saying "everything here's frozen".

The penalty would be an expensive splitting of the page (and who knows what evil locks would be needed) if an update is ever done on those wholly frozen pages -- but we're talking about read-mostly tables here, so that tradeoff might not be bad.
Ühel kenal päeval, N, 2007-02-01 kell 12:31, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > A more radical variation of the "restricted-use archive table" approach
> > is storing all tuple visibility info in a separate file.
> > At first it seems to just add overhead, but for lots (most?) use cases
> > the separately stored visibility info should be highly compressible, so for
> > example for bulk-loaded tables you could end up with one bit per page
> > saying that all tuples on this page are visible.
>
> The more you compress, the slower and more complicated it will be to
> access the information. I'd put my money on this being a net loss in
> the majority of scenarios.

define "majority" :)

In real life it is often faster to access compressed information, especially if it is stored in something like a trie, where compression and indexed access are the same thing.

The most gain will of course come from bulk-loaded data, where the "compressed" representation can just say something like "pages 1 to 200000 are all visible starting from transaction 5000, and so is the first half of page 200001; the second half of 200001 and pages up to 250000 are visible from trx 6000". In this case the visibility info will always stay in L1 cache and thus be really fast, maybe even free if we account for cache reloads and such.

But it may be better to still have a bitmap there for the sake of simplicity, and have some of it be accessible from L2 cache (200000 pages times, say, 2 bits is still only a 50kB bitmap for 1.6GB of data).

Of course there are cases where this approach is worse, sometimes much worse, than the current one, but the possibility of independently compressing the visibility info and making some types of VACUUM vastly cheaper may make it a net win in several cases.

Also, for highly dynamic tables a separate visibility heap might also speed up bitmap scans, as access to the heap happens only for visible tuples.

This can also be one way to get rid of the need to write full data tuples thrice - first the original write, then the commit bits and then the deleted bits. Instead we can just write the bits once for a whole set of tuples.

Usually most of the visibility info can be thrown out quite soon, as the active transaction window advances, so the first level of "compression" is just throwing out cmin/cmax and setting the commit bit, then setting the tuple to just "visible".

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
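P.S. Again as an SQL illustration only of the run-length encoded form for the bulk-loaded case (a real implementation would be a compact binary structure, not a table, and the names are invented):

  -- illustration: run-length encoded visibility, one row per run of pages
  create table foo_vis_runs (
      first_blkno   bigint not null,  -- first heap page of the run
      last_blkno    bigint not null,  -- last heap page of the run
      visible_from  xid not null      -- all tuples visible from this xid on
  );
  -- the bulk-load example above collapses to essentially two rows
  -- (glossing over the split page 200001):
  --   (1, 200000, '5000') and (200001, 250000, '6000')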
Jochem van Dieten wrote:
> On 2/1/07, Chris Dunlop wrote:
>> In maillist.postgres.dev, you wrote:
>> Rather than writing in place, perhaps the SET ARCHIVE would
>> create an on-disk copy of the table.
>
> Just like CLUSTER does now: create an on-disk copy first, swap the
> relfilenodes of the files and flush the relcache.

IIRC, CLUSTER currently needs to take an exclusive lock on the table, thus preventing any concurrent selects. I assume it would be the same for "alter table ... set archive". For a large read-only table - the ones that "set archive" would be used for - rewriting the whole table might easily take a few hours, if not days. Blocking reads for such a long time might be unacceptable in a lot of environments, severely limiting the use cases for "alter table ... set archive".

I think that both "cluster" and "truncate" should in theory only need to prevent concurrent updates, not concurrent selects. AFAIK, the reason they need to take an exclusive lock is that there is no way to let other backends see the "old" relfilenode entry in pg_class until the cluster/truncate commits. So I believe that this limitation would first have to be removed before "alter table ... set archive" would become really useful...

Just my 0.02 eurocents.

greetings, Florian Pflug
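P.S. The blocking behaviour is easy to reproduce with two sessions, should anyone want to see it - the table and index names are placeholders, and this uses the current "CLUSTER index ON table" spelling:

  -- session 1: rewrites the table, holding an exclusive lock until commit
  begin;
  cluster foo_pkey on foo;

  -- session 2: even a plain read now blocks until session 1 commits
  select count(*) from foo;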