Thread: Data archiving/warehousing idea

Data archiving/warehousing idea

From: Chris Dunlop

G'day hackers,

I had some hand-wavy thoughts about some potential gains for
postgres in the data archiving/warehousing area.  I'm not able
to do any work myself on this, and don't actually have a
pressing need for it so I'm not "requesting" someone do it, but
I thought it might be worth discussing (if it hasn't been
already - I couldn't find anything in the mail archives, but
that doesn't mean it's not there...)

The main idea is that there might be space utilisation and
performance advantages if postgres had "hard" read-only tables,
i.e. tables which were guaranteed (by postgres) to never have
their data changed (insert/update/delete). 

This could potentially save disk space by allowing "book
keeping" elements in the page/tuple headers to be removed, e.g.
visibility information etc.  Also, some indexes could
potentially be packed tighter if we know the data will never
change (of course this is already available using the fillfactor
control).
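
For example, with the existing control (index and column names
invented for illustration):

  CREATE INDEX foo_ts_idx ON foo (ts) WITH (fillfactor = 100);
  -- or, for an existing index:
  ALTER INDEX foo_ts_idx SET (fillfactor = 100);
  REINDEX INDEX foo_ts_idx;  -- rebuild so the new setting applies throughout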

There could be performance advantages from areas like:
 * more efficient disk buffering due to reduced disk space
   requirements per above.
 * no need to visit tuple store for visibility info during index
   scan.
 * greatly reduced or even completely removed locking.  If the
   table is guaranteed read-only, there's no need to lock?
 * Planner optimisation?  E.g. changing the cost of index and
   sequential scans for the table due to the previous points, and
   there might be table stats which would be very useful to the
   planner but which are too expensive to maintain for changing
   data.

The idea would be to introduce a statement something like:
 ALTER TABLE foo SET ARCHIVE;

This would tell postgres to rewrite the on-disk table to the
"read only" format, rewrite the indexes for maximum packing
and collect stats for the planner etc.
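
In terms of existing commands, the intent is roughly the following
(minus the format change itself, which is the new part):

  VACUUM FREEZE foo;  -- settle visibility for all existing rows
  REINDEX TABLE foo;  -- repack the indexes
  ANALYZE foo;        -- collect stats for the planner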

Thoughts?  Brickbats?


Cheers,

Chris.


Re: Data archiving/warehousing idea

From: Gavin Sherry

On Thu, 1 Feb 2007, Chris Dunlop wrote:

> G'day hackers,

G'Day Chris,

> already - I couldn't find anything in the mail archives, but
> that doesn't mean it's not there...)

There has been a lot of discussion about this kind of thing over the
years.

> The main idea is that there might be space utilisation and
> performance advantages if postgres had "hard" read-only tables,
> i.e. tables which were guaranteed (by postgres) to never have
> their data changed (insert/update/delete).
>
> This could potentially save disk space by allowing "book
> keeping" elements in the page/tuple headers to be removed, e.g.
> visibility information etc.  Also, some indexes could
> potentially be packed tighter if we know the data will never
> change (of course this is already available using the fillfactor
> control).

Well, there is also CPU overhead doing MVCC but there are a few
fundamental problems that must be overcome. The most significant is that
no useful table is always read only, otherwise you could never load it.
What do we do in the presence of a failure during the load or a
user-issued ABORT? I guess we'd truncate the table... What about
replay after a crash?

Another way of looking at it is, we use the 'bookkeeping' information in
the tuple header for concurrency and for handling the abortion of the
transaction.
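
For concreteness, that bookkeeping is what the system columns expose
on every row - i.e. what a read-only format would be dropping:

  SELECT xmin, xmax, cmin, cmax, ctid FROM foo LIMIT 1;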

> The idea would be to introduce a statement something like:
>
>   ALTER TABLE foo SET ARCHIVE;

I'd not thought of that approach. There are two problems: first, some
archive tables are so large that loading them and then reprocessing
them isn't appealing. Secondly, we'd be rewriting the binary structure
of the table
and this does not suit the non-overwriting nature of Postgres's storage
system.

A different approach discussed earlier involves greatly restricting the
way in which the table is used. This table could only be written to if an
exclusive lock is held; on error or ABORT, the table is truncated.

The problem is that a lot of this looks like a hack and I haven't seen a
very clean approach which has gone beyond basic brain dump.

Thanks,

Gavin


Re: Data archiving/warehousing idea

From: Chris Dunlop

G'day Gavin,

In maillist.postgres.dev, you wrote:
> On Thu, 1 Feb 2007, Chris Dunlop wrote:
>> The main idea is that there might be space utilisation and
>> performance advantages if postgres had "hard" read-only
>> tables, i.e.  tables which were guaranteed (by postgres) to
>> never have their data changed (insert/update/delete).
>>
>> This could potentially save disk space by allowing "book
>> keeping" elements in the page/tuple headers to be removed,
>> e.g.  visibility information etc.  Also, some indexes could
>> potentially be packed tighter if we know the data will never
>> change (of course this is already available using the
>> fillfactor control).
>
> Well, there is also CPU overhead doing MVCC but there are a
> few fundamental problems that must be overcome. The most
> significant is that no useful table is always read only,
> otherwise you could never load it.

Yes, that's why I was proposing ALTER TABLE... SET ARCHIVE
rather than CREATE TABLE... ARCHIVE.  (Although, for
consistency, perhaps the CREATE TABLE would be allowed, it's
just that you couldn't load anything into it until you did an
ALTER TABLE... DROP ARCHIVE.)
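
Something like this, say (entirely hypothetical syntax, with made-up
columns):

  CREATE TABLE foo_2006 (ts timestamp, event text) ARCHIVE;
  -- born read-only, so to load it:
  ALTER TABLE foo_2006 DROP ARCHIVE;
  COPY foo_2006 FROM '/data/foo_2006.csv';
  ALTER TABLE foo_2006 SET ARCHIVE;  -- rewrite, repack, collect stats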

> What do we do in the presence of a failure during the load or
> a user-issued ABORT? I guess we'd truncate the table...

I was thinking the load simply couldn't happen if the table were
SET ARCHIVE.

> What about replay after a crash?

No replay would be required on that table as it would *NOT* be
changed once a SET ARCHIVE were done (unless a DROP ARCHIVE were
done).

> Another way of looking at it is, we use the 'bookkeeping'
> information in the tuple header for concurrency and for
> handling the abortion of the transaction.

So, unless there's something I'm missing (not completely
unlikely!), as long as the table (including its on-disk
representation) was never changed, the bookkeeping
information wouldn't be required?

>> The idea would be to introduce a statement something like:
>>
>>   ALTER TABLE foo SET ARCHIVE;
>
> I'd not thought of that approach. There are two problems: some
> archive tables are so large that loading them and then
> reprocessing them isn't appealing.

Yes - it would only work if you were prepared to wear the cost
of the SET ARCHIVE, which could certainly be considerable.

...oh, I think I see what you were getting at above: you were
thinking of loading the data into the already SET ARCHIVE table
to avoid the considerable cost of rewriting the disk format etc.
I hadn't considered that, but yes, if you were to allow that I
suppose in the presence of load errors or ABORTS etc. the table
could simply be truncated.  (For whatever value of "simply" is
appropriate!)

> Secondly, we'd be rewriting the binary structure of the table
> and this does not suit the non-overwriting nature of
> Postgres's storage system.

Rather than writing in-place, perhaps the SET ARCHIVE would
create an on-disk copy of the table.  Of course this would demand
you have twice the disk space available which may be prohibitive
in a large warehouse.  On the other hand, I'm not sure if you
would have a single humongous table that you'd SET ARCHIVE on,
you might be as likely to archive on a weekly or yearly or
whatever is manageable basis, along the lines of:
  begin;
  select * into foo_2006 from foo
    where date_trunc('year', timestamp) = '2006-01-01';
  delete from foo
    where date_trunc('year', timestamp) = '2006-01-01';
  alter table foo_2006 set archive;
  alter table foo_2006 inherit foo;
  commit;

> A different approach discussed earlier involves greatly
> restricting the way in which the table is used. This table
> could only be written to if an exclusive lock is held; on
> error or ABORT, the table is truncated.

You're talking about the "no-WAL" concept?  Not quite the same
thing I think, but perhaps complementary to the ARCHIVE idea: I
wouldn't expect an ARCHIVE table to need to generate any WAL
entries as it would be read only.

Cheers,

Chris.


Re: Data archiving/warehousing idea

From: Jochem van Dieten

On 2/1/07, Chris Dunlop wrote:
> In maillist.postgres.dev, you wrote:
>> On Thu, 1 Feb 2007, Chris Dunlop wrote:
>>> The main idea is that there might be space utilisation and
>>> performance advantages if postgres had "hard" read-only
>>> tables, i.e.  tables which were guaranteed (by postgres) to
>>> never have their data changed (insert/update/delete).
>>>
>>> This could potentially save disk space by allowing "book
>>> keeping" elements in the page/tuple headers to be removed,
>>> e.g.  visibility information etc.  Also, some indexes could
>>> potentially be packed tighter if we know the data will never
>>> change (of course this is already available using the
>>> fillfactor control).
>>
>> Well, there is also CPU overhead doing MVCC but there are a
>> few fundamental problems that must be overcome. The most
>> significant is that no useful table is always read only,
>> otherwise you could never load it.
>
> Yes, that's why I was proposing ALTER TABLE... SET ARCHIVE
> rather than CREATE TABLE... ARCHIVE.  (Although, for
> consistency, perhaps the CREATE TABLE would be allowed, it's
> just that you couldn't load anything into it until you did a
> ALTER TABLE... DROP ARCHIVE.)

If you want to squeeze the most out of it, CLUSTER would need to be
able to set the archive option too. (Otherwise you first rewrite the
table when you cluster it, and then rewrite it again when you archive
it.)


> Rather than writing in-place, perhaps the SET ARCHIVE would
> create an on-disk copy of the table.

Just like CLUSTER does now: create an on-disk copy first and swap the
relfilenodes of the files and flush the relcache.
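
You can watch that happen today (8.2-era CLUSTER syntax; the index
name is assumed):

  SELECT relfilenode FROM pg_class WHERE relname = 'foo_2006';
  CLUSTER foo_2006_pkey ON foo_2006;  -- rewrite into a new file
  SELECT relfilenode FROM pg_class WHERE relname = 'foo_2006';
  -- the second SELECT returns a different relfilenode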


> Of course this would demand
> you have twice the disk space available which may be prohibitive
> in a large warehouse.  On the other hand, I'm not sure if you
> would have a single humongous table that you'd SET ARCHIVE on,
> you might be as likely to archive on a weekly or yearly or
> whatever is manageable basis, along the lines of:
>
>   begin;
>   select * into foo_2006 from foo
>     where date_trunc('year', timestamp) = '2006-01-01';
>   delete from foo
>     where date_trunc('year', timestamp) = '2006-01-01';
>   alter table foo_2006 set archive;
>   alter table foo_2006 inherit foo;
>   commit;

Ideally you let most of it run outside a transaction:

create table foo_2006 (like foo);
insert into foo_2006 select * from foo where ...;
cluster foo_2006 on ... ARCHIVE;  -- hypothetical ARCHIVE option
begin;
delete from foo where PK in (select PK from foo_2006);
alter table foo_2006 inherit foo;
commit;


> You're talking about the "no-WAL" concept?  Not quite the same
> thing I think, but perhaps complementary to the ARCHIVE idea: I
> wouldn't expect an ARCHIVE table to need to generate any WAL
> entries as it would be read only.

The WAL gains come automatically when data isn't changed. But there
are additional advantages that can be obtained with archive tables:
- no need to vacuum them, not even for XID rollover (but you do need
to vacuum their entries in the catalogs)
- index entries are always valid so no need to check the heap (unless
the index is lossy)
- you can force the fillfactor to 100% regardless of the setting
- more aggressive compression of toast tables

Jochem


Re: Data archiving/warehousing idea

From: Simon Riggs

On Thu, 2007-02-01 at 15:03 +1100, Chris Dunlop wrote:

> > A different approach discussed earlier involves greatly
> > restricting the way in which the table is used. This table
> > could only be written to if an exclusive lock is held; on
> > error or ABORT, the table is truncated.
> 
> You're talking about the "no-WAL" concept?  Not quite the same
> thing I think, but perhaps complementary to the ARCHIVE idea: I
> wouldn't expect an ARCHIVE table to need to generate any WAL
> entries as it would be read only.

Setting an option to reduce the size of the row headers needs to be
done before the table is loaded, not after. If you mark a table as
APPEND-only it would be possible to save 4 bytes off the row header.

Saving a further 8 bytes off the row header by dropping the xmin and
commandid fields isn't a very workable solution since those fields
provide visibility within a transaction and across transactions. You'd
end up with a table that could only be loaded by one transaction and
only touched by one command within that transaction. That would require
bookkeeping on the relation itself (in pg_class) and an
AccessExclusiveLock. In those circumstances you would be able to save on
writing WAL as well. You'd be basically saying that it could only be
loaded by a special utility and would be read-only once loaded.

I'm not crazy about those ideas, even though I think I suggested them
some time back. They're very special case and would probably require
many functions to handle multiple cases, so that additional complexity
and cost would affect all users.

My main doubt relates to the data you're storing. If the events you're
interested in have so little information associated with them that they
make a thin table, then storing them at all is questionable. I've been
involved at the design stage of a number of data warehouses, and the
amount of data eventually stored is typically < 10% of what people
first thought of, with the number of rows decreasing drastically and
the number of columns increasing slightly as people try to gain
additional value from their data storage. So row overhead should be
less of a concern.

It is possible to save on WAL by doing COPY LOCK or by using a table
with different resilience characteristics; both ideas are already on
the TODO. There are also better, less intrusive ways of reducing data
volume and improving load performance.

-- 
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




A more general approach (Re: Data archiving/warehousing idea)

From: Hannu Krosing

One fine day, Thu, 2007-02-01 at 13:24, Gavin Sherry wrote:

> A different approach discussed earlier involves greatly restricting the
> way in which the table is used. This table could only be written to if an
> exclusive lock is held; on error or ABORT, the table is truncated.
> 
> The problem is that a lot of this looks like a hack and I haven't seen a
> very clean approach which has gone beyond basic brain dump.

A more radical variation of the "restricted-use archive table" approach
is storing all tuple visibility info in a separate file.

At first it seems to just add overhead, but for lots (most?) of use cases
the separately stored visibility should be highly compressible, so for
example for bulk-loaded tables you could end up with one bit per page
saying that all tuples on this page are visible.

Also this could be used to speed up vacuums, as only the visibility
table needs to be scanned during phase 1 of vacuum, and so tables with
localised/moving hotspots can be vacuumed without scanning lots of
static data.

Also, storing the whole visibility info, but in a separate heap, lifts
all restrictions of the "restricted-use archive table" variant. 

And the compression of visibility info (mostly replacing per-tuple info
with per-page info) can be carried out by a separate vacuum-like
process.

And it has many of the benefits of static/RO tables, like space saving
and index-only queries. Index-only will of course need to get the
visibility info from visibility heap, but if it is mostly heavily
compressed, it will be a lot cheaper than random access to data heap.
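
A purely illustrative sketch of the shape of such a side relation (the
real thing would live below the catalog level, not as a user table):

  -- one row per heap page; table and column names invented
  create table foo_visibility (
      pageno      int  primary key, -- heap page number
      all_visible bool not null     -- "every tuple on this page is visible"
  );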

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: A more general approach (Re: Data archiving/warehousing idea)

From: Hannu Krosing

One fine day, Thu, 2007-02-01 at 14:38, Hannu Krosing wrote:
> One fine day, Thu, 2007-02-01 at 13:24, Gavin Sherry wrote:
> 
> > A different approach discussed earlier involves greatly restricting the
> > way in which the table is used. This table could only be written to if an
> > exclusive lock is held; on error or ABORT, the table is truncated.
> > 
> > The problem is that a lot of this looks like a hack and I haven't seen a
> > very clean approach which has gone beyond basic brain dump.
> 
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.
> 
> At first it seems to just add overhead, but for lots (most?) of use cases
> the separately stored visibility should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.
> 
> Also this could be used to speed up vacuums, as only the visibility
> table needs to be scanned during phase 1 of vacuum, and so tables with
> localised/moving hotspots can be vacuumed without scanning lots of
> static data.
> 
> Also, storing the whole visibility info, but in a separate heap, lifts
> all restrictions of the "restricted-use archive table" variant. 
> 
> And the compression of visibility info (mostly replacing per-tuple info
> with per-page info) can be carried out by a separate vacuum-like
> process.
> 
> And it has many of the benefits of static/RO tables, like space saving
> and index-only queries. Index-only will of course need to get the
> visibility info from visibility heap, but if it is mostly heavily
> compressed, it will be a lot cheaper than random access to data heap.

For tables with fixed-width tuples it can probably be extended to
support vertical fragmentation as well, to get DWH benefits similar to
http://monetdb.cwi.nl/ .

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: A more general approach (Re: Data archiving/warehousing idea)

From: Simon Riggs

On Thu, 2007-02-01 at 14:38 +0200, Hannu Krosing wrote:
> One fine day, Thu, 2007-02-01 at 13:24, Gavin Sherry wrote:
>
> > A different approach discussed earlier involves greatly restricting the
> > way in which the table is used. This table could only be written to if an
> > exclusive lock is held; on error or ABORT, the table is truncated.
> >
> > The problem is that a lot of this looks like a hack and I haven't seen a
> > very clean approach which has gone beyond basic brain dump.
>
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.
>
> At first it seems to just add overhead, but for lots (most?) of use cases
> the separately stored visibility should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.
>
> Also this could be used to speed up vacuums, as only the visibility
> table needs to be scanned during phase 1 of vacuum, and so tables with
> localised/moving hotspots can be vacuumed without scanning lots of
> static data.
>
> Also, storing the whole visibility info, but in a separate heap, lifts
> all restrictions of the "restricted-use archive table" variant.
>
> And the compression of visibility info (mostly replacing per-tuple info
> with per-page info) can be carried out by a separate vacuum-like
> process.
>
> And it has many of the benefits of static/RO tables, like space saving
> and index-only queries. Index-only will of course need to get the
> visibility info from visibility heap, but if it is mostly heavily
> compressed, it will be a lot cheaper than random access to data heap.

I like that idea, as a non-default option, since in-line visibility is
important for OLTP applications.

This idea is sufficiently flexible to allow a range of use cases without
requiring a complete removal of functionality. Read-mostly is an
important use case for very large databases.

This is essentially the same thing as PhantomCommandId, just for the
whole tuple header. It's something that would go on pg_class easily,
just as WITH/WITHOUT OIDS has done.
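
By analogy, that existing flag is just a column on pg_class:

  -- real catalog column: does the relation have OIDs?
  SELECT relname, relhasoids FROM pg_class WHERE relname = 'foo';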

-- 
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: A more general approach (Re: Data archiving/warehousing idea)

From: Tom Lane

Hannu Krosing <hannu@skype.net> writes:
> A more radical variation of the "restricted-use archive table" approach
> is storing all tuple visibility info in a separate file.
> At first it seems to just add overhead, but for lots (most?) of use cases
> the separately stored visibility should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.

The more you compress, the slower and more complicated it will be to
access the information.  I'd put my money on this being a net loss in
the majority of scenarios.
        regards, tom lane


Re: A more general approach (Re: Data archiving/warehousing idea)

From: Ron Mayer

Hannu Krosing wrote:
> ...is storing all tuple visibility info in a separate file.
> 
> At first it seems to just add overhead, but for lots (most?) of use cases
> the separately stored visibility should be highly compressible, so for
> example for bulk-loaded tables you could end up with one bit per page
> saying that all tuples on this page are visible.

Seems you could do "one bit per page" compression even with
visibility data stored in the pages themselves.

I could imagine a table re-writing "vacuum freeze" that finds
pages with all data visible to everyone and packs them with
a single bit saying "everything here's frozen".

The penalty would be an expensive splitting of the page (and
who knows what evil locks would be needed) if an update is
ever done on those wholly frozen pages -- but we're talking
about read-mostly tables here so that tradeoff might not be bad.


Re: A more general approach (Re: Data archiving/warehousing idea)

From: Hannu Krosing

One fine day, Thu, 2007-02-01 at 12:31, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > A more radical variation of the "restricted-use archive table" approach
> > is storing all tuple visibility info in a separate file.
> > At first it seems to just add overhead, but for lots (most?) of use cases
> > the separately stored visibility should be highly compressible, so for
> > example for bulk-loaded tables you could end up with one bit per page
> > saying that all tuples on this page are visible.
> 
> The more you compress, the slower and more complicated it will be to
> access the information.  I'd put my money on this being a net loss in
> the majority of scenarios.

define "majority" :)

In real life it is often faster to access compressed information,
especially if it is stored in something like a trie, where compression
and indexed access are the same thing.

The most gain will of course come from bulk-loaded data, where the
"compressed" representation can just say something like "pages 1 to
200000 are all visible starting from transaction 5000 and so is the
first half of page 200001; the second half of 200001 and pages up to
250000 are visible from trx 6000".

In this case the visibility info will always stay in L1 cache and thus
be really fast, maybe even free if we account for cache reloads and
such.

But it may be better to still have a bitmap there for the sake of
simplicity and have some of it be accessible from L2 cache (200000
pages times say 2 bits is still only a 50kB bitmap for 1.6GB of data).

Of course there are cases where this approach is worse, sometimes much
worse, than the current one, but the possibility of independently
compressing visibility info and making some types of VACUUM vastly
cheaper may make it a net win in several cases.

Also, for highly dynamic tables a separate visibility heap might also
speed up bitmap scans, as access to the heap happens only for visible
tuples.

This can also be one way to get rid of the need to write full data
tuples thrice - first the original write, then commit bits and then
deleted bits.  Instead we can just write the bits once for a whole set
of tuples.

Usually most of the visibility info can be thrown out quite soon, as
the active transaction window advances, so the first level of
"compression" is just throwing out cmin/cmax and setting the commit
bit, then setting the tuple to just "visible".

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Data archiving/warehousing idea

From: Florian G. Pflug

Jochem van Dieten wrote:
> On 2/1/07, Chris Dunlop wrote:
>> In maillist.postgres.dev, you wrote:
>> Rather than writing in-place, perhaps the SET ARCHIVE would
>> create an on-disk copy of the table.
> 
> Just like CLUSTER does now: create an on-disk copy first and swap the
> relfilenodes of the files and flush the relcache.

IIRC, cluster currently needs to take an exclusive lock on the table,
thus preventing any concurrent selects. I assume it would be the same
for "alter table ... set archive".
For a large readonly table - the ones that "set archive" would be used
for - rewriting the whole table might easily take a few hours, if not
days. Blocking reads for such a long time might be unacceptable in a
lot of environments, severely limiting the use-cases for "alter table
... set archive".

I think that both "cluster" and "truncate" should in theory only need
to prevent concurrent updates, not concurrent selects. AFAIK, the
reason they need to take an exclusive lock is that there is no way to
let other backends see the "old" relfilenode entry in pg_class until
the cluster/truncate commits. So I believe that this limitation would
first have to be removed before "alter table ... set archive" would
become really useful...

Just my 0.02 eurocents.
greetings, Florian Pflug