Thread: pg_rawdump
I spent yesterday writing a new tool, pg_rawdump (which will be released as open source in due course), which takes the table files of an arbitrary PostgreSQL database and is able to transform those back into tables (including TOAST values). In the course of doing this (a customer needed it because he only had a copy of those files; the pg_clog etc. dirs were lost), I noticed that there are two problems which complicate recovery (needlessly):

a. It takes some guesswork to find out which file corresponds with which table.
b. In order to recover the table content, it is (obviously) necessary to correlate with the original table definition, which obviously is lost with the rest of the information.

In order to simplify recovery at this point (enormously), it would have been very helpful (at almost negligible cost) to have the name of the table, the name of the columns, and the types of the columns available.

Why don't we insert that data into the first page of a regular table file, in the special data area?

I'd be willing to create a patch for that (should be pretty easy), if nobody considers it to be a bad idea.
--
Stephen.
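For context on the "special data area" mentioned above: every page already records the offset of its special space in the page header, and on plain heap pages that space is currently empty. A rough standalone sketch of how a recovery tool in the spirit of pg_rawdump could inspect it; the struct mirrors PageHeaderData from src/include/storage/bufpage.h, assumes the default 8k block size, must match the server version being recovered, and is illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ 8192                 /* default PostgreSQL block size */

    /* Mirrors the fixed part of PageHeaderData (storage/bufpage.h);
     * field order must match the server version being recovered. */
    typedef struct PageHeader {
        uint64_t pd_lsn;                /* LSN of last change to the page */
        uint16_t pd_checksum;           /* pd_tli on older servers */
        uint16_t pd_flags;
        uint16_t pd_lower;              /* offset to start of free space */
        uint16_t pd_upper;              /* offset to end of free space */
        uint16_t pd_special;            /* offset to start of special space */
        uint16_t pd_pagesize_version;
        uint32_t pd_prune_xid;
    } PageHeader;

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <table-file>\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }

        union { PageHeader hdr; unsigned char bytes[BLCKSZ]; } page;
        if (fread(page.bytes, 1, BLCKSZ, f) != BLCKSZ) {
            fprintf(stderr, "short read on first page\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        /* On today's heap pages pd_special normally equals BLCKSZ (empty
         * special space); the proposal would carve the metadata out of it. */
        printf("pd_lower=%d pd_upper=%d pd_special=%d (special space: %d bytes)\n",
               page.hdr.pd_lower, page.hdr.pd_upper, page.hdr.pd_special,
               BLCKSZ - page.hdr.pd_special);
        return 0;
    }

On current heap pages the reported special space will be 0 bytes; the proposal in this thread is about reserving part of it.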
On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> In order to simplify recovery at this point (enormously), it would
> have been very helpful (at almost negligible cost), to have the name
> of the table, the name of the columns, and the types of the
> columns available.
>
> Why don't we insert that data into the first page of a regular table
> file, in the special data area?
>
> I'd be willing to create a patch for that (should be pretty easy),
> if nobody considers it to be a bad idea.

There isn't necessarily one value for these attributes. You can rename columns and that rename may succeed and commit or fail and rollback. You can drop or add columns and some rows will have or not have the added columns at all. You could even add a column, insert some rows, then abort -- all in a transaction. So some (aborted) rows will have extra columns that aren't even present in the current table definition.

All this isn't to say the idea you're presenting is impossible or a bad idea. If this meta information was only a hint for forensic purposes and you take into account these caveats it might still be useful. But I'm not sure how useful. I mean, you can't really decipher everything properly without the data in the catalog -- and you have to premise this on the idea that you've lost everything in the catalog but not the data in other tables. Which seems like a narrow use case.

--
greg
Greg Stark wrote:
>On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> In order to simplify recovery at this point (enormously), it would
>> have been very helpful (at almost negligible cost), to have the name
>> of the table, the name of the columns, and the types of the
>> columns available.
>> Why don't we insert that data into the first page of a regular table
>> file, in the special data area?
>> I'd be willing to create a patch for that (should be pretty easy),
>> if nobody considers it to be a bad idea.

>There isn't necessarily one value for these attributes. You can
>rename columns and that rename may succeed and commit or fail and
>rollback. You can drop or add columns and some rows will have or not
>have the added columns at all. You could even add a column, insert
>some rows, then abort -- all in a transaction. So some (aborted) rows
>will have extra columns that aren't even present in the current table
>definition.

True.

>All this isn't to say the idea you're presenting is impossible or a
>bad idea. If this meta information was only a hint for forensic
>purposes and you take into account these caveats it might still be
>useful.

This is exactly what I meant it for. It would contain the most recent ALTER TABLE state (give or take some delay due to cache flushes).

>But I'm not sure how useful. I mean, you can't really decipher
>everything properly without the data in the catalog -- and you have to
>premise this on the idea that you've lost everything in the catalog
>but not the data in other tables. Which seems like a narrow use case.

It happens, more often than you'd think. My client had it, I've seen numerous google hits which show the same.
--
Stephen.
Be braver. You cannot cross a chasm in two small jumps.

"Stephen R. van den Berg" <srb@cuci.nl> writes:
> In order to simplify recovery at this point (enormously), it would
> have been very helpful (at almost negligible cost), to have the name
> of the table, the name of the columns, and the types of the
> columns available.
> Why don't we insert that data into the first page of a regular table
> file, in the special data area?

(1) it wouldn't necessarily fit

(2) what are you going to do to maintain it during ALTER TABLE?

(3) if there are any custom types involved, you're still lost.

regards, tom lane
On Tue, Oct 19, 2010 at 6:13 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> Greg Stark wrote:
>
>> But I'm not sure how useful. I mean, you can't really decipher
>> everything properly without the data in the catalog -- and you have to
>> premise this on the idea that you've lost everything in the catalog
>> but not the data in other tables. Which seems like a narrow use case.
>
> It happens, more often than you'd think. My client had it, I've
> seen numerous google hits which show the same.

It happened to us recently when a customer had disk issues. We were able to get the table files back through forensics, but the control files were not in good shape and the cluster wouldn't start. A tool like the one Stephen is proposing would most likely have helped us recover at least some or most of the data, I would hope.

Roberto
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> In order to simplify recovery at this point (enormously), it would
>> have been very helpful (at almost negligible cost), to have the name
>> of the table, the name of the columns, and the types of the
>> columns available.
>> Why don't we insert that data into the first page of a regular table
>> file, in the special data area?

>(1) it wouldn't necessarily fit

Three viable options (pick any one, depending on other criteria):
a. If it doesn't fit at first, drop the column names and try again.
b. If necessary, extend it into the special data area of the following page (repeat until you have stored everything).
c. Simply put in what fits and discard the rest.

Please note that the information in there is purely informational and of a best-effort nature. It is not required for regular operation. It should have close to no performance impact in the normal use case. It is meant to help with forensics if the catalog is damaged or lost and you still want to attempt to recover most of the data contained in this table file.

>(2) what are you going to do to maintain it during ALTER TABLE?

Simply pick a point in time where it will be blatantly overwritten: either at reception of the command, or at commit time, whatever is more convenient to implement. It's a best-effort service; it is not a problem if we get it wrong sometimes due to wildly overlapping alter table/commit/rollback sequences.

>(3) if there are any custom types involved, you're still lost.

Yes, this is not a complete solution; it's a valid attempt at making forensics a lot easier in the common case at virtually no cost to the running system.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
Roberto Mello wrote:
>On Tue, Oct 19, 2010 at 6:13 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> Greg Stark wrote:
>>> premise this on the idea that you've lost everything in the catalog
>>> but not the data in other tables. Which seems like a narrow use case.
>> It happens, more often than you'd think. My client had it, I've
>> seen numerous google hits which show the same.
>It happened to us recently when a customer had disk issues, and we

It usually happens when there are disk issues; that's exactly what it is for.

>A tool like Stephen is proposing would most likely have helped us
>recover at least some or most of the data, I would hope.

Well, because the customer could recreate (within reason) the original table definitions, we were able to recover all of his data (12 tables, including some toasted/compressed ones). It's just that matching table and file, and subsequently figuring out some missing columns which may have been added/removed later, can be rather time-consuming and could be made a lot easier (not necessarily perfect) if that information had been present in the first page of a file.
--
Stephen.
Life is that brief interlude between nothingness and eternity.

"Stephen R. van den Berg" <srb@cuci.nl> writes:
> It's just that matching table and file, and subsequently figuring out
> some missing columns which may have been added/removed later,
> can be rather time-consuming and could be made a lot easier (not necessarily
> perfect) if that information had been present in the first page of
> a file.

So you've already moved the goalposts from what was claimed in your prior message. If the data is not maintained (with 100% reliability) during ALTER TABLE, how are you going to do something like "figure out missing columns"?

I can see the potential usefulness of a self-documenting table storage format, but this proposal isn't that; it's just an unreliable kluge.

regards, tom lane
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> It's just that matching table and file, and subsequently figuring out
>> some missing columns which may have been added/removed later,
>> can be rather time-consuming and could be made a lot easier (not necessarily
>> perfect) if that information had been present in the first page of
>> a file.

>So you've already moved the goalposts from what was claimed in your
>prior message. If the data is not maintained (with 100% reliability)
>during ALTER TABLE, how are you going to do something like "figure out
>missing columns"?

Most alter table operations are well thought through and rarely undone (at least not on production databases). This means that most tables can be restored.

>I can see the potential usefulness of a self-documenting table storage
>format, but this proposal isn't that; it's just an unreliable kluge.

Restoring tables/databases from table storage only is, by definition, an unreliable kludge. I'm not opposed to making the definition storage more robust, but, since the records in the table have already lost their relation to the pg_clog records, and therefore it *already* is uncertain which records were deleted and/or have the wrong number of columns, it seems to be a needless waste of time and energy to provide more reliable information about the column structure.

I know for a fact that those who have lost data in such a way, and are faced with the option of having this "unreliable kludgy information" available now, or waiting a few years/months until a reliable solution is present, would (in every single case) opt for the former and get at least some (if not all) of their data back in a shorter amount of time.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
On Wed, Oct 20, 2010 at 10:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
>> It's just that matching table and file, and subsequently figuring out
>> some missing columns which may have been added/removed later,
>> can be rather time-consuming and could be made a lot easier (not necessarily
>> perfect) if that information had been present in the first page of
>> a file.
>
> So you've already moved the goalposts from what was claimed in your
> prior message. If the data is not maintained (with 100% reliability)
> during ALTER TABLE, how are you going to do something like "figure out
> missing columns"?
>
> I can see the potential usefulness of a self-documenting table storage
> format, but this proposal isn't that; it's just an unreliable kluge.

If we're looking to have any sort of "out of catalog" documentation of table storage format, what about just having a new relation fork that just "appends" each and every change made to the table formats, including ones rolled back, etc.? Make this relation fork append-only, and dump a completely new set of metadata to it on each and every ALTER TABLE.

This fork would never need to be read by PG, so a relation fork might even be too much. All you really need is a file you can tie to a relation and blindly append "data" to on create/alter statements.

Sure, it will have more information than *needed*, but I can't see it ever growing too big, and people doing forensics rarely complain about having *too much* information available.

--
Aidan Van Dyk    Create like a god,
aidan@highrise.ca    command like a king,
http://www.highrise.ca/    work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes:
> On Wed, Oct 20, 2010 at 10:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I can see the potential usefulness of a self-documenting table storage
>> format, but this proposal isn't that; it's just an unreliable kluge.

> If we're looking to have any sort of "out of catalog" documentation of
> table storage format, what about just having a new relation fork that
> just "appends" each and every change made to the table formats,
> including ones rolled back, etc.

I thought of that too, but I'm not sure if it's going to help enough. The trouble is that the information is only tied to the table itself via file names. In a fsck-recovery scenario, you may not have the correct file names. (One of the multiple problems with Stephen's proposal is that the metadata would only be physically tied to the first segment of a large table, and unless you can trust the file names you won't be able to match it to the other segments.)

[ thinks for a bit... ] Perhaps we could stick some sort of unique ID into tables, which could be correlated to the same unique ID appearing in a metadata fork.

[ thinks some more... ] Of course, at this point you have to start asking why the metadata fork should be thought to be any more reliable than the system catalogs.

> Make this relation fork append only,
> and dump a completely new set of metadata to it each and every ALTER
> TABLE.

You can bet that there'll be somebody out there who whines because their app does lots and lots of repeated ALTER TABLEs, and so the metadata fork grows forever. I think we'd have to be a bit smarter than this.

regards, tom lane
On Wed, Oct 20, 2010 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ thinks for a bit... ] Perhaps we could stick some sort of unique
> ID into tables, which could be correlated to the same unique ID
> appearing in a metadata fork.

It would be awfully nice if the table name was in the first few bytes of the segments ;-)

> [ thinks some more... ] Of course, at this point you have to start
> asking why the metadata fork should be thought to be any more reliable
> than the system catalogs.

It's not. But it's about data duplication. If you've lost some files, others may be found in lost+found, etc. Or maybe you lost only the "main" tablespace, and you have your big-honking-files from another tablespace around, etc. The more copies of the data around, the more chance you have of being able to get something usable from it.

But we already have WAL archive as an external safe backup copy of *everything*. Maybe the cost of those extra forks/duplicates is too much. Maybe it would be cheaper to try and parse the WAL archive, and just specifically rebuild the system catalogs for the couple of times people actually need this type of data? Or maybe a query/function that ultra-paranoid types can run daily, etc., which would show the system catalog information about table storage format that they could save safely, instead of munging GB of WAL logs when they want it...

> You can bet that there'll be somebody out there who whines because their
> app does lots and lots of repeated ALTER TABLEs, and so the metadata fork
> grows forever. I think we'd have to be a bit smarter than this.

Well, for bad habits, we have an equally bad solution ;-)

    BEGIN;
    ALTER TABLE bad_design RENAME TO bad_design_too_old;
    CREATE TABLE bad_design (LIKE bad_design_too_old
        INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
    INSERT INTO bad_design SELECT * FROM bad_design_too_old;
    DROP TABLE bad_design_too_old;
    COMMIT;

It's all about how much extra stuff we want "around" to make forensic/reconstruction-type work easier when they can't go to the documented, tried, tested, "normal restore from backup/WAL". None? Or as much as possible? And what are the tradeoffs?

--
Aidan Van Dyk    Create like a god,
aidan@highrise.ca    command like a king,
http://www.highrise.ca/    work like a slave.
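The query/function idea above needs no server changes at all. A minimal libpq sketch of such a catalog-snapshot client (illustrative only, not an existing tool; it keys off pg_class, pg_attribute and pg_type and takes its connection settings from the usual PG* environment variables):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Print table OID, schema, table, column, type name, typlen and NOT NULL
     * for every ordinary user table, so the output can be stored off-site. */
    int main(void)
    {
        PGconn *conn = PQconnectdb("");     /* PG* environment variables */
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        const char *sql =
            "SELECT c.oid, n.nspname, c.relname, a.attnum, a.attname, "
            "       t.typname, a.attlen, a.attnotnull "
            "FROM pg_class c "
            "JOIN pg_namespace n ON n.oid = c.relnamespace "
            "JOIN pg_attribute a ON a.attrelid = c.oid "
            "JOIN pg_type t ON t.oid = a.atttypid "
            "WHERE c.relkind = 'r' AND a.attnum > 0 AND NOT a.attisdropped "
            "ORDER BY c.oid, a.attnum";

        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_TUPLES_OK) {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            return 1;
        }

        for (int i = 0; i < PQntuples(res); i++)
            for (int j = 0; j < PQnfields(res); j++)
                printf("%s%s", PQgetvalue(res, i, j),
                       j == PQnfields(res) - 1 ? "\n" : "\t");

        PQclear(res);
        PQfinish(conn);
        return 0;
    }

Run from cron, the output can be kept away from the cluster and consulted if the catalogs are ever lost.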
Tom Lane wrote:
>Aidan Van Dyk <aidan@highrise.ca> writes:
>> If we're looking to have any sort of "out of catalog" documentation of
>> table storage format, what about just having a new relation fork that
>> just "appends" each and every change made to the table formats,
>> including ones rolled back, etc.

I presume that means that all table changes are appended to a single central file per database? That would be a bad idea, because in the typical problem scenario, losing this new catalogue basically creates the same problem. It would be preferable to keep the information tied in with the actual table (file) it concerns.

>[ thinks for a bit... ] Perhaps we could stick some sort of unique
>ID into tables, which could be correlated to the same unique ID
>appearing in a metadata fork.

Ideal would be: put the table OID inside the header of each page (either in the official header, or in the special area). This way even lost blocks can be correlated to the same table. I'd still vote for the latest known table definition in the first page. It's by no means perfect, but it will help 99% of all salvage attempts by an order of magnitude.

>[ thinks some more... ] Of course, at this point you have to start
>asking why the metadata fork should be thought to be any more reliable
>than the system catalogs.

Quite. Which is why I wanted the best-effort latest version of the table description in the first page of the table file instead.

>> Make this relation fork append only,
>> and dump a completely new set of metadata to it each and every ALTER
>> TABLE.

>You can bet that there'll be somebody out there who whines because their
>app does lots and lots of repeated ALTER TABLEs, and so the metadata fork
>grows forever. I think we'd have to be a bit smarter than this.

Which means we come full circle and have to conclude that doing anything comprehensive is too invasive for normal operations; best-effort is all a forensics operation wants or can hope for.
--
Stephen.
Life is that brief interlude between nothingness and eternity.
On Wed, Oct 20, 2010 at 5:30 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> Ideal would be: put the table OID inside the header of each page
> (either in the official header, or in the special area).
> This way even lost blocks can be correlated to the same table.
> I'd still vote for the latest known table definition in the first
> page. It's by no means perfect, but it will help 99% of all
> salvage attempts by an order of magnitude.

I don't think we should shrink the amount of usable space by 4 bytes per block and break on-disk compatibility just to make it easier to reconstruct corrupted tables. I'm pretty dubious about the proposal to stuff an otherwise-useless metapage in every heap, too. If you have many small tables, you just doubled your disk utilization - worse than that, maybe, if some of them are empty. If we needed a metapage anyway and had extra space to play with, stuffing some useful forensic information in there might be worthwhile, but I have a hard time thinking that forensics alone is a sufficient justification for such a change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
>On Wed, Oct 20, 2010 at 5:30 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> Ideal would be: put the table OID inside the header of each page
>> (either in the official header, or in the special area).
>> This way even lost blocks can be correlated to the same table.
>> I'd still vote for the latest known table definition in the first
>> page. It's by no means perfect, but it will help 99% of all
>> salvage attempts by an order of magnitude.

> and break on-disk compatibility just to make it easier to

If it's inserted in the "special" area, it will not break any compatibility.

>I don't think we should shrink the amount of usable space by 4 bytes
>per block

Instead of putting it in every page, it could be inserted (say) once every megabyte (if done in the special area) to avoid excessive overhead.

>I'm pretty dubious about the proposal to stuff an otherwise-useless
>metapage in every heap, too.

The information is supposed to go in the special area, so it will not be an extra page.

> If you have many small tables, you just
>doubled your disk utilization

For small tables, the table description typically is small as well, so in the common case it all will still fit in one page.

>- worse than that, maybe, if some of
>them are empty.

An empty table does not contain any critical information which needs to be restored (by definition :-); so the code that inserts the table definition in the special area could easily be instructed *not* to write out this information unless the table actually has entries.

> If we needed a metapage anyway and had extra space to
>play with, stuffing some useful forensic information in there might be
>worthwhile, but I have a hard time thinking that forensics alone is a
>sufficient justification for such a change.

The "change" could easily be made backward compatible with all on-disk formats which support the special area pointer (I'm not sure how far back that is).
--
Stephen.
For children with short attention spans: boomerangs that don't come back.

"Stephen R. van den Berg" <srb@cuci.nl> writes:
> Robert Haas wrote:
>> and break on-disk compatibility just to make it easier to
> If it's inserted in the "special" area, it will not break any
> compatibility.

I'll tell you what I really don't like about this proposal: we discuss some scheme or other for taking over the "special space" in heap pages at least once a year. None of them have been particularly compelling so far, but one may come along that is; especially given that we're now trying to maintain on-disk compatibility across versions. So I think the opportunity cost of assigning a use to that space is mighty high. I don't find this idea important enough to justify foreclosing future uses for the special space.

The real bottom line is this: if you care enough about your data to be willing to expend a large amount of effort on manual recovery attempts, why didn't you have a decent backup scheme in place? There are way too many scenarios where you'll have no hope of doing any such manual recovery anyway.

regards, tom lane
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> If it's inserted in the "special" area, it will not break any
>> compatibility.

>I'll tell you what I really don't like about this proposal: we discuss
>some scheme or other for taking over the "special space" in heap pages
>at least once a year. None of them have been particularly compelling
>so far, but one may come along that is; especially given that we're now
>trying to maintain on-disk compatibility across versions. So I think
>the opportunity cost of assigning a use to that space is mighty high.
>I don't find this idea important enough to justify foreclosing future
>uses for the special space.

Well, I had (of course) thought of that, and the classical solution to this is to specify a certain attribute-based format in order not to cannibalise the space and block it for other future uses. I.e. in the special area, we could start using something like: a 2-byte field length (including the length field), a 1-byte identifier, then the field content.

For the recovery information I'd like to reserve:
identifier: 00: table OID
            01: table layout

>The real bottom line is this: if you care enough about your data to
>be willing to expend a large amount of effort on manual recovery
>attempts, why didn't you have a decent backup scheme in place?

Two obvious valid answers would be: stupidity and/or ignorance, sometimes a strain of bad luck. I know it is a sad state of affairs, but not all users of PostgreSQL are equally knowledgeable/intelligent/responsible.

>There are way too many scenarios where you'll have no hope of doing
>any such manual recovery anyway.

True. It's all a matter of statistics. Judging by the number of reports I find by googling net-history, I'd have to conclude that the proposed extra information would have helped more than half of them.
--
Stephen.
On Thu, Oct 21, 2010 at 11:30 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> For the recovery information I'd like to reserve:
> identifier: 00: table OID
>             01: table layout

So here's a proposal for something that could maybe be implemented. I think I'm leaning against this currently, as there are just too many caveats even for this limited functionality. I'm more inclined to go with the idea someone else proposed of dumping an audit log of all DDL on the table or, after any DDL, dumping the CREATE TABLE statements as pg_dump would generate them to a separate fork. But here goes, just to give you an idea of what I think is doable and how limited it would be.

In this metadata object put:
table oid
table name
number of columns
array of typlen for those columns (-1 for varlena and size for fixed length)

That would have a maximum size of just over 6k, which is too large for a 4k-BLCKSZ database but fits on default 8k databases. It would have enough information to be able to find all the table columns, but not to understand how to interpret the contents, nor what their names or types are.

Including the type brings in a new set of complications. Even if you assume the built-in type OIDs never change, there are going to be type OIDs that correspond to user-defined types. Without the names of those types the OIDs are pretty useless. Just the type OIDs would put you over 8k in the worst case, and the names would put you into the realm of needing arbitrary numbers of blocks for even average cases. Similarly, including the column names would require potentially many blocks.

--
greg
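For a rough feel of that 6k estimate: assuming NAMEDATALEN of 64 and the 1600-attribute limit (MaxHeapAttributeNumber), a worst-case fixed layout of the object described above works out as follows (hypothetical struct, purely to show the arithmetic):

    #include <stdint.h>
    #include <stdio.h>

    #define NAMEDATALEN 64              /* PostgreSQL identifier length */
    #define MAX_COLUMNS 1600            /* MaxHeapAttributeNumber */

    /* Hypothetical worst-case metadata record along the lines described above. */
    typedef struct TableMeta {
        uint32_t table_oid;
        char     table_name[NAMEDATALEN];
        uint32_t ncolumns;
        int32_t  typlen[MAX_COLUMNS];   /* -1 for varlena, else fixed width */
    } TableMeta;

    int main(void)
    {
        /* 4 + 64 + 4 + 1600 * 4 = 6472 bytes: fits an 8k page, not a 4k one. */
        printf("worst-case size: %zu bytes\n", sizeof(TableMeta));
        return 0;
    }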
"Stephen R. van den Berg" <srb@cuci.nl> writes: > Tom Lane wrote: >> There are way too many scenarios where you'll have no hope of doing >> any such manual recovery anyway. > True. It's all a matter of statistics. Judging by the number of reports > I find by googling net-history, I'd have to conclude that the proposed > extra information would have helped more than half of them. Uh, no, it would have helped whatever minuscule fraction of them had the tools and the expertise to make use of the information. This should not be confused with a magic fix-it button. If you've lost your system catalogs you're facing a whole lot of work that will at best recover an untrustworthy version of some of your data. Most of the data-loss reports I've seen appeared to come from people who wouldn't be capable of doing such recovery work even if better tools were available. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Most of the data-loss reports I've seen appeared to come from
> people who wouldn't be capable of doing such recovery work even if
> better tools were available.

No doubt; but the recovery work often winds up in the hands of people with more skills than those responsible for the loss. Whoever comes in to help with recovery is looking for every bit of traction they can get. You'd be amazed at some of the odd straws people can grasp to help recover data.

I don't know how big the special area is, but if enough room could be carved out to include even the relation ID or filename, it might be a big help to someone. I'm pretty skeptical about including information about attributes, though.

-Kevin
On Thu, Oct 21, 2010 at 5:21 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Most of the data-loss reports I've seen appeared to come from
>> people who wouldn't be capable of doing such recovery work even if
>> better tools were available.
>
> No doubt; but the recovery work often winds up in the hands of
> people with more skills than those responsible for the loss.
> Whoever comes in to help with recovery is looking for every bit of
> traction they can get. You'd be amazed at some of the odd straws
> people can grasp to help recover data.
>
> I don't know how big the special area is, but if enough room could
> be carved out to include even the relation ID or filename, it might
> be a big help to someone. I'm pretty skeptical about including
> information about attributes, though.

Unfortunately, the use case for the relation ID or filename is much thinner than the case for the column definitions. You're less likely to lose that information, and if you do lose it you can probably guess by file size or by running strings on the data files. The really hard thing is to remember just exactly what columns you had in there, in what order... and oh yeah, there was that column we dropped. But like you, I'm pretty dubious about trying to store all that stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner wrote:
>Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Most of the data-loss reports I've seen appeared to come from
>> people who wouldn't be capable of doing such recovery work even if
>> better tools were available.

>No doubt; but the recovery work often winds up in the hands of
>people with more skills than those responsible for the loss.
>Whoever comes in to help with recovery is looking for every bit of
>traction they can get. You'd be amazed at some of the odd straws
>people can grasp to help recover data.

And, obviously, the amount of time spent (by the professional) to recover the data goes down a lot when more information is available; which reduces cost and shortens the downtime for the sad sod that lost the data in the first place.
--
Stephen.
Greg Stark wrote:
> On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> > In order to simplify recovery at this point (enormously), it would
> > have been very helpful (at almost negligible cost), to have the name
> > of the table, the name of the columns, and the types of the
> > columns available.
> >
> > Why don't we insert that data into the first page of a regular table
> > file, in the special data area?
> >
> > I'd be willing to create a patch for that (should be pretty easy),
> > if nobody considers it to be a bad idea.
>
> There isn't necessarily one value for these attributes. You can
> rename columns and that rename may succeed and commit or fail and
> rollback. You can drop or add columns and some rows will have or not
> have the added columns at all. You could even add a column, insert
> some rows, then abort -- all in a transaction. So some (aborted) rows
> will have extra columns that aren't even present in the current table
> definition.
>
> All this isn't to say the idea you're presenting is impossible or a
> bad idea. If this meta information was only a hint for forensic
> purposes and you take into account these caveats it might still be
> useful. But I'm not sure how useful. I mean, you can't really decipher
> everything properly without the data in the catalog -- and you have to
> premise this on the idea that you've lost everything in the catalog
> but not the data in other tables. Which seems like a narrow use case.

I was thinking we could dump a flat file every 15 minutes into each database directory that had recovery-useful information. It wouldn't be perfect, but would probably be sufficient for most forensics.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ It's impossible for everything to be true. +
Greg Stark wrote:
>On Thu, Oct 21, 2010 at 11:30 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> For the recovery information I'd like to reserve:
>> identifier: 00: table OID
>>             01: table layout

>So here's a proposal for something that could maybe be implemented. I
>think I'm leaning against this currently as there are just too many
>caveats even for this limited functionality. I'm more inclined to go
>with the idea someone else proposed of dumping an audit log of all DDL
>on the table or, after any DDL, dumping the CREATE TABLE statements as
>pg_dump would generate them to a separate fork.

Yes, but that has two distinct downsides I'd like to avoid:
- It can grow uncontrollably if someone uses ALTER TABLE on a (too) regular basis.
- It separates the data from the table file it pertains to (which could complicate recovery (a lot)).

>In this metadata object put:
>table oid
>table name
>number of columns
>array of typlen for those columns (-1 for varlena and size for fixed length)
[...]
>Including the type brings in a new set of complications. Even if you
>assume the built-in type OIDs never change, there are going to be type OIDs
>that correspond to user-defined types. Without the names of those
>types the OIDs are pretty useless.
>Just the type OIDs would put you over 8k in the worst case, and the names
>would put you into the realm of needing arbitrary numbers of blocks
>for even average cases. Similarly, including the column names would
>require potentially many blocks.

All valid points/concerns. But let's approach this from the side of the forensic analyst instead, and see what information typically really would be needed (for argument's sake, let's call the unfortunate sod that lost the catalog to his database "the customer"):

The customer usually still has access to the developer, or to some developer documentation which documents which columns are used for what. It would most likely document most columns (especially the column names and, to a lesser extent, the column types), but might be lacking some of the more recent changes which (unfortunately) were done on the live database using ALTER TABLE and hadn't been documented properly yet (I'm talking about typical real-world cases I've encountered).

Which means that the analyst would primarily be concerned with getting back the information of the column types and the column names. The next step would be to know about slight deviations from the documented columns (dropped or added columns since the last revision of the docs). Next are custom types. Custom types are likely to be better documented, hence the column name would usually be enough to recover the definition of a custom type from the docs.

Assuming the above assumptions to be true in the common case, I would propose to implement something along the following lines.

Entries for the special area in tables:

0007 00 tableoid
    To be sprinkled in every megabyte or so.
nnnn 01 00 cc tablename...
    cc = number of columns
nnnn 01 01 aaaa bbbb cccc ...
    aaaa/bbbb/cccc are column widths, FFFF for varlena widths
nnnn 01 02 aaaa bbbb cccc ...
    aaaa/bbbb/cccc are type OIDs per column (is 16 bits wide enough for the
    common cases? If not, disregard my suggestion and make these 32 bits wide each)
nnnn 01 03 nn col1name nn col2name ...
    nn = length of the column name that follows
0000
    End of filled special area; no need to parse beyond here in the current page.

The 01 special area types could be present in any number of pages. If they are present, they shall be present starting at the first page of the file, and possibly will be present in the following page(s), until a page is encountered without them. Multiple occurrences of 01 01, 01 02, or 01 03 shall be concatenated to form the complete information set.
--
Stephen.
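A rough parsing sketch for the entries proposed above; the function and the assumption that the entries start at the beginning of the special area are illustrative only -- nothing like this exists in PostgreSQL today:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Walk the proposed recovery entries in a page's special area:
     * 2-byte total length (including the length field itself), 1-byte
     * identifier, then the field content; a 0000 length ends the list. */
    static void scan_recovery_entries(const unsigned char *special, size_t avail)
    {
        size_t off = 0;
        while (off + 3 <= avail) {
            uint16_t len;
            memcpy(&len, special + off, 2);
            if (len < 3 || off + len > avail)
                break;                          /* end marker or corrupt entry */
            uint8_t id = special[off + 2];
            const unsigned char *payload = special + off + 3;
            size_t payload_len = len - 3;
            if (id == 0x00 && payload_len >= 4) {
                uint32_t oid;
                memcpy(&oid, payload, 4);       /* 00: table OID */
                printf("table oid: %u\n", oid);
            } else if (id == 0x01 && payload_len >= 1) {
                /* 01: table layout subrecord (00 name, 01 widths, 02 type
                 * OIDs, 03 column names); concatenate across pages. */
                printf("layout subrecord %d, %zu bytes\n",
                       payload[0], payload_len - 1);
            } else {
                printf("unknown entry %d (%zu bytes)\n", id, payload_len);
            }
            off += len;
        }
    }

    int main(void)
    {
        /* Tiny self-test: one table-OID entry followed by the 0000 end marker. */
        unsigned char buf[16];
        uint16_t len = 7;
        uint32_t oid = 16384;
        memset(buf, 0, sizeof(buf));
        memcpy(buf, &len, 2);
        buf[2] = 0x00;                          /* identifier: table OID */
        memcpy(buf + 3, &oid, 4);
        scan_recovery_entries(buf, sizeof(buf));
        return 0;
    }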
Bruce Momjian wrote:
>Greg Stark wrote:
>> On Tue, Oct 19, 2010 at 1:12 PM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> > In order to simplify recovery at this point (enormously), it would
>> > have been very helpful (at almost negligible cost), to have the name
>> > of the table, the name of the columns, and the types of the
>> > columns available.
>> > Why don't we insert that data into the first page of a regular table
>> > file, in the special data area?

>I was thinking we could dump a flat file every 15 minutes into each
>database directory that had recovery-useful information. It wouldn't be
>perfect, but would probably be sufficient for most forensics.

It would definitely be better than the current state. But it still disconnects the information from the files it belongs to (a bit). From a cost/benefit ratio point of view, I'd still prefer to interlace the information into the table files (which also scales better in case of numerous tables).
--
Stephen.
In relation to the talk and discussions at FOSDEM regarding helping data recovery, I searched the archives for the old thread after I performed my last recovery; for reference:

http://www.postgresql.org/message-id/20101019201223.GA15684@cuci.nl

I haven't checked yet if the proposed space there is still available in the current disk format, but it might serve as a starter and reminder to get the discussion going once more.
--
Stephen.
On 2/3/15 1:43 AM, Stephen R. van den Berg wrote:
> In relation to the talk and discussions at FOSDEM regarding
> helping data recovery, I searched the archives for the
> old thread after I performed my last recovery; for reference:
>
> http://www.postgresql.org/message-id/20101019201223.GA15684@cuci.nl
>
> I haven't checked yet if the proposed space there is still
> available in the current disk format, but it might serve as
> a starter and reminder to get the discussion going once more.

Rather than trying to wedge this into a heap page, ISTM it'd be better to use a fork. Presumably you'd be storing regular tuples that have the essential data from pg_class, pg_attribute (and maybe pg_type).

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby wrote:
>Rather than trying to wedge this into a heap page, ISTM it'd be
>better to use a fork. Presumably you'd be storing regular tuples
>that have the essential data from pg_class, pg_attribute (and maybe
>pg_type).

Well, three things:
- The information preferably is present in the same file with the pages; ideally up front, but reachable through a pointer within the file is acceptable.
- The information ideally is repeated at the beginning of each segment (separate file).
- The information can be in tuple form, but it is acceptable (or even preferable?) to store it as a cleartext SQL CREATE TABLE statement.

Not quite sure which one I'd prefer at this point. The tuple version might be easier to do machine recovery with. But in any case it needs the database name, schema name, table name, column names, column types, and the NULL constraint per column.

The reason putting it in a separate fork is not ideal, is that when doing recovery you might not have more than just the data files, and thus it is not possible to determine which fork file belongs to which data file (the filenames might be lost, as well as the directory association).
--
Stephen.
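For the tuple-ish variant, a sketch of a per-segment record carrying exactly the fields listed above; the names and fixed-width layout are hypothetical, and a real version would need versioning and variable-length fields rather than fixed NAMEDATALEN strings:

    #include <stdint.h>
    #include <stdio.h>

    #define NAMEDATALEN 64                  /* PostgreSQL identifier length */

    /* One column: name, type (by name, so custom types stay identifiable),
     * and whether it is declared NOT NULL. */
    typedef struct ColumnMeta {
        char     col_name[NAMEDATALEN];
        char     type_name[NAMEDATALEN];
        int32_t  typlen;                    /* -1 for varlena */
        uint8_t  not_null;
    } ColumnMeta;

    /* Recovery header repeated at the beginning of every segment file. */
    typedef struct SegmentMeta {
        uint32_t   table_oid;
        char       database_name[NAMEDATALEN];
        char       schema_name[NAMEDATALEN];
        char       table_name[NAMEDATALEN];
        uint32_t   ncolumns;
        ColumnMeta columns[];               /* ncolumns entries follow */
    } SegmentMeta;

    int main(void)
    {
        printf("fixed part: %zu bytes, per column: %zu bytes\n",
               sizeof(SegmentMeta), sizeof(ColumnMeta));
        return 0;
    }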