Thread: (Updated) Table File Format

(Updated) Table File Format

From
"Michael Richards"
Date:
I need a little help on the format of the postgres tables.
I've got this wonderfully corrupted database where just about everything isfubar. I've tried a number of things to get
itback using postgres andrelated tools with no success. It looks like most of the data is there, butthere may be a
smallamount of corruption that's causing all kinds ofproblems.
 
I've broken down and begin development of a tool to allow examination of
thedata within the table files. This could actually be useful for recoveringand undoing changes (or at least until the
row-reusecode goes intoproduction).
 
I've been hacking the file format and trying to find stuff in the source
anddocs as much as possible, but here goes...
a) tuples cannot span multiple pages (yet).b) the data is not platform independant??? Ie the data from a sun
looksdifferentfrom an intel?
 
For every page, I see that the first 2 words are for the end of the tuplepointers and the beginning of the tuple data.
What are the next 2 words used for? In all my cases they appear to be set
to0x2000.
Following that I find the 2 word tuple pointers.
The first word appears to be the offset in the page where the tuple can be
found but the MSB has to be stripped off (haven't found it's function in the
source yet).
The second is the transactionid that, if comitted gives this tuple
visibility???

Are these tuple pointers always stored in order of last to first? Or should
I be loading and sorting them according to offset?
Now on to the tuple data... I have my tool to the point where it extracts
all the tuple data from the table, but I haven't been able to find the place
in the postgres source that explains the format. I assume a tuple contains a
number of attributes (referencing pg_attribute). Those not found in the
tuple would be assumed to be NULL.
Since I'm ignoring transaction ids right now, I'm planning on extracting
all
the tuple and ordering them by oid so you can see all the comitted and
uncomitted changes. I may even make it look good once I've recovered my
data...
-Michael





Re: (Updated) Table File Format

From
Tom Lane
Date:
"Michael Richards" <miker@interchange.ca> writes:
>  Following that I find the 2 word tuple pointers.
> The first word appears to be the offset in the page where the tuple can be
> found but the MSB has to be stripped off (haven't found it's function in the
> source yet).
> The second is the transactionid that, if comitted gives this tuple
> visibility???

No, offset and length --- there is also a 2-bit flags field.  Look at
the page and item declarations in src/include/storage/

Someone else was recently working on a bit-level dump tool, but I've
forgotten who.
        regards, tom lane


Re: (Updated) Table File Format

From
Hannu Krosing
Date:
Michael Richards wrote:
> 
>  I need a little help on the format of the postgres tables.
> 
>  I've got this wonderfully corrupted database where just about everything is
>  fubar. I've tried a number of things to get it back using postgres and
>  related tools with no success. It looks like most of the data is there, but
>  there may be a small amount of corruption that's causing all kinds of
>  problems.

Find attached a python script that I used to get deleted (actually all
;) 
records from a database table.

It was not corrupted, just a simple programming error in client software 
had deleted more than needed. 

Fortunately it was not vacuumed so the data itself (info for web-based
paper 
postcard sending system) was there 

It works as-is only for my table as the field extraction code is
hard-coded, but 
it should be quite easy to modify for your needs

It worked 1 year ago probably on 6.4.x . I hope that the structure had
not 
changed since.

sendcard.py is the actual script used, pgtabdump.py is a somewhat
cleaned-up version


---------------
Hannu#!/usr/bin/python

import sys,os,struct,string

site_base = "/var/lib/pgsql/base"
db_name = "betest"
table_name = "test"

db_path = os.path.join(site_base,db_name)
table_path = os.path.join(db_path,table_name)

page_size = 8*1024

def strbits(s,len):   bits = []   while s:       c = s[0]       s = s[1:]       b = struct.unpack('B',c)[0]       for i
inrange(8):           if b & (1<<i): bits.append(1)           else: bits.append(0)   return
string.join(map(str,bits),'')[:len]


class table_page:   "class to represent a database table page"   def __init__(self,fd,page_nr):
fd.seek(page_nr*page_size)      self.rawdata = fd.read(page_size)       self.lower,\       self.upper,\
self.special,\      self.opaque = struct.unpack('HHHH',self.rawdata[:8])       self.item_pointers=[]
self.items=[]      for i in range(8,self.lower,4):           rawItemIdData = self.rawdata[i:i+4]
ItemIdData_I32= struct.unpack('L',rawItemIdData)[0]           if not ItemIdData_I32: break           lp_len =
int(ItemIdData_I32>> 17)           lp_flags = int((ItemIdData_I32 >> 15) & 3)           lp_off = int(ItemIdData_I32 &
0x7fff)
           self.item_pointers.append(hex(ItemIdData_I32),lp_off,lp_flags,lp_len)           rawItemData =
self.rawdata[lp_off:lp_off+lp_len]          t_oid  = struct.unpack('L', rawItemData[ 0: 4])[0]           t_001  =
struct.unpack('L',rawItemData[ 4: 8])[0]           t_002  = struct.unpack('L', rawItemData[ 8:12])[0]           t_xmin
=struct.unpack('L', rawItemData[12:16])[0]           t_xmax = struct.unpack('L', rawItemData[16:20])[0]
t_ctid= struct.unpack('LH', rawItemData[20:26])             t_fcnt = struct.unpack('H', rawItemData[26:28])[0]
t_xxx  = struct.unpack('2B', rawItemData[28:30])             t_doff = struct.unpack('B', rawItemData[30:31])[0]
 t_mask = strbits(rawItemData[31:t_doff],t_fcnt)   
 
#            t_mask = rawItemData[31:t_doff]           t_data = rawItemData[t_doff:]
self.items.append(t_oid,t_001,t_002,t_xmin,t_xmax,t_ctid,t_fcnt,t_xxx,t_doff,t_mask,t_data)  def __str__(self):
strval= [str((self.lower, self.upper, self.special, self.opaque))]
strval.append(string.join(map(str,self.item_pointers),'\n'))      strval.append(string.join(map(repr,self.items),'\n'))
     return string.join(strval,'\n------------------\n')  
 


if __name__=="__main__":   print '# dumping %s' % table_path   fd = open(table_name)   page = table_page(fd,0)   print
page
#!/usr/bin/python

import sys,os,struct,string

table_name = "sendcard"

page_size = 8*1024

def strbits(s,len):   bits = []   while s:       c = s[0]       s = s[1:]       b = struct.unpack('B',c)[0]       for i
inrange(8):           if b & (1<<i): bits.append(1)           else: bits.append(0)   bits =
string.join(map(str,bits[:len]),'')   return bits
 

"""
rapos=> \d sendcard
Table    = sendcard
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| card_id                          | int4 default nextval ( 'card_id_ |     4 |
| card_name                        | text                             |   var |
| recipient_name                   | text                             |   var |
| recipient_street_name            | text                             |   var |
| recipient_house_nr               | text                             |   var |
| recipient_apartment_nr           | text                             |   var |
| recipient_city                   | text                             |   var |
| recipient_village                | text                             |   var |
| recipient_state                  | text                             |   var |
| recipient_zip_code               | text                             |   var |
| sender_name                      | text                             |   var |
| sender_email                     | text                             |   var |
| message                          | text                             |   var |
| bank_ref_id                      | text                             |   var |
| delivery_nr                      | text                             |   var |
| delivery_route                   | text                             |   var |
| sender_remote_addr               | text                             |   var |
| card_cost                        | float8                           |     8 |
| cookie_id                        | int4 default nextval ( 'cookie_i |     4 |
| is_payed                         | text                             |   var |
| printing_date                    | datetime                         |     8 |
| delivery_date                    | date                             |     4 |
| payment_date                     | date                             |     4 |
| entered_at                       | timestamp default now ( )        |     4 |
| send_bill_to                     | text                             |   var |
+----------------------------------+----------------------------------+-------+
"""
create_statement = """\
create table savecard(    u_oid                  int,   u_xmin                 int,   u_xmax                 int,
card_id               int4,   card_name              text,   recipient_name         text,   recipient_street_name
text,  recipient_house_nr     text,   recipient_apartment_nr text,   recipient_city         text,   recipient_village
  text,   recipient_state        text,   recipient_zip_code     text,   sender_name            text,   sender_email
     text,   message                text,   bank_ref_id            text,   delivery_nr            text,
delivery_route        text,   sender_remote_addr     text 
 
);
copy savecard from stdin;
"""  
def bin2text(buf,offset):   offset = ((offset + 3)/4)*4   len = int(struct.unpack('L',buf[offset:offset+4])[0])   str =
buf[offset+4:offset+len]  newoff = offset+len   return str,newoff  
 
class table_page:   "class to represent a database table page"   def __init__(self,fd,page_nr):
fd.seek(page_nr*page_size)      self.rawdata = fd.read(page_size)       self.lower,\       self.upper,\
self.special,\      self.opaque = struct.unpack('HHHH',self.rawdata[:8])       self.item_pointers=[]
self.items=[]      for i in range(8,self.lower,4):           rawItemIdData = self.rawdata[i:i+4]
ItemIdData_I32= struct.unpack('L',rawItemIdData)[0]           if not ItemIdData_I32: break           lp_len =
int(ItemIdData_I32>> 17)           lp_flags = int((ItemIdData_I32 >> 15) & 3)           lp_off = int(ItemIdData_I32 &
0x7fff)
           self.item_pointers.append(hex(ItemIdData_I32),lp_off,lp_flags,lp_len)           rawItemData =
self.rawdata[lp_off:lp_off+lp_len]          t_oid  = struct.unpack('L', rawItemData[ 0: 4])[0]           t_001  =
struct.unpack('L',rawItemData[ 4: 8])[0]           t_002  = struct.unpack('L', rawItemData[ 8:12])[0]           t_xmin
=struct.unpack('L', rawItemData[12:16])[0]           t_xmax = struct.unpack('L', rawItemData[16:20])[0]
t_ctid= struct.unpack('LH', rawItemData[20:26])             t_fcnt = struct.unpack('H', rawItemData[26:28])[0]
t_xxx  = struct.unpack('2B', rawItemData[28:30])             t_doff = struct.unpack('B', rawItemData[30:31])[0]
 t_mask = strbits(rawItemData[31:t_doff],t_fcnt)              t_data = rawItemData[t_doff:]           id =
int(struct.unpack('L',t_data[:4])[0])          values = [int(t_oid),int(t_xmin),int(t_xmax),id]           texts = []
      offset = 4           for c in t_mask[1:17]:               if not int(c):                   values.append('')
            continue               text,offset = bin2text(t_data,offset)               values.append(repr(text))
  values=values+texts           self.items.append(values)
 
#            self.items.append(t_oid,t_001,t_002,t_xmin,t_xmax,t_ctid,t_fcnt,t_xxx,t_doff,t_mask,t_data)   def
__str__(self):
#        strval = [str((self.lower, self.upper, self.special, self.opaque))]
#        strval.append(string.join(map(str,self.item_pointers),'\n'))
#        strval.append(string.join(map(repr,self.items),'\n'))
#        return string.join(strval,'\n------------------\n')         rows = []       for record in self.items:
rows.append(string.join(map(str,record),'\t'))       return string.join(rows,'\n')
 

if __name__=="__main__":
#    print '# dumping %s' % table_name   print create_statement   fd = open(table_name)    page_nr = 0   while 1:
  try:              page = table_page(fd,page_nr)           print page       except:           break       page_nr =
page_nr+ 1   print '\\.\n' 

Re: (Updated) Table File Format

From
"Michael Richards"
Date:
Okay,

Where would I find a definition of the tuple data? I didn't see anything
promising in include/storage?

I've found a definition for the page inside pagebuf.h That clears up all the
page stuff. I'm still having a little trouble decoding the tuple data
within. Hannu Krosing sent me a python script to do the extract, but having
never seen a line of Python before in my life, I'm having a little trouble
with the actual tuple data. I can see where the actual transaction
visibility info is in the tuple data, but the actual data... nope. My
program (c++) is at the point where it will create tuple objects for every
block of "tuple" data within the page.

thanks
-Michael

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Michael Richards" <miker@interchange.ca>
Cc: <pgsql-hackers@postgresql.org>
Sent: Thursday, December 14, 2000 12:44 AM
Subject: Re: [HACKERS] (Updated) Table File Format


> "Michael Richards" <miker@interchange.ca> writes:
> >  Following that I find the 2 word tuple pointers.
> > The first word appears to be the offset in the page where the tuple can
be
> > found but the MSB has to be stripped off (haven't found it's function in
the
> > source yet).
> > The second is the transactionid that, if comitted gives this tuple
> > visibility???
>
> No, offset and length --- there is also a 2-bit flags field.  Look at
> the page and item declarations in src/include/storage/
>
> Someone else was recently working on a bit-level dump tool, but I've
> forgotten who.
>
> regards, tom lane



Re: (Updated) Table File Format

From
Hannu Krosing
Date:
Michael Richards wrote:
> 
> Okay,
> 
> Where would I find a definition of the tuple data? I didn't see anything
> promising in include/storage?
> 
> I've found a definition for the page inside pagebuf.h That clears up all the
> page stuff. I'm still having a little trouble decoding the tuple data
> within. Hannu Krosing sent me a python script to do the extract, but having
> never seen a line of Python before in my life, I'm having a little trouble
> with the actual tuple data. I can see where the actual transaction
> visibility info is in the tuple data, but the actual data... nope. My
> program (c++) is at the point where it will create tuple objects for every
> block of "tuple" data within the page.

IIRC, the data field format for individual fields is the same as defined
in 
the back-end/front-end protocol for binary cursors.

if there are any NULL fields in the record then there is a flag
somewhere in 
the tuple header and a bitmap of N*32 bits (N=no_of_fields/32) for
missing .
It is possible that there is no flag and you must deduce the presence of 
bitmap from the tuple-header length, im not sure which way it was.

The actual fields in a table and their order must be extracted from
pg_class 
and pg_attribute tables.

------------
Hannu