Info on Data Storage - Mailing list pgsql-hackers

From Thomas Lockhart
Subject Info on Data Storage
Date
Msg-id 376C399B.93BE95D5@alumni.caltech.edu
In response to Re: [HACKERS] Savepoints...  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
istm that this discussion and the one on the 1GB limit on table
segments could form the basis for a missing chapter on "Data Storage"
in the Admin Guide. Would someone (other than Vadim, who we need to
keep coding! :) please keep following this and related threads and
extract the info for the Admin Guide chapter? It doesn't need to be
very long, perhaps just suggesting how to calculate table storage
size, discussing upper limits (e.g. 32-bit OID), and describing the
table segmentation scheme. There is already a chapter (with more
detail than the AG needs) in the Developer's Guide which should be
updated too.

Anyway, both chapters are enclosed (the originals are also in doc/src/sgml/{storage,page}.sgml).
All we really need is the info, and I can do the markup if whoever
picks this up doesn't feel comfortable with trying the SGML markup.

Volunteers appreciated...
                   - Thomas

> > > To have them I need to add tuple id (6 bytes) to heap tuple
> > > header. Are there objections? Though it's not good to increase
> > > tuple header size, subj is, imho, very nice feature...
> > Gee, that's a lot of overhead.  We would go from 40 bytes ->46 bytes.
> 40? offsetof(HeapTupleHeaderData, t_bits) is 31...
> Well, seems that we can remove 5 bytes from tuple header.
> 1. t_hoff (1 byte) may be computed - no reason to store it.
> 2. we need in both t_cmin and t_cmax only when tuple is updated
>    by the same xaction as it was inserted - in such cases we
>    can put delete command id (t_cmax) to t_xmax and set
>    flag HEAP_XMAX_THE_SAME (as t_xmin), in all other cases
>    we will overwrite insert command id with delete command id
>    (no one is interested in t_cmin of committed insert xaction)
>    -> yet another 4 bytes (sizeof command id).
> If now we'll add 6 bytes to header then
> offsetof(HeapTupleHeaderData, t_bits) will be 32 and for
> no-nulls tuples there will be no difference at all
> (with/without additional 6 bytes), due to double alignment
> of header. So, the choice is: new feature or more compact
> (than current) header for tuples with nulls.
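
A quick way to check the alignment arithmetic in the quoted message: the
sketch below assumes "double alignment" rounds the header up to a multiple
of 8 bytes, and the helper and constant names are made up for illustration.
Under that assumption the current 31-byte header, the 26-byte compacted
header, and the 32-byte header with a tuple id all occupy 32 bytes when
there is no null bitmap.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: "double alignment" means rounding up to 8 bytes. */
#define DOUBLE_ALIGN 8

/* Header sizes taken from the quoted message (no null bitmap). */
enum {
    HDR_CURRENT  = 31,          /* offsetof(HeapTupleHeaderData, t_bits) today */
    HDR_COMPACT  = 31 - 5,      /* after dropping t_hoff and one command id */
    HDR_WITH_TID = 31 - 5 + 6   /* compacted header plus the 6-byte tuple id */
};

/* Round len up to the next multiple of DOUBLE_ALIGN. */
size_t double_align(size_t len)
{
    return (len + DOUBLE_ALIGN - 1) / DOUBLE_ALIGN * DOUBLE_ALIGN;
}

/* The claim in the message: adding the tuple id costs nothing after alignment. */
void check_quoted_claim(void)
{
    assert(double_align(HDR_CURRENT) == double_align(HDR_WITH_TID));
}
```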

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California

<Chapter Id="storage">
<Title>Disk Storage</Title>

<Para>
This section needs to be written. Some information is in the FAQ. Volunteers?
- thomas 1998-01-11
</Para>

</Chapter>
<chapter id="page">

<title>Page Files</title>

<abstract>
<para>
A description of the database file default page format.
</para>
</abstract>

<para>
This section provides an overview of the page format used by <productname>Postgres</productname>
classes.  User-defined access methods need not use this page format.
</para>

<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits.  In addition, the term
<firstterm>item</firstterm>
refers to data which is stored in <productname>Postgres</productname> classes.
</para>

<sect1>
<title>Page Structure</title>

<para>
The following table shows how pages in both normal <productname>Postgres</productname> classes and
<productname>Postgres</productname> index classes (e.g., a B-tree index) are structured.

<table tocentry="1">
<title>Sample Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="1">
<thead>
<row>
<entry>
Item
</entry>
<entry>
Description
</entry>
</row>
</thead>

<tbody>

<row>
<entry>
itemPointerData
</entry>
</row>

<row>
<entry>
filler
</entry>
</row>

<row>
<entry>
itemData...
</entry>
</row>

<row>
<entry>
Unallocated Space
</entry>
</row>

<row>
<entry>
ItemContinuationData
</entry>
</row>

<row>
<entry>
Special Space
</entry>
</row>

<row>
<entry>
``ItemData 2''
</entry>
</row>

<row>
<entry>
``ItemData 1''
</entry>
</row>

<row>
<entry>
ItemIdData
</entry>
</row>

<row>
<entry>
PageHeaderData
</entry>
</row>

</tbody>
</tgroup>
</table>
</para>

<!--
.\" Running
.\" .q .../bin/dumpbpages
.\" or
.\" .q .../src/support/dumpbpages
.\" as the postgres superuser
.\" with the file paths associated with
.\" (heap or B-tree index) classes,
.\" .q .../data/base/<database-name>/<class-name>,
.\" will display the page structure used by the classes.
.\" Specifying the
.\" .q -r
.\" flag will cause the classes to be
.\" treated as heap classes and for more information to be displayed.
-->

<para>
The first 8 bytes of each page consist of a page header
(PageHeaderData).
Within the header, the first three 2-byte integer fields
(<firstterm>lower</firstterm>,
<firstterm>upper</firstterm>,
and
<firstterm>special</firstterm>)
represent byte offsets to the start of unallocated space, to the end
of unallocated space, and to the start of <firstterm>special space</firstterm>.
Special space is a region at the end of the page which is allocated at
page initialization time and which contains information specific to an
access method.  The last 2 bytes of the page header,
<firstterm>opaque</firstterm>,
encode the page size and information on the internal fragmentation of
the page.  Page size is stored in each page because frames in the
buffer pool may be subdivided into equal-sized pages on a frame-by-frame
basis within a class.  The internal fragmentation information is
used to aid in determining when page reorganization should occur.
</para>
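
<para>
The header layout described above can be sketched in C as follows. This is
an illustration only, not the actual PageHeaderData declaration: the type
and function names are hypothetical, the encoding of
<firstterm>opaque</firstterm> is not modeled, and an empty page is assumed
to have its unallocated space run from just after the header to the start
of special space.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the 8-byte page header described in the text. */
typedef struct {
    uint16_t lower;    /* byte offset to the start of unallocated space */
    uint16_t upper;    /* byte offset to the end of unallocated space */
    uint16_t special;  /* byte offset to the start of special space */
    uint16_t opaque;   /* page size and fragmentation info (not modeled) */
} SketchPageHeader;

/* Set up an empty page: header at the front, special space at the end. */
void sketch_page_init(SketchPageHeader *hdr, uint16_t page_size,
                      uint16_t special_size)
{
    assert(page_size >= sizeof(SketchPageHeader) + special_size);
    hdr->lower = (uint16_t) sizeof(SketchPageHeader);
    hdr->upper = (uint16_t) (page_size - special_size);
    hdr->special = hdr->upper;
    hdr->opaque = 0;
}

/* Unallocated space is whatever lies between lower and upper. */
uint16_t sketch_page_free_space(const SketchPageHeader *hdr)
{
    return (uint16_t) (hdr->upper - hdr->lower);
}
```
]]></programlisting>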

<para>
Following the page header are item identifiers
(<firstterm>ItemIdData</firstterm>).
New item identifiers are allocated from the first four bytes of
unallocated space.  Because an item identifier is never moved until it
is freed, its index may be used to indicate the location of an item on
a page.  In fact, every pointer to an item
(<firstterm>ItemPointer</firstterm>)
created by <productname>Postgres</productname> consists of a frame number and an index of an item
identifier.  An item identifier contains a byte-offset to the start of
an item, its length in bytes, and a set of attribute bits which affect
its interpretation.
</para>
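
<para>
As a rough illustration of the identifier described above, the sketch below
packs a byte offset, a set of attribute bits, and a length into one 4-byte
value. The 13/6/13 bit split and all names are assumptions made for the
example; the text only says which fields exist and that an identifier takes
four bytes. The item pointer is shown as a frame number plus an identifier
index.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Assumed split of the 4-byte identifier; the text names only the fields. */
#define SK_OFF_BITS  13
#define SK_FLAG_BITS  6
#define SK_LEN_BITS  13

typedef uint32_t SketchItemId;

/* Pack offset, attribute bits and length into one 32-bit identifier. */
SketchItemId sketch_item_id_make(uint32_t off, uint32_t flags, uint32_t len)
{
    assert(off < (1u << SK_OFF_BITS));
    assert(flags < (1u << SK_FLAG_BITS));
    assert(len < (1u << SK_LEN_BITS));
    return off | (flags << SK_OFF_BITS) | (len << (SK_OFF_BITS + SK_FLAG_BITS));
}

uint32_t sketch_item_id_off(SketchItemId id)
{
    return id & ((1u << SK_OFF_BITS) - 1);
}

uint32_t sketch_item_id_flags(SketchItemId id)
{
    return (id >> SK_OFF_BITS) & ((1u << SK_FLAG_BITS) - 1);
}

uint32_t sketch_item_id_len(SketchItemId id)
{
    return id >> (SK_OFF_BITS + SK_FLAG_BITS);
}

/* A pointer to an item: which frame, and which identifier slot within it. */
typedef struct {
    uint32_t frame;
    uint16_t index;
} SketchItemPointer;
```
]]></programlisting>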

<para>
The items themselves are stored in space allocated backwards from
the end of unallocated space.  Usually, the items are not interpreted.
However, when an item is too long to be placed on a single page, or
when fragmentation of the item is desired, the item is divided and
each piece is handled as a distinct item in the following manner.  The
first through the next-to-last pieces are each placed in an item
continuation structure
(<firstterm>ItemContinuationData</firstterm>).
This structure contains both the piece itself and an
itemPointerData
which points to the next piece.  The last piece
is handled normally.
</para>
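
<para>
The two allocation directions can be sketched as below: identifiers grow
forward from the start of unallocated space, four bytes at a time, while
item data grows backward from its end. Everything here is hypothetical
(names, the 8192-byte page size, the absence of special space, and an
identifier simplified to a 16-bit offset plus a 16-bit length), and item
continuation is not modeled: an item that does not fit is simply refused.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE    8192  /* assumed page size */
#define SK_HEADER_SIZE     8  /* page header size given in the text */
#define SK_ITEM_ID_SIZE    4  /* identifier size given in the text */

typedef struct {
    uint16_t lower;                     /* start of unallocated space */
    uint16_t upper;                     /* end of unallocated space */
    unsigned char bytes[SK_PAGE_SIZE];  /* raw page contents */
} SketchPage;

/* Empty page with no special space. */
void sketch_page_reset(SketchPage *p)
{
    p->lower = SK_HEADER_SIZE;
    p->upper = SK_PAGE_SIZE;
    memset(p->bytes, 0, sizeof p->bytes);
}

/* Returns the index of the new identifier, or -1 if the item does not fit. */
int sketch_page_add_item(SketchPage *p, const void *item, uint16_t len)
{
    assert(p->lower <= p->upper);
    uint16_t free_bytes = (uint16_t) (p->upper - p->lower);
    if ((uint32_t) len + SK_ITEM_ID_SIZE > free_bytes)
        return -1;

    int index = (p->lower - SK_HEADER_SIZE) / SK_ITEM_ID_SIZE;

    /* Item data is carved off the end of unallocated space. */
    p->upper = (uint16_t) (p->upper - len);
    memcpy(p->bytes + p->upper, item, len);

    /* The identifier takes the first four bytes of unallocated space. */
    uint16_t slot[2] = { p->upper, len };
    memcpy(p->bytes + p->lower, slot, sizeof slot);
    p->lower = (uint16_t) (p->lower + SK_ITEM_ID_SIZE);

    return index;
}
```
]]></programlisting>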
</sect1>

<sect1>
<title>Files</title>

<para>
<variablelist>
<varlistentry>
<term>
<filename>data/</filename>
</term>
<listitem>
<para>
Location of shared (global) database files.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>
<filename>data/base/</filename>
</term>
<listitem>
<para>
Location of local database files.
</para>
</listitem>
</varlistentry>

</variablelist>
</para>
</sect1>

<sect1>
<title>Bugs</title>

<para>
The page format may change in the future to provide more efficient
access to large objects.
</para>

<para>
This section contains insufficient detail to be of any assistance in
writing a new access method.
</para>
</sect1>
</chapter>
