Thread: Savepoints...

Savepoints...

From

Vadim Mikheev

Date:

16 June 1999, 09:13:52

To have them I need to add tuple id (6 bytes) to heap tuple
header. Are there objections? Though it's not good to increase
tuple header size, subj is, imho, very nice feature...

Implementation is , hm, "easy":

- heap_insert/heap_delete/heap_replace/heap_mark4update will
  remember updated tid (and current command id) in relation cache
  and store previously updated tid (remembered in relation cache)
  in additional heap header tid;

- lmgr will remember command id when lock was acquired;
- for a savepoint we will just store command id when the savepoint
  was set;
- when going to sleep due to concurrent same-row update, backend
  will store MyProc and tuple id in shmem hash table.

When rolling back to a savepoint, backend will:

- release locks acquired after savepoint;
- for a relation updated after savepoint, get last updated tid from
  relation cache, walk through relation, set
  HEAP_XMIN_INVALID/HEAP_XMAX_INVALID in all tuples updated after
  savepoint and wake up concurrent writers blocked on these tuples
  (using shmem hash table mentioned above).
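
A minimal sketch of the bookkeeping and the rollback walk described above, not actual backend code. All names (SketchTuple, SketchRelation, remember_update, rollback_to_savepoint) and the flag values are hypothetical. It assumes each tuple is touched at most once per transaction, that tids are plain array indexes, and that a savepoint stores the id of the first command to run after it. Waking up blocked writers is left out.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_TID  (-1)   /* hypothetical "no previous tuple" marker */
#define XMIN_INVALID 0x01   /* stand-in for HEAP_XMIN_INVALID */
#define XMAX_INVALID 0x02   /* stand-in for HEAP_XMAX_INVALID */

typedef struct {
    uint32_t cid;        /* command id that touched this tuple */
    int      prev_tid;   /* the additional header tid: previously updated tuple */
    int      is_insert;  /* 1 = inserted by cid, 0 = deleted/marked by cid */
    unsigned flags;
} SketchTuple;

typedef struct {
    SketchTuple *tuples;   /* the "heap", indexed by tid */
    int          last_tid; /* relation cache: most recently updated tid */
} SketchRelation;

/* What heap_insert/heap_delete/... would do: chain the tuple to the
 * previously updated one and remember it in the relation cache. */
static void remember_update(SketchRelation *rel, int tid, uint32_t cid, int is_insert)
{
    SketchTuple *t = &rel->tuples[tid];

    assert(tid >= 0);
    t->cid = cid;
    t->is_insert = is_insert;
    t->prev_tid = rel->last_tid;
    rel->last_tid = tid;
}

/* Walk the chain backwards, invalidating every change made by a command
 * with id >= savepoint_cid. Returns the number of tuples undone. */
static int rollback_to_savepoint(SketchRelation *rel, uint32_t savepoint_cid)
{
    int undone = 0;

    while (rel->last_tid != INVALID_TID) {
        SketchTuple *t = &rel->tuples[rel->last_tid];

        if (t->cid < savepoint_cid)
            break;                      /* change predates the savepoint */
        t->flags |= t->is_insert ? XMIN_INVALID : XMAX_INVALID;
        rel->last_tid = t->prev_tid;    /* unlink from the chain */
        undone++;
    }
    return undone;
}

/* Usage: one insert before the savepoint, two changes after it. */
static int demo_rollback(void)
{
    SketchTuple heap[4] = {{0}};
    SketchRelation rel = { heap, INVALID_TID };

    remember_update(&rel, 0, 1, 1);   /* command 1 inserts tuple 0 */
    /* savepoint set here: the next command id is 2 */
    remember_update(&rel, 1, 2, 1);   /* command 2 inserts tuple 1 */
    remember_update(&rel, 2, 3, 0);   /* command 3 deletes tuple 2 */
    return rollback_to_savepoint(&rel, 2);
}
```

In this sketch the walk stops at the first tuple whose command id is older than the savepoint, so its cost depends on the number of changes being undone rather than on the size of the relation.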

The last feature (waking up concurrent writers) is the hardest
part to implement. AFAIK, Oracle 7.3 was not able to do it.
Can someone comment on whether this feature is implemented in
Oracle 8.X or other DBMSes?

Now about implicit savepoints. Backend will place them before
user statements execution. In the case of failure, transaction
state will be rolled back to the one before execution of query.
As side-effect, this means that we'll get rid of complaints
about entire transaction abort in the case of mistyping
causing abort due to parser errors...

Comments?

Vadim

Re: [HACKERS] Savepoints...

From

Bruce Momjian

Date:

16 June 1999, 10:01:02

> To have them I need to add tuple id (6 bytes) to heap tuple
> header. Are there objections? Though it's not good to increase 
> tuple header size, subj is, imho, very nice feature...

Gee, that's a lot of overhead.  We would go from 40 bytes ->46 bytes.

How is this different from the tid or oid?  Reading your description, I
see there probably isn't another way to do it.



--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

RE: [HACKERS] Savepoints...

From

"Hiroshi Inoue"

Date:

16 June 1999, 23:19:06


> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Wednesday, June 16, 1999 10:13 PM
> To: PostgreSQL Developers List
> Subject: [HACKERS] Savepoints...
> 
> 
> To have them I need to add tuple id (6 bytes) to heap tuple
> header. Are there objections? Though it's not good to increase 
> tuple header size, subj is, imho, very nice feature...
> 
> Implementation is , hm, "easy":
> 
> - heap_insert/heap_delete/heap_replace/heap_mark4update will
>   remember updated tid (and current command id) in relation cache
>   and store previously updated tid (remembered in relation cache)
>   in additional heap header tid;

> - lmgr will remember command id when lock was acquired;

Does this mean that many writing commands in a transaction 
require many command id-s to remember ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

Re: [HACKERS] Savepoints...

From

Vadim Mikheev

Date:

16 June 1999, 23:58:22

Hiroshi Inoue wrote:
> 
> > - lmgr will remember command id when lock was acquired;
> 
> Does this mean that many writing commands in a transaction
> require many command id-s to remember ?

Did you mean such cases:

begin;
...
update t set...;
...
update t set...;
...
end;

?

We'll remember command id for the first "update t" only
(i.e. for the first ROW EXCLUSIVE mode lock over table t).
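
A rough sketch of that rule, with hypothetical names (SketchLock, sketch_lock_acquire, sketch_lock_rollback) that are not lmgr's real interface. It assumes a savepoint stores the id of the first command to run after it, and that rolling back releases a lock only if its first acquisition happened at or after that command id.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int      count;      /* how many times the lock is held; 0 = not held */
    uint32_t first_cid;  /* command id of the first acquisition */
} SketchLock;

/* Only the first acquisition records a command id; later ones just count. */
static void sketch_lock_acquire(SketchLock *lock, uint32_t cid)
{
    if (lock->count == 0)
        lock->first_cid = cid;
    lock->count++;
}

/* Release the lock if it was first taken after the savepoint.
 * Returns 1 if the lock was released, 0 if it was kept. */
static int sketch_lock_rollback(SketchLock *lock, uint32_t savepoint_cid)
{
    if (lock->count > 0 && lock->first_cid >= savepoint_cid) {
        lock->count = 0;
        return 1;
    }
    return 0;
}

/* Usage: two "update t" commands, then rollback to a later savepoint. */
static int demo_locks(void)
{
    SketchLock t = { 0, 0 };

    sketch_lock_acquire(&t, 1);   /* first "update t" */
    sketch_lock_acquire(&t, 5);   /* second "update t" */
    assert(t.count == 2 && t.first_cid == 1);

    /* savepoint set before command 3: the lock predates it and is kept */
    return sketch_lock_rollback(&t, 3);
}
```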

Vadim

RE: [HACKERS] Savepoints...

From

"Hiroshi Inoue"

Date:

17 June 1999, 00:10:48


> -----Original Message-----
> From: root@sunpine.krs.ru [mailto:root@sunpine.krs.ru]On Behalf Of Vadim
> Mikheev
> Sent: Thursday, June 17, 1999 12:58 PM
> To: Hiroshi Inoue
> Cc: PostgreSQL Developers List
> Subject: Re: [HACKERS] Savepoints...
> 
> 
> Hiroshi Inoue wrote:
> > 
> > > - lmgr will remember command id when lock was acquired;
> > 
> > Does this mean that many writing commands in a transaction
> > require many command id-s to remember ?
> 
> Did you mean such cases:
>

Yes.
> begin;
> ...
> update t set...;
> ...
> update t set...;
> ...
> end;
> 
> ?
> 
> We'll remember command id for the first "update t" only
> (i.e. for the first ROW EXCLUSIVE mode lock over table t).
>

How to reduce lock counter for  ROW EXCLUSIVE mode lock 
over table t?


And more questions.

HEAP_MARKED_FOR_UPDATE state could be rollbacked ?

For example

..
[savepoint 1]
select .. from t1 where key=1 for update;
[savepoint 2]
select .. from t1 where key=1 for update;
[savepoint 3]
update t1 set .. where key=1;

Rollback to savepoint 3 OK ?
Rollback to savepoint 2 OK ?
Rollback to savepoint 1 OK ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

Re: [HACKERS] Savepoints...

From

Vadim Mikheev

Date:

17 June 1999, 01:38:47

Hiroshi Inoue wrote:
> 
> >
> > We'll remember command id for the first "update t" only
> > (i.e. for the first ROW EXCLUSIVE mode lock over table t).
> >
> 
> How to reduce lock counter for  ROW EXCLUSIVE mode lock
> over table t?

No reason to do it for ROW EXCLUSIVE mode lock (backend releases
such locks only at commit/rollback [to savepoint]), but we have to
do it in some other cases - when we explicitly release acquired locks
after scan/statement is done. And so, you're right: in these cases
we have to track lock acquisitions. Well, we'll add new arg to
LockAcquire (and other funcs; we have to do it anyway to implement
NO WAIT, WAIT XXX secs locks) to flag lmgr that if the lock counter
is not 0 (for 0s - i.e. first lock acquisition - command id will be
remembered by lmgr anyway) then this counter must be preserved in
implicit savepoint. In the case of abort lock counters will be restored.
Space allocated in implicit savepoint will be released.

All the above will work as long as there is no UNLOCK statement.

Thanks!

> 
> And more questions.
> 
> HEAP_MARKED_FOR_UPDATE state could be rollbacked ?

Yes. FOR UPDATE changes t_xmax and t_cmax.

Vadim

Re: [HACKERS] Savepoints...

From

Vadim Mikheev

Date:

17 June 1999, 03:54:28

Bruce Momjian wrote:
> 
> > To have them I need to add tuple id (6 bytes) to heap tuple
> > header. Are there objections? Though it's not good to increase
> > tuple header size, subj is, imho, very nice feature...
> 
> Gee, that's a lot of overhead.  We would go from 40 bytes ->46 bytes.

40? offsetof(HeapTupleHeaderData, t_bits) is 31...

Well, seems that we can remove 5 bytes from tuple header.

1. t_hoff (1 byte) may be computed - no reason to store it.
2. we need in both t_cmin and t_cmax only when tuple is updated
   by the same xaction as it was inserted - in such cases we
   can put delete command id (t_cmax) to t_xmax and set
   flag HEAP_XMAX_THE_SAME (as t_xmin), in all other cases
   we will overwrite insert command id with delete command id
   (no one is interested in t_cmin of committed insert xaction)
   -> yet another 4 bytes (sizeof command id).
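
A small sketch of how item 2 could work with a single command id field. The struct, the function names and the bit value chosen for HEAP_XMAX_THE_SAME are all hypothetical; the real HeapTupleHeaderData has more fields.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchXid;   /* transaction id, 4 bytes */
typedef uint32_t SketchCid;   /* command id, 4 bytes */

#define SKETCH_XMAX_THE_SAME 0x01   /* assumed bit for HEAP_XMAX_THE_SAME */

typedef struct {
    SketchXid xmin;    /* inserting transaction */
    uint32_t  xmax;    /* deleting xid, or delete command id if flag is set */
    SketchCid cid;     /* insert command id, until a foreign delete overwrites it */
    unsigned  flags;
} SketchHeader;

static void sketch_insert(SketchHeader *h, SketchXid xid, SketchCid cid)
{
    h->xmin = xid;
    h->xmax = 0;
    h->cid = cid;
    h->flags = 0;
}

static void sketch_delete(SketchHeader *h, SketchXid xid, SketchCid cid)
{
    if (xid == h->xmin) {
        /* same xaction: xmax is implied by xmin, so reuse the slot for cmax */
        h->xmax = cid;
        h->flags |= SKETCH_XMAX_THE_SAME;
    } else {
        /* other xaction: the inserter's command id is no longer needed */
        h->xmax = xid;
        h->cid = cid;
    }
}

static SketchXid sketch_get_xmax(const SketchHeader *h)
{
    return (h->flags & SKETCH_XMAX_THE_SAME) ? h->xmin : h->xmax;
}

static SketchCid sketch_get_cmax(const SketchHeader *h)
{
    return (h->flags & SKETCH_XMAX_THE_SAME) ? (SketchCid) h->xmax : h->cid;
}

/* Usage: tuple a is deleted by its own xaction, tuple b by another one. */
static int demo_header(void)
{
    SketchHeader a, b;

    sketch_insert(&a, 100, 2);
    sketch_delete(&a, 100, 7);
    assert(a.cid == 2);           /* insert command id is still available */

    sketch_insert(&b, 100, 2);
    sketch_delete(&b, 200, 4);

    return sketch_get_xmax(&a) == 100 && sketch_get_cmax(&a) == 7
        && sketch_get_xmax(&b) == 200 && sketch_get_cmax(&b) == 4;
}
```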

If now we'll add 6 bytes to header then 
offsetof(HeapTupleHeaderData, t_bits) will be 32 and for
no-nulls tuples there will be no difference at all
(with/without additional 6 bytes), due to double alignment
of header. So, the choice is: new feature or more compact
(than current) header for tuples with nulls.
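
The arithmetic behind this can be checked with a few lines of C: 31 - 1 - 4 = 26 bytes after the two removals, and 26 + 6 = 32 with the additional tid. The helper names are hypothetical, and the sketch assumes 8-byte double alignment and a null bitmap of one bit per attribute rounded up to whole bytes, which may not match the actual header layout exactly.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ALIGNOF_DOUBLE 8   /* assumed double alignment */

/* Round len up to the next multiple of the double alignment. */
static size_t sketch_maxalign(size_t len)
{
    return (len + SKETCH_ALIGNOF_DOUBLE - 1)
           & ~(size_t) (SKETCH_ALIGNOF_DOUBLE - 1);
}

/* fixed: offsetof(HeapTupleHeaderData, t_bits) for the layout in question.
 * The null bitmap is present only when the tuple has nulls. */
static size_t sketch_header_size(size_t fixed, int natts, int has_nulls)
{
    size_t bitmap;

    assert(natts >= 0);
    bitmap = has_nulls ? (size_t) (natts + 7) / 8 : 0;
    return sketch_maxalign(fixed + bitmap);
}
```

Under these assumptions, a tuple without nulls takes a 32-byte header in all three layouts (31, 26 and 32 bytes before alignment). For an 8-attribute tuple with nulls, sketch_header_size(26, 8, 1) gives 32 while sketch_header_size(32, 8, 1) gives 40, which is the trade-off between the new feature and a more compact header.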

> 
> How is this different from the tid or oid?  Reading your description, I

t_ctid could be used but would require additional disk write.

> see there probably isn't another way to do it.

There is one - WAL. I'm thinking about it, but it's too long story -:)

BTW, additional tid in header would allow us to implement
RI/U constraints without rules: knowing what tuples were changed
we could just read these tuples and perform checks. This would be
faster and wouldn't require storing deferred rule plans in memory.

I still like the idea of deferred rules, Jan - they allow
to implement much more complex constraints than RI/U ones.
Though, did you think about [deferred] statement level triggers
implementation, Jan? You are the best one who could make it,
because they are children of overwrite system and PL.

Vadim

Re: [HACKERS] Savepoints...

From

Bruce Momjian

Date:

17 June 1999, 09:41:51

> Bruce Momjian wrote:
> > 
> > > To have them I need to add tuple id (6 bytes) to heap tuple
> > > header. Are there objections? Though it's not good to increase
> > > tuple header size, subj is, imho, very nice feature...
> > 
> > Gee, that's a lot of overhead.  We would go from 40 bytes ->46 bytes.
> 
> 40? offsetof(HeapTupleHeaderData, t_bits) is 31...

Yes, I saw this.  I even updated the FAQ to show a 32-byte overhead.

> Well, seems that we can remove 5 bytes from tuple header.

I was hoping you could do something like this.

> 1. t_hoff (1 byte) may be computed - no reason to store it.

Yes.

> 2. we need in both t_cmin and t_cmax only when tuple is updated
>    by the same xaction as it was inserted - in such cases we 
>    can put delete command id (t_cmax) to t_xmax and set
>    flag HEAP_XMAX_THE_SAME (as t_xmin), in all other cases
>    we will overwrite insert command id with delete command id
>    (no one is interested in t_cmin of committed insert xaction)
>    -> yet another 4 bytes (sizeof command id).

Good.

> 
> If now we'll add 6 bytes to header then 
> offsetof(HeapTupleHeaderData, t_bits) will be 32 and for
> no-nulls tuples there will be no difference at all
> (with/without additional 6 bytes), due to double alignment
> of header. So, the choice is: new feature or more compact
> (than current) header for tuples with nulls.

That's a tough one.  What do other DB's have for row overhead?

> > How is this different from the tid or oid?  Reading your description, I
> 
> t_ctid could be used but would require additional disk write.

OK, I understand.

> 
> > see there probably isn't another way to do it.
> 
> There is one - WAL. I'm thinking about it, but it's too long story -:)

OK.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Info on Data Storage

From

Thomas Lockhart

Date:

19 June 1999, 21:51:07

istm that this discussion and the one on the 1GB limit on table
segments could form the basis for a missing chapter on "Data Storage"
in the Admin Guide. Would someone (other than Vadim, who we need to
keep coding! :) please keep following this and related threads and
extract the info for the Admin Guide chapter? It doesn't need to be
very long, perhaps just suggesting how to calculate table storage
size, discussing upper limits (e.g. 32-bit OID), and describing the
table segmentation scheme. There is already a chapter (with more
detail than the AG needs) in the Developer's Guide which should be
updated too.

Anyway, both chapters are enclosed (the originals are also in
doc/src/sgml/{storage,page}.sgml).
All we really need is the info, and I can do the markup if whoever
picks this up doesn't feel comfortable with trying the SGML markup.

Volunteers appreciated...
                   - Thomas

> > > To have them I need to add tuple id (6 bytes) to heap tuple
> > > header. Are there objections? Though it's not good to increase
> > > tuple header size, subj is, imho, very nice feature...
> > Gee, that's a lot of overhead.  We would go from 40 bytes ->46 bytes.
> 40? offsetof(HeapTupleHeaderData, t_bits) is 31...
> Well, seems that we can remove 5 bytes from tuple header.
> 1. t_hoff (1 byte) may be computed - no reason to store it.
> 2. we need in both t_cmin and t_cmax only when tuple is updated
>    by the same xaction as it was inserted - in such cases we
>    can put delete command id (t_cmax) to t_xmax and set
>    flag HEAP_XMAX_THE_SAME (as t_xmin), in all other cases
>    we will overwrite insert command id with delete command id
>    (no one is interested in t_cmin of committed insert xaction)
>    -> yet another 4 bytes (sizeof command id).
> If now we'll add 6 bytes to header then
> offsetof(HeapTupleHeaderData, t_bits) will be 32 and for
> no-nulls tuples there will be no difference at all
> (with/without additional 6 bytes), due to double alignment
> of header. So, the choice is: new feature or more compact
> (than current) header for tuples with nulls.

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California

<Chapter Id="storage">
<Title>Disk Storage</Title>

<Para>
This section needs to be written. Some information is in the FAQ. Volunteers?
- thomas 1998-01-11
</Para>

</Chapter>
<chapter id="page">

<title>Page Files</title>

<abstract>
<para>
A description of the database file default page format.
</para>
</abstract>

<para>
This section provides an overview of the page format used by <productname>Postgres</productname>
classes.  User-defined access methods need not use this page format.
</para>

<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits.  In addition, the term
<firstterm>item</firstterm>
refers to data which is stored in <productname>Postgres</productname> classes.
</para>

<sect1>
<title>Page Structure</title>

<para>
The following table shows how pages in both normal <productname>Postgres</productname> classes and
<productname>Postgres</productname> index
classes (e.g., a B-tree index) are structured.

<table tocentry="1">
<title>Sample Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="1">
<thead>
<row>
<entry>
Item
</entry>
<entry>
Description
</entry>
</row>
</thead>

<tbody>

<row>
<entry>
itemPointerData
</entry>
</row>

<row>
<entry>
filler
</entry>
</row>

<row>
<entry>
itemData...
</entry>
</row>

<row>
<entry>
Unallocated Space
</entry>
</row>

<row>
<entry>
ItemContinuationData
</entry>
</row>

<row>
<entry>
Special Space
</entry>
</row>

<row>
<entry>
``ItemData 2''
</entry>
</row>

<row>
<entry>
``ItemData 1''
</entry>
</row>

<row>
<entry>
ItemIdData
</entry>
</row>

<row>
<entry>
PageHeaderData
</entry>
</row>

</tbody>
</tgroup>
</table>
</para>

<!--
.\" Running
.\" .q .../bin/dumpbpages
.\" or
.\" .q .../src/support/dumpbpages
.\" as the postgres superuser
.\" with the file paths associated with
.\" (heap or B-tree index) classes,
.\" .q .../data/base/<database-name>/<class-name>,
.\" will display the page structure used by the classes.
.\" Specifying the
.\" .q -r
.\" flag will cause the classes to be
.\" treated as heap classes and for more information to be displayed.
-->

<para>
The first 8 bytes of each page consist of a page header
(PageHeaderData).
Within the header, the first three 2-byte integer fields
(<firstterm>lower</firstterm>,
<firstterm>upper</firstterm>,
and
<firstterm>special</firstterm>)
represent byte offsets to the start of unallocated space, to the end
of unallocated space, and to the start of <firstterm>special space</firstterm>.
Special space is a region at the end of the page which is allocated at
page initialization time and which contains information specific to an
access method.  The last 2 bytes of the page header,
<firstterm>opaque</firstterm>,
encode the page size and information on the internal fragmentation of
the page.  Page size is stored in each page because frames in the
buffer pool may be subdivided into equal sized pages on a frame by
frame basis within a class.  The internal fragmentation information is
used to aid in determining when page reorganization should occur.
</para>

<para>
Following the page header are item identifiers
(<firstterm>ItemIdData</firstterm>).
New item identifiers are allocated from the first four bytes of
unallocated space.  Because an item identifier is never moved until it
is freed, its index may be used to indicate the location of an item on
a page.  In fact, every pointer to an item
(<firstterm>ItemPointer</firstterm>)
created by <productname>Postgres</productname> consists of a frame number and an index of an item
identifier.  An item identifier contains a byte-offset to the start of
an item, its length in bytes, and a set of attribute bits which affect
its interpretation.
</para>

<para>
The items themselves are stored in space allocated backwards from
the end of unallocated space.  Usually, the items are not interpreted.
However, when the item is too long to be placed on a single page or
when fragmentation of the item is desired, the item is divided and
each piece is handled as a distinct item in the following manner.  The
first through the next to last piece are placed in an item
continuation structure
(<firstterm>ItemContinuationData</firstterm>).
This structure contains
itemPointerData
which points to the next piece and the piece itself.  The last piece
is handled normally.
</para>
</sect1>

<sect1>
<title>Files</title>

<para>
<variablelist>
<varlistentry>
<term>
<filename>data/</filename>
</term>
<listitem>
<para>
Location of shared (global) database files.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>
<filename>data/base/</filename>
</term>
<listitem>
<para>
Location of local database files.
</para>
</listitem>
</varlistentry>

</variablelist>
</para>
</sect1>

<sect1>
<title>Bugs</title>

<para>
The page format may change in the future to provide more efficient
access to large objects.
</para>

<para>
This section contains insufficient detail to be of any assistance in
writing a new access method.
</para>
</sect1>
</chapter>