Re: [HACKERS] RE: [GENERAL] Transaction logging - Mailing list pgsql-hackers
From | jwieck@debis.com (Jan Wieck) |
---|---|
Subject | Re: [HACKERS] RE: [GENERAL] Transaction logging |
Date | |
Msg-id | m10IzGK-000EBQC@orion.SAPserv.Hamburg.dsh.de Whole thread Raw |
In response to | RE: [GENERAL] Transaction logging (Michael Davis <michael.davis@prevuenet.com>) |
Responses |
Re: [HACKERS] RE: [GENERAL] Transaction logging
|
List | pgsql-hackers |
Michael Davis wrote: > > What would it take to have transaction logging added to Postgres. I am a > c/c++ programmer and will consider contributing to the Postgres development > effort. I really like everything I see and read about Postgres. As a I spent some time on transaction logging since it's a feature I'm missing too. There are mainly two different transaction log mechanisms out. 1. Log queries sent to the backend. 2. Log images of inserted/updated rows and row ID's of deleted ones. The query level logging will write less information if queries usually affect a large number of rows. Unfortunately the extensibility of Postgres work's against this approach. There could be any number of user written functions who's results aren't reproduceable during recovery. And core features of Postgres itself would introduce the same problem. Have a sequence which is used to create default values for multiple tables, so that one ID is unique across them. Now two backends insert (with INSERT ... SELECT) concurrently into different tables using the same sequence. It's a classic race condition and it depends on context switching and page faults which backend will get which sequence numbers. You cannot foresee and you cannot reproduce, except you hook into the sequence generator and log this too. Later when recovering, another hook into the sequence generator must reproduce the logged results on the per backend/transaction/command base, and the same must be done for each function that usually returns unreproduceable results (anything dealing with time, pid's, etc.). As said, this must also cover user functions. So at least there must be a general log API that provides such a functionality for user written functions. The image logging approach also has problems. First, the only thing given to the heap access methods to outdate a tuple on update/delete is the current tuple ID (information that tells which tuple in which block is meant). So you need to save the database files in binary format, because during the actually existing dump/restore this could change and the logged CTID's would hit the wrong tuples. Second, you must remember in the log which transaction ID these informations came from and later if the transaction committed or not, so the recovery can set this commit/abort information in pg_log too. pg_log is a shared system file and the transaction ID's are unique only for one server. Using this information for online replication of a single database to another Postgres installation will not work. Third, there are still some shared system catalogs across all databases (pg_database, pg_group, pg_log!!!, pg_shadow and pg_variable). Due to that it would be impossible (or at least very, very tricky) to restore/recover (maybe point in time) one single database. If you destroy one database and restore it from the binary backup, these shared catalogs cannot be restored too, so they're out of sync with the backup time. How should the recovery now hit the right things (which probably must not be there at all)?. All this is really a mess. I think the architecture of Postgres will only allow something on query level with some general API for things that must reproduce the same result during recovery. For example time(). Inside the backend, time() should never be called directly. Instead another function is to be called that log's during normal operation which time get's returned by this particular function call and if the backend is in recovery mode, returns the value from the log. And again, this all means trouble. Usually, most queries sent to the database don't change any data because they are SELECT's. It would dramatically blow up the log amount if you log ALL queries instead of only those that modify things. But when the query begins, you don't know this, because a SELECT might call a function that uses SPI to UPDATE something else. So the decision if the query must be logged or not can only be made when the query is done (by having some global variable where the heap access methods set a flag that something got written). Now you have to log function call's like time() even if the query will not modify any single row in the database because the query is a SELECT 'now'::datetime - updtime FROM ... Doing this on a table with thousands of rows will definitely waste much logging space and slowdown the whole thing by unnecessary logging. Maybe it's a compromise if at each query start the actual time and other such information is remembered by the backend, all time() calls return this remembered value instead of the real one (wouldn't be bad anyway IMHO), and this information is logged only if the query is to be logged. Finally I think I must have missed some more problems, but aren't these enough already to frustrate you :-? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
pgsql-hackers by date: