Re: [HACKERS] RE: [GENERAL] Transaction logging - Mailing list pgsql-hackers

From jwieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] RE: [GENERAL] Transaction logging
Date
Msg-id m10IzGK-000EBQC@orion.SAPserv.Hamburg.dsh.de
Whole thread Raw
In response to RE: [GENERAL] Transaction logging  (Michael Davis <michael.davis@prevuenet.com>)
Responses Re: [HACKERS] RE: [GENERAL] Transaction logging
List pgsql-hackers
Michael Davis wrote:

>
> What would it take to have transaction logging added to Postgres.  I am a
> c/c++ programmer and will consider contributing to the Postgres development
> effort.  I really like everything I see and read about Postgres.  As a

    I spent some time on transaction logging since it's a feature
    I'm missing too. There are mainly two  different  transaction
    log mechanisms out.

    1.  Log queries sent to the backend.

    2.  Log  images  of  inserted/updated  rows  and  row ID's of
        deleted ones.

    The query  level  logging  will  write  less  information  if
    queries usually affect a large number of rows.  Unfortunately
    the extensibility of Postgres work's against  this  approach.
    There  could  be  any  number of user written functions who's
    results  aren't  reproduceable  during  recovery.  And   core
    features of Postgres itself would introduce the same problem.
    Have a sequence which is used to create  default  values  for
    multiple  tables,  so  that one ID is unique across them. Now
    two backends insert (with  INSERT  ...  SELECT)  concurrently
    into  different  tables  using  the  same  sequence.   It's a
    classic race condition and it depends  on  context  switching
    and  page  faults  which  backend  will  get  which  sequence
    numbers. You cannot foresee and you cannot reproduce,  except
    you  hook into the sequence generator and log this too. Later
    when recovering, another hook  into  the  sequence  generator
    must    reproduce    the    logged   results   on   the   per
    backend/transaction/command base, and the same must  be  done
    for   each  function  that  usually  returns  unreproduceable
    results (anything dealing with time, pid's, etc.).

    As said, this must also cover user  functions.  So  at  least
    there  must  be  a  general  log  API  that  provides  such a
    functionality for user written functions.

    The image logging approach also has problems. First, the only
    thing  given to the heap access methods to outdate a tuple on
    update/delete is the current tuple ID (information that tells
    which  tuple  in  which block is meant).  So you need to save
    the database files  in  binary  format,  because  during  the
    actually  existing  dump/restore  this  could  change and the
    logged CTID's would hit the wrong tuples.

    Second, you must remember in the  log  which  transaction  ID
    these  informations  came  from  and later if the transaction
    committed or not, so the recovery can set  this  commit/abort
    information in pg_log too. pg_log is a shared system file and
    the transaction ID's are unique only for one  server.   Using
    this  information for online replication of a single database
    to another Postgres installation will not work.

    Third, there are still some shared system catalogs across all
    databases  (pg_database,  pg_group,  pg_log!!!, pg_shadow and
    pg_variable). Due to that it would be impossible (or at least
    very,  very  tricky) to restore/recover (maybe point in time)
    one single database. If you destroy one database and  restore
    it  from  the  binary backup, these shared catalogs cannot be
    restored too, so they're out of sync with  the  backup  time.
    How  should  the  recovery  now  hit  the right things (which
    probably must not be there at all)?.

    All this is really  a  mess.  I  think  the  architecture  of
    Postgres  will  only allow something on query level with some
    general API for things that must reproduce  the  same  result
    during  recovery.  For  example  time().  Inside the backend,
    time() should  never  be  called  directly.  Instead  another
    function  is  to be called that log's during normal operation
    which time get's returned by this  particular  function  call
    and  if  the  backend  is in recovery mode, returns the value
    from the log.

    And again, this all means trouble. Usually, most queries sent
    to  the  database  don't  change  any  data  because they are
    SELECT's. It would dramatically blow up the log amount if you
    log ALL queries instead of only those that modify things. But
    when the query begins, you don't know this, because a  SELECT
    might call a function that uses SPI to UPDATE something else.
    So the decision if the query must be logged or not  can  only
    be  made  when  the  query  is  done  (by  having some global
    variable where the  heap  access  methods  set  a  flag  that
    something  got written).  Now you have to log function call's
    like time() even if the query will not modify any single  row
    in the database because the query is a

    SELECT 'now'::datetime - updtime FROM ...

    Doing  this on a table with thousands of rows will definitely
    waste much logging space and  slowdown  the  whole  thing  by
    unnecessary logging.

    Maybe  it's  a  compromise  if at each query start the actual
    time and other such information is remembered by the backend,
    all  time() calls return this remembered value instead of the
    real one (wouldn't be bad anyway IMHO), and this  information
    is logged only if the query is to be logged.

    Finally  I  think  I must have missed some more problems, but
    aren't these enough already to frustrate you :-?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

pgsql-hackers by date:

Previous
From: "Jackson, DeJuan"
Date:
Subject: PL/PGSQL
Next
From: "Jackson, DeJuan"
Date:
Subject: RE: [HACKERS] PL/PGSQL