Exposing the Xact commit order to the user - Mailing list pgsql-hackers
From | Jan Wieck |
---|---|
Subject | Exposing the Xact commit order to the user |
Date | |
Msg-id | 4BF98E66.9000703@Yahoo.com Whole thread Raw |
Responses |
Re: Exposing the Xact commit order to the user
Re: Exposing the Xact commit order to the user Re: Exposing the Xact commit order to the user Re: Exposing the Xact commit order to the user Re: Exposing the Xact commit order to the user Re: Exposing the Xact commit order to the user |
List | pgsql-hackers |
In some systems (data warehousing, replication), the order of commits is important, since that is the order in which changes have become visible. This information could theoretically be extracted from the WAL, but scanning the entire WAL just to extract this tidbit of information would be excruciatingly painful. The following is based on ideas that emerged during last weeks PGCon. Consider it an implementation proposal, if you like. We introduce a new set of files. The files represent segments of an infinite array of structures. The present segments are the available "window" of data. Similar to CLOG files, the individual file name will represent the high bits of a "serial" number, the offset of the record inside the file represents the low bits of the "serial". The system will have postgresql.conf options for enabling/disabling the whole shebang, how many shared buffers to allocate for managing access to the data and to define the retention period of the data based on data volume and/or age of the commit records. Each record of the Transaction Commit Info consists of txid xci_transaction_id timestamptz xci_begin_timestamp timestamptz xci_commit_timestamp int64 xci_total_rowcount 32 bytes total. CommitTransaction() inside of xact.c will call a function, that inserts a new record into this array. The operation will for most of the time be nothing than taking a spinlock and adding the record to shared memory. All the data for the record is readily available, does not require further locking and can be collected locally before taking the spinlock. The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the commit_timestamp is what CommitTransaction() just decided to write into the WAL commit record and the total_rowcount is the sum of inserted, updated and deleted heap tuples during the transaction, which should be easily available from the statistics collector, unless row stats are disabled, in which case the datum would be zero. The function will return the "sequence" number which CommitTransaction() in turn will record in the WAL commit record together with the begin_timestamp. While both, the begin as well as the commit timestamp are crucial to determine what data a particular transaction should have seen, the row count is not and will not be recorded in WAL. Checkpoint handling will call a function to flush the shared buffers. Together with this, the information from WAL records will be sufficient to recover this data (except for row counts) during crash recovery. Exposing the data will be done via a set returning function. The SRF takes two arguments. The maximum number of rows to return and the last serial number processed by the reader. The advantage of such SRF is that the result can be used in a query that right away delivers audit or replication log information in transaction commit order. The SRF can return an empty set if no further transactions have committed since, or an error if data segments needed to answer the request have already been purged. Purging of the data will be possible in several different ways. Autovacuum will call a function that drops segments of the data that are outside the postgresql.conf configuration withrespect to maximum age or data volume. There will also be a function reserved for superusers to explicitly purge the data up to a certain serial number. Comments, suggestions? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
pgsql-hackers by date: