"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> Really?! Once again: WAL records give you the *physical* addresses of
> the tuples (both heap and index ones!) to be removed, and the size of
> the log you'd read the records from is not comparable with the size of
> the data files.
You sure? With our current approach of dumping data pages into the WAL
on first change since checkpoint (and doing so again after each
checkpoint) it's not too difficult to devise scenarios where the WAL log
is *larger* than the affected datafiles ... and can't be truncated until
someone commits.
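
To be concrete, the rule is roughly this (a sketch only; the names
below are made up for illustration, not the real xlog.c symbols):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;   /* byte position in the WAL stream */

    /*
     * Illustrative only: a page whose LSN predates the latest
     * checkpoint's redo pointer hasn't been logged since that
     * checkpoint, so its whole 8K image goes into the WAL ahead of
     * the tuple-level record; later changes to the same page (until
     * the next checkpoint) log only the small record.
     */
    static bool
    need_full_page_image(XLogRecPtr page_lsn, XLogRecPtr checkpoint_redo)
    {
        return page_lsn <= checkpoint_redo;
    }

Since every checkpoint resets the "already copied" state, a long
transaction that keeps touching fresh pages across several checkpoints
can easily log more page images than the table has data pages.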
The copied-data-page traffic is the worst problem with our current
WAL implementation. I did some measurements last week on VACUUM of a
test table (the accounts table from a "pg_bench -s 10" setup, which
contains 1000000 rows; I updated 20000 rows and then vacuumed). This
generated about 34400 8k blocks of WAL traffic, of which about 33300
represented copied pages and the other 1100 blocks were actual WAL
entries. That's a pretty massive I/O overhead, considering the table
itself was under 20000 8k blocks. It was also interesting to note that
a large fraction of the CPU time was spent calculating CRCs on the WAL
data.
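
Spelling out the arithmetic:

    34400 WAL blocks  * 8K ~= 269 MB written to the log
    20000 heap blocks * 8K ~= 156 MB in the table itself
    33300 / 34400          ~= 97% of the WAL volume was copied pages

The CRC cost tracks the same ratio, since the CRC is computed over
every byte that enters the log.  Schematically (a plain bitwise CRC-64
loop; the polynomial here is for illustration only):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Cost is linear in bytes logged, so 8K page images dominate the
     * CPU bill just as they dominate the I/O.
     */
    static uint64_t
    wal_crc64(const unsigned char *buf, size_t len)
    {
        uint64_t crc = ~(uint64_t) 0;

        while (len-- > 0)
        {
            crc ^= (uint64_t) *buf++ << 56;
            for (int i = 0; i < 8; i++)
                crc = (crc & ((uint64_t) 1 << 63)) ?
                    (crc << 1) ^ UINT64_C(0x42F0E1EBA9EA3693) :
                    (crc << 1);
        }
        return ~crc;
    }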
Would it be possible to split the WAL traffic into two sets of files,
one for WAL log records proper and one for copied pages? Seems like
we could recycle the pages after each checkpoint rather than hanging
onto them until the associated transactions commit.
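
A possible shape for that (function names hypothetical, not existing
backend routines): page images only need to survive until the
checkpoint that follows them completes, so the page-image file set
could be recycled inside every checkpoint, while the record log keeps
its current commit-based retention:

    /* Sketch only -- none of these are real backend functions. */
    extern void FlushAllDirtyBuffers(void);
    extern void WriteCheckpointRecord(void);     /* record log */
    extern void RecyclePageImageSegments(void);  /* page-image log */

    void
    CreateCheckPoint(void)
    {
        FlushAllDirtyBuffers();      /* checkpoint proper */
        WriteCheckpointRecord();
        RecyclePageImageSegments();  /* images now obsolete */
    }

Crash recovery would have to read both file sets, applying a page
image before any records for that page, but that seems no harder than
what redo has to get right already.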
>> Why not? Seems to me that establishing implicit savepoints is just a
>> user-interface issue; you can do it, or not do it, regardless of the
>> underlying mechanism.
> Implicit savepoints are set by the server automatically before each
> query execution - you wouldn't use transaction IDs for this.
If the user asked you to, I don't see why not.
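
Something along these lines, purely as a sketch (the hooks below are
invented, not real backend entry points): if the client asks for the
behavior, the server wraps each statement itself, and whether the
savepoint is implemented with per-statement transaction IDs or some
other mechanism is invisible at this level:

    /* Sketch only; hook names are hypothetical. */
    extern int  ExecuteStatement(const char *query);  /* 0 on error */
    extern void SetImplicitSavepoint(void);
    extern void ReleaseImplicitSavepoint(void);
    extern void RollbackToImplicitSavepoint(void);

    void
    exec_one_statement(const char *query)
    {
        SetImplicitSavepoint();
        if (ExecuteStatement(query))
            ReleaseImplicitSavepoint();
        else
            RollbackToImplicitSavepoint();  /* undo just this query */
    }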
regards, tom lane