Review for GetWALAvailability() - Mailing list pgsql-hackers

From Fujii Masao
Subject Review for GetWALAvailability()
Date
Msg-id 8d55969f-ba37-b89a-6494-e9322ccdb35d@oss.nttdata.com
Whole thread Raw
Responses Re: Review for GetWALAvailability()
List pgsql-hackers
Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

     some WAL files are definitely lost and this slot cannot be used to resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


    keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                              wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of them.
The above should be the following?

     Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), wal_keep_segments) + 1



        if ((max_slot_wal_keep_size_mb <= 0 ||
             max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
            oldestSegMaxWalSize <= targetSeg)
            return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
  means that the claimed files are within max_wal_size". So whatever
  max_slot_wal_keep_size value is, IMO that "normal" should be
  reported if the WAL files claimed by the slot are within max_wal_size.
  Thought?

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.



If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?



BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

pgsql-hackers by date:

Previous
From: Emre Hasegeli
Date:
Subject: Re: exp() versus the POSIX standard
Next
From: Kenichiro Tanaka
Date:
Subject: Re: Wrong width of UNION statement