Thread: Memory leak in XLOG reader facility when decoding records (caused by WAL refactoring)

Hi all,

While running valgrind on pg_rewind, I have noticed that each new call
to XLogReadRecord leaks XLogReaderState->main_data and
XLogReaderState->blocks[block_id].data because each one of them is
palloc'd to store respectively the main data of the record and the
data attached to each block. I think that the most correct fix to
prevent the leak is to free those pointers when calling ResetDecoder(),
which is used to reset a reader state between two records.
==46805== 154 bytes in 1 blocks are definitely lost in loss record 103 of 167
==46805==    at 0x6B1B: malloc (in
/usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==46805==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
==46805==    by 0x10000DE79: palloc (fe_memutils.c:118)
==46805==    by 0x1000052A0: DecodeXLogRecord (xlogreader.c:1203)
==46805==    by 0x100003AA9: XLogReadRecord (xlogreader.c:461)
==46805==    by 0x1000022F0: extractPageMap (parsexlog.c:78)
==46805==    by 0x10000188C: main (pg_rewind.c:280)
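
To make the proposal concrete, here is a rough sketch of the kind of
extra cleanup ResetDecoder() would do (illustration only -- block_id
iterates over the blocks used by the last record; the attached patch is
what I actually tested):

        int             block_id;

        /*
         * Sketch only: free the per-record buffers when the decoder state
         * is reset, so that nothing allocated for one record outlives it.
         */
        if (state->main_data)
        {
                pfree(state->main_data);
                state->main_data = NULL;
        }
        for (block_id = 0; block_id <= state->max_block_id; block_id++)
        {
                /* data attached to this block, palloc'd by DecodeXLogRecord */
                if (state->blocks[block_id].data)
                {
                        pfree(state->blocks[block_id].data);
                        state->blocks[block_id].data = NULL;
                }
        }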

This leak is rather bad, particularly when a node is in recovery: the
recovering node can fail with an OOM after replaying a certain number
of records, and it is even worse for a standby node, which is meant to
replay records indefinitely.

Attached is a patch fixing the leak; re-running valgrind confirms that
the leak is gone with it applied. I am adding an entry to the open item
list for 9.5 as well.
Regards,
--
Michael

On 2015-07-23 22:50:12 +0900, Michael Paquier wrote:
>
> While running valgrind on pg_rewind, I have noticed that each new call
> to XLogReadRecord leaks XLogReaderState->main_data and
> XLogReaderState->blocks[block_id].data because each one of them is
> palloc'd to store respectively the main data of the record and the
> data attached to each block. I think that the most correct fix to
> prevent the leak is to free those pointers when calling ResetDecoder()
> which is used to reset a reader state between two records.

I don't think that'd be a good fix. Isn't the idea here that we only
allocate new buffers when the previous one is too small? Your fix will
greatly increase the memory management overhead.

> ==46805== 154 bytes in 1 blocks are definitely lost in loss record 103 of 167
> ==46805==    at 0x6B1B: malloc (in
> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
> ==46805==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
> ==46805==    by 0x10000DE79: palloc (fe_memutils.c:118)
> ==46805==    by 0x1000052A0: DecodeXLogRecord (xlogreader.c:1203)
> ==46805==    by 0x100003AA9: XLogReadRecord (xlogreader.c:461)
> ==46805==    by 0x1000022F0: extractPageMap (parsexlog.c:78)
> ==46805==    by 0x10000188C: main (pg_rewind.c:280)

Uh. This is the leak?
        if (state->main_data_len > 0)
        {
                if (!state->main_data || state->main_data_len > state->main_data_bufsz)
                {
                        if (state->main_data)
                                pfree(state->main_data);
                        state->main_data_bufsz = state->main_data_len;
                        state->main_data = palloc(state->main_data_bufsz); <--- here
                }
                memcpy(state->main_data, ptr, state->main_data_len);
                ptr += state->main_data_len;
        }

Where/When are we leaking memory here? The previously used memory is
freed before we allocate a larger buffer.

Greetings,

Andres Freund


On Mon, Jul 27, 2015 at 12:32 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-23 22:50:12 +0900, Michael Paquier wrote:
>>
>> While running valgrind on pg_rewind, I have noticed that each new call
>> to XLogReadRecord leaks XLogReaderState->main_data and
>> XLogReaderState->blocks[block_id].data because each one of them is
>> palloc'd to store respectively the main data of the record and the
>> data attached to each block. I think that the most correct fix to
>> prevent the leak is to free those pointers when calling ResetDecoder()
>> which is used to reset a reader state between two records.
>
> I don't think that'd be a good fix. Isn't the idea here that we only
> allocate new buffers when the previous one is too small? Your fix will
> greatly increase the memory management overhead.

OK.

>> ==46805== 154 bytes in 1 blocks are definitely lost in loss record 103 of 167
>> ==46805==    at 0x6B1B: malloc (in
>> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
>> ==46805==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
>> ==46805==    by 0x10000DE79: palloc (fe_memutils.c:118)
>> ==46805==    by 0x1000052A0: DecodeXLogRecord (xlogreader.c:1203)
>> ==46805==    by 0x100003AA9: XLogReadRecord (xlogreader.c:461)
>> ==46805==    by 0x1000022F0: extractPageMap (parsexlog.c:78)
>> ==46805==    by 0x10000188C: main (pg_rewind.c:280)
>
> Uh. This is the leak?
>         if (state->main_data_len > 0)
>         {
>                 if (!state->main_data || state->main_data_len > state->main_data_bufsz)
>                 {
>                         if (state->main_data)
>                                 pfree(state->main_data);
>                         state->main_data_bufsz = state->main_data_len;
>                         state->main_data = palloc(state->main_data_bufsz); <--- here
>                 }
>                 memcpy(state->main_data, ptr, state->main_data_len);
>                 ptr += state->main_data_len;
>         }
>
> Where/When are we leaking memory here? The previously used memory is
> freed before we allocate a larger buffer.

Yep, it is here, but it is actually not related to the calls of
XLogReadRecord. The problem is in XLogReaderFree, which relies on
max_block_id to clean up the data blocks when freeing the XLOG reader,
while max_block_id gets reinitialized to -1 between two records by
ResetDecoder; it should use XLR_MAX_BLOCK_ID instead (note: looking at
memory leaks at night can lead to incorrect conclusions). So, for
example, if a first record uses some data blocks and allocates buffers
for them, and a second record then uses no data blocks, we leak those
buffers. That is less critical than I thought, as the pair
XLogReaderAllocate/XLogReaderFree is called only once in StartupXLOG,
but it is still a leak.
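
To spell the sequence out (a simplified illustration; the cleanup loop
is paraphrased, not quoted from xlogreader.c):

        /*
         * record #1 carries block data:
         *     DecodeXLogRecord()  -> palloc's state->blocks[0].data
         * record #2 carries no block data:
         *     ResetDecoder()      -> sets state->max_block_id back to -1,
         *                            blocks[0].data stays allocated
         * shutdown:
         *     XLogReaderFree()    -> only walks blocks up to max_block_id,
         *                            i.e. for (i = 0; i <= -1; i++) never
         *                            runs, so blocks[0].data is leaked
         */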

Attached are a self-contained script that shows the problem, if anybody
wants to check for themselves (the other leaks related to pg_rewind are
addressed in another thread), and a patch that takes care of it.
Regards,
--
Michael
On 07/27/2015 04:40 AM, Michael Paquier wrote:
> On Mon, Jul 27, 2015 at 12:32 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-07-23 22:50:12 +0900, Michael Paquier wrote:
>>>
>>> While running valgrind on pg_rewind, I have noticed that each new call
>>> to XLogReadRecord leaks XLogReaderState->main_data and
>>> XLogReaderState->blocks[block_id].data because each one of them is
>>> palloc'd to store respectively the main data of the record and the
>>> data attached to each block. I think that the most correct fix to
>>> prevent the leak is to free those pointers when calling ResetDecoder()
>>> which is used to reset a reader state between two records.
>>
>> I don't think that'd be a good fix. Isn't the idea here that we only
>> allocate new buffers when the previous one is too small? Your fix will
>> greatly increase the memory management overhead.
>
> OK.
>
>>> ==46805== 154 bytes in 1 blocks are definitely lost in loss record 103 of 167
>>> ==46805==    at 0x6B1B: malloc (in
>>> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
>>> ==46805==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
>>> ==46805==    by 0x10000DE79: palloc (fe_memutils.c:118)
>>> ==46805==    by 0x1000052A0: DecodeXLogRecord (xlogreader.c:1203)
>>> ==46805==    by 0x100003AA9: XLogReadRecord (xlogreader.c:461)
>>> ==46805==    by 0x1000022F0: extractPageMap (parsexlog.c:78)
>>> ==46805==    by 0x10000188C: main (pg_rewind.c:280)
>>
>> Uh. This is the leak?
>>          if (state->main_data_len > 0)
>>          {
>>                  if (!state->main_data || state->main_data_len > state->main_data_bufsz)
>>                  {
>>                          if (state->main_data)
>>                                  pfree(state->main_data);
>>                          state->main_data_bufsz = state->main_data_len;
>>                          state->main_data = palloc(state->main_data_bufsz); <--- here
>>                  }
>>                  memcpy(state->main_data, ptr, state->main_data_len);
>>                  ptr += state->main_data_len;
>>          }
>>
>> Where/When are we leaking memory here? The previously used memory is
>> freed before we allocate a larger buffer.
>
> Yep, it is here, but it is actually not related to the calls of
> XLogReadRecord. The problem is in XLogReaderFree, which relies on
> max_block_id to clean up the data blocks when freeing the XLOG reader,
> while max_block_id gets reinitialized to -1 between two records by
> ResetDecoder; it should use XLR_MAX_BLOCK_ID instead (note: looking at
> memory leaks at night can lead to incorrect conclusions). So, for
> example, if a first record uses some data blocks and allocates buffers
> for them, and a second record then uses no data blocks, we leak those
> buffers. That is less critical than I thought, as the pair
> XLogReaderAllocate/XLogReaderFree is called only once in StartupXLOG,
> but it is still a leak.
>
> Attached are a self-contained script that shows the problem, if anybody
> wants to check for themselves (the other leaks related to pg_rewind are
> addressed in another thread), and a patch that takes care of it.

Thanks, committed. The leak was not actually in the "main_data"
handling, as quoted in the above snippet, but in the per-block buffers.
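
For reference, the per-block handling in DecodeXLogRecord() follows the
same grow-only shape as the main_data snippet quoted above, roughly like
this (paraphrased sketch, not a verbatim quote of xlogreader.c):

        if (blk->has_data)
        {
                if (!blk->data || blk->data_len > blk->data_bufsz)
                {
                        /* grow-only: the old buffer is pfree'd only when
                         * a larger one is needed */
                        if (blk->data)
                                pfree(blk->data);
                        blk->data_bufsz = blk->data_len;
                        blk->data = palloc(blk->data_bufsz);
                }
                memcpy(blk->data, ptr, blk->data_len);
                ptr += blk->data_len;
        }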

- Heikki
On Tue, Jul 28, 2015 at 2:32 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/27/2015 04:40 AM, Michael Paquier wrote:
>>
>> On Mon, Jul 27, 2015 at 12:32 AM, Andres Freund <andres@anarazel.de>
>> wrote:
>>>
>>> On 2015-07-23 22:50:12 +0900, Michael Paquier wrote:
>>>>
>>>>
>>>> While running valgrind on pg_rewind, I have noticed that each new call
>>>> to XLogReadRecord leaks XLogReaderState->main_data and
>>>> XLogReaderState->blocks[block_id].data because each one of them is
>>>> palloc'd to store respectively the main data of the record and the
>>>> data attached to each block. I think that the most correct fix to
>>>> prevent the leak is to free those pointers when calling ResetDecoder()
>>>> which is used to reset a reader state between two records.
>>>
>>>
>>> I don't think that'd be a good fix. Isn't the idea here that we only
>>> allocate new buffers when the previous one is too small? Your fix will
>>> greatly increase the memory management overhead.
>>
>>
>> OK.
>>
>>>> ==46805== 154 bytes in 1 blocks are definitely lost in loss record 103 of 167
>>>> ==46805==    at 0x6B1B: malloc (in
>>>> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
>>>> ==46805==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
>>>> ==46805==    by 0x10000DE79: palloc (fe_memutils.c:118)
>>>> ==46805==    by 0x1000052A0: DecodeXLogRecord (xlogreader.c:1203)
>>>> ==46805==    by 0x100003AA9: XLogReadRecord (xlogreader.c:461)
>>>> ==46805==    by 0x1000022F0: extractPageMap (parsexlog.c:78)
>>>> ==46805==    by 0x10000188C: main (pg_rewind.c:280)
>>>
>>>
>>> Uh. This is the leak?
>>>          if (state->main_data_len > 0)
>>>          {
>>>                  if (!state->main_data || state->main_data_len > state->main_data_bufsz)
>>>                  {
>>>                          if (state->main_data)
>>>                                  pfree(state->main_data);
>>>                          state->main_data_bufsz = state->main_data_len;
>>>                          state->main_data = palloc(state->main_data_bufsz); <--- here
>>>                  }
>>>                  memcpy(state->main_data, ptr, state->main_data_len);
>>>                  ptr += state->main_data_len;
>>>          }
>>>
>>> Where/When are we leaking memory here? The previously used memory is
>>> freed before we allocate a larger buffer.
>>
>>
>> Yep, it is here, but it is actually not related to the calls of
>> XLogReadRecord. The problem is in XLogReaderFree, which relies on
>> max_block_id to clean up the data blocks when freeing the XLOG reader,
>> while max_block_id gets reinitialized to -1 between two records by
>> ResetDecoder; it should use XLR_MAX_BLOCK_ID instead (note: looking at
>> memory leaks at night can lead to incorrect conclusions). So, for
>> example, if a first record uses some data blocks and allocates buffers
>> for them, and a second record then uses no data blocks, we leak those
>> buffers. That is less critical than I thought, as the pair
>> XLogReaderAllocate/XLogReaderFree is called only once in StartupXLOG,
>> but it is still a leak.
>>
>> Attached are a self-contained script that shows the problem, if anybody
>> wants to check for themselves (the other leaks related to pg_rewind are
>> addressed in another thread), and a patch that takes care of it.
>
>
> Thanks, committed. The leak was not actually in the "main_data" handling, as
> quoted in the above snippet, but in the per-block buffers.

The patch pushed does not take care of the leak (tested on 3d5cb31c).
With the test script I sent upthread, I am still seeing it:
==32015== 154 bytes in 1 blocks are definitely lost in loss record 103 of 155
==32015==    at 0x6B1B: malloc (in
/usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==32015==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
==32015==    by 0x10000DE79: palloc (fe_memutils.c:118)
==32015==    by 0x100005270: DecodeXLogRecord (xlogreader.c:1200)
==32015==    by 0x100003A79: XLogReadRecord (xlogreader.c:458)
==32015==    by 0x100002AF0: findLastCheckpoint (parsexlog.c:186)
==32015==    by 0x1000017E1: main (pg_rewind.c:257)
==32015==
==32015== 154 bytes in 1 blocks are definitely lost in loss record 104 of 155
==32015==    at 0x6B1B: malloc (in
/usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==32015==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
==32015==    by 0x10000DE79: palloc (fe_memutils.c:118)
==32015==    by 0x100005270: DecodeXLogRecord (xlogreader.c:1200)
==32015==    by 0x100003A79: XLogReadRecord (xlogreader.c:458)
==32015==    by 0x1000022E0: extractPageMap (parsexlog.c:78)
==32015==    by 0x10000187C: main (pg_rewind.c:280)

The problem is that state->max_block_id gets reset to -1 each time a
new record is read via ResetDecoder. Hence, if we decode a record that
has block data and then decode a second record that does not, we will
leak when XLogReaderFree is called, because no attempt will be made to
pfree what has been allocated. The attached patch addresses that.
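
In short, the cleanup in XLogReaderFree needs to walk the whole blocks[]
array rather than stopping at max_block_id; something along these lines
(a sketch of the direction only, the attached patch is the tested one):

        int             block_id;

        /* Free every block buffer ever allocated, regardless of how many
         * blocks the last decoded record happened to use. */
        for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
        {
                if (state->blocks[block_id].data)
                        pfree(state->blocks[block_id].data);
        }
        if (state->main_data)
                pfree(state->main_data);
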
Regards,
--
Michael

On 07/28/2015 02:54 AM, Michael Paquier wrote:
> On Tue, Jul 28, 2015 at 2:32 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Thanks, committed. The leak was not actually in the "main_data" handling, as
>> quoted in the above snippet, but in the per-block buffers.
>
> The patch pushed does not take care of the leak (tested on 3d5cb31c).
> With the test script I sent upthread, I am still seeing it:
> ==32015== 154 bytes in 1 blocks are definitely lost in loss record 103 of 155
> ==32015==    at 0x6B1B: malloc (in
> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
> ==32015==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
> ==32015==    by 0x10000DE79: palloc (fe_memutils.c:118)
> ==32015==    by 0x100005270: DecodeXLogRecord (xlogreader.c:1200)
> ==32015==    by 0x100003A79: XLogReadRecord (xlogreader.c:458)
> ==32015==    by 0x100002AF0: findLastCheckpoint (parsexlog.c:186)
> ==32015==    by 0x1000017E1: main (pg_rewind.c:257)
> ==32015==
> ==32015== 154 bytes in 1 blocks are definitely lost in loss record 104 of 155
> ==32015==    at 0x6B1B: malloc (in
> /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
> ==32015==    by 0x10000DB2D: pg_malloc_internal (fe_memutils.c:30)
> ==32015==    by 0x10000DE79: palloc (fe_memutils.c:118)
> ==32015==    by 0x100005270: DecodeXLogRecord (xlogreader.c:1200)
> ==32015==    by 0x100003A79: XLogReadRecord (xlogreader.c:458)
> ==32015==    by 0x1000022E0: extractPageMap (parsexlog.c:78)
> ==32015==    by 0x10000187C: main (pg_rewind.c:280)
>
> The problem is that state->max_block_id gets reset to -1 each time a
> new record is read via ResetDecoder. Hence, if we decode a record that
> has block data and then decode a second record that does not, we will
> leak when XLogReaderFree is called, because no attempt will be made to
> pfree what has been allocated. The attached patch addresses that.

Ok, committed. I hope we got it right this time..

- Heikki
On Tue, Jul 28, 2015 at 3:11 PM, Heikki Linnakangas wrote:
> On 07/28/2015 02:54 AM, Michael Paquier wrote:
>> The problem is that state->max_block_id gets reset to -1 each time a
>> new record is read via ResetDecoder. Hence, if we decode a record that
>> has block data and then decode a second record that does not, we will
>> leak when XLogReaderFree is called, because no attempt will be made to
>> pfree what has been allocated. The attached patch addresses that.
>
>
> Ok, committed. I hope we got it right this time..

Thanks! There is nothing more on the radar now, so I am closing the
issue on the open item list.
--
Michael
Could this be the cause of an issue I was facing: high disk writes
after upgrading from 9.4.3 to 9.4.4?

http://www.postgresql.org/message-id/CAK=sinUouE8XVxBc4d6bcgM5OJVy7R-B9HF_7BQZs6LR87_Ung@mail.gmail.com

On Tue, Jul 28, 2015 at 12:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

> On Tue, Jul 28, 2015 at 3:11 PM, Heikki Linnakangas wrote:
> > On 07/28/2015 02:54 AM, Michael Paquier wrote:
> >> The problem is that state->max_block_id gets reset to -1 each time a
> >> new record is read via ResetDecoder. Hence, if we decode a record that
> >> has block data and then decode a second record that does not, we will
> >> leak when XLogReaderFree is called, because no attempt will be made to
> >> pfree what has been allocated. The attached patch addresses that.
> >
> >
> > Ok, committed. I hope we got it right this time..
>
> Thanks! There is nothing more on the radar now, so I am closing the
> issue on the open item list.
> --
> Michael
On Tue, Jul 28, 2015 at 3:34 PM, Sushant Sinha <sushant@indiankanoon.com> wrote:
> Can this be a reason behind this issue that I was facing? High disk writes
> after upgrade from 9.4.3 to 9.4.4.
>
> http://www.postgresql.org/message-id/CAK=sinUouE8XVxBc4d6bcgM5OJVy7R-B9HF_7BQZs6LR87_Ung@mail.gmail.com

(Please don't top-post.)
No, that's unrelated; the bug discussed here affects 9.5 and master,
not 9.4 and older versions.
--
Michael