Hi,
There have been a couple of LZ4-related patches recently, so I spent some time playing with the LZ4 path and found a
bugin astreamer_lz4_decompressor_content().
Looking at the code snippet (omitting irrelevant code):
```
ret = LZ4F_decompress(mystreamer->dctx,
next_out, &out_size,
next_in, &read_size, NULL);
mystreamer->bytes_written += out_size; // <== bumped bytes_written already
/*
* If output buffer is full then forward the content to next streamer
* and update the output buffer.
*/
if (mystreamer->bytes_written >= mystreamer->base.bbs_buffer.maxlen)
{
astreamer_content(mystreamer->base.bbs_next, member,
mystreamer->base.bbs_buffer.data,
mystreamer->base.bbs_buffer.maxlen,
context);
avail_out = mystreamer->base.bbs_buffer.maxlen;
mystreamer->bytes_written = 0;
next_out = (uint8 *) mystreamer->base.bbs_buffer.data;
}
else
{
avail_out = mystreamer->base.bbs_buffer.maxlen - mystreamer->bytes_written;
next_out += mystreamer->bytes_written; // <== The bug is there
}
```
To advance next_out, the code uses mystreamer->bytes_written. However, bytes_written has already been increased by
out_sizein the current iteration. As a result, next_out is advanced by the cumulative number of bytes written so far,
insteadof just the number of bytes produced in this iteration. Effectively, the pointer movement is double-counting the
previousprogress.
When I tried to design a test case to trigger this bug, I found it is actually not easy to hit in normal execution.
Tracinginto the function, I found that the default output buffer size is 1024 bytes, and in practice LZ4F_decompress()
tendsto fill the output buffer in one or two iterations. As a result, the problematic else branch is either not
reached,or reached in a case where bytes_written == out_size, so the incorrect pointer increment does not manifest.
To reliably trigger the bug, I used a small hack: instead of letting LZ4F_decompress() use the full available out_size,
Iartificially limited out_size before the call, forcing LZ4F_decompress() to require one more iteration to fill the
buffer.See the attached nocfbot_hack.diff for the hack.
With that hack in place, the bug can be reproduced using the following procedure:
1. initdb
2 Set "wal_level = replica” in postgreSQl.conf
3. Restart the instance
4. Create a database
5. Generate some WAL logs by psql
```
CREATE TABLE t AS SELECT generate_series(1, 100000) AS id;
CHECKPOINT;
```
6. Create a backup
```
% rm -rf /tmp/bkup_lz4
% pg_basebackup -D /tmp/bkup_lz4 -F t -Z lz4 -X stream -c fast
```
7. Verify the backup
```
% pg_verifybackup -F t -n /tmp/bkup_lz4
pg_verifybackup: error: zsh: trace trap pg_verifybackup -F t -n /tmp/bkup_lz4
```
With the fix applied (plus the hack), step 7 succeeds:
```
% pg_verifybackup -F t -n /tmp/bkup_lz4
backup successfully verified
```
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/