Thread: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15217
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 10.4
Operating system:   Ubuntu 16.04
Description:

Hello,

When running PostgreSQL (REL_10_STABLE) under valgrind I am getting the
following error messages sporadically:

==00:03:26:22.952 1623== Syscall param write(buf) points to uninitialised
byte(s)
==00:03:26:22.952 1623==    at 0x4E4A4A0: __write_nocancel
(syscall-template.S:84)
==00:03:26:22.952 1623==    by 0x72F0EF: FileWrite (fd.c:1773)
==00:03:26:22.952 1623==    by 0x7310D1: BufFileDumpBuffer (buffile.c:324)
==00:03:26:22.952 1623==    by 0x7313F1: BufFileFlush (buffile.c:466)
==00:03:26:22.952 1623==    by 0x7313F1: BufFileRead (buffile.c:372)
==00:03:26:22.952 1623==    by 0x894958: ltsReadBlock (logtape.c:248)
==00:03:26:22.952 1623==    by 0x894A34: ltsReadFillBuffer (logtape.c:275)
==00:03:26:22.952 1623==    by 0x89CB9F: mergeruns (tuplesort.c:2683)
==00:03:26:22.952 1623==    by 0x89CB9F: tuplesort_performsort
(tuplesort.c:1823)
==00:03:26:22.952 1623==    by 0x6414C2: ExecSort (nodeSort.c:117)
==00:03:26:22.952 1623==    by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623==    by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623==    by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623==    by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623==    by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623==  Address 0xfa8a27e is 6,350 bytes inside a block of
size 8,256 client-defined
==00:03:26:22.952 1623==    at 0x88FC20: palloc (mcxt.c:872)
==00:03:26:22.952 1623==    by 0x7312CF: makeBufFile (buffile.c:107)
==00:03:26:22.952 1623==    by 0x7312CF: BufFileCreateTemp (buffile.c:175)
==00:03:26:22.952 1623==    by 0x894B14: LogicalTapeSetCreate
(logtape.c:390)
==00:03:26:22.952 1623==    by 0x89B2F2: inittapes (tuplesort.c:2410)
==00:03:26:22.952 1623==    by 0x89B2F2: puttuple_common
(tuplesort.c:1622)
==00:03:26:22.952 1623==    by 0x89C4CE: tuplesort_putheaptuple
(tuplesort.c:1397)
==00:03:26:22.952 1623==    by 0x6414A0: ExecSort (nodeSort.c:111)
==00:03:26:22.952 1623==    by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623==    by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623==    by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623==    by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623==    by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623==    by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623==    by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623==    by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623==  Uninitialised value was created by a heap
allocation
==00:03:26:22.952 1623==    at 0x88FC20: palloc (mcxt.c:872)
==00:03:26:22.952 1623==    by 0x894DC2: LogicalTapeWrite (logtape.c:476)
==00:03:26:22.952 1623==    by 0x895D81: writetup_heap (tuplesort.c:3757)
==00:03:26:22.952 1623==    by 0x89AEE9: dumpbatch (tuplesort.c:3100)
==00:03:26:22.952 1623==    by 0x89AEE9: dumptuples (tuplesort.c:2972)
==00:03:26:22.952 1623==    by 0x89CA0B: tuplesort_performsort
(tuplesort.c:1822)
==00:03:26:22.952 1623==    by 0x6414C2: ExecSort (nodeSort.c:117)
==00:03:26:22.952 1623==    by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623==    by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623==    by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623==    by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623==    by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623==    by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623==    by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623==    by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623==    by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623==    by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623== 
{
   <insert_a_suppression_name_here>
   Memcheck:Param
   write(buf)
   fun:__write_nocancel
   fun:FileWrite
   fun:BufFileDumpBuffer
   fun:BufFileFlush
   fun:BufFileRead
   fun:ltsReadBlock
   fun:ltsReadFillBuffer
   fun:mergeruns
   fun:tuplesort_performsort
   fun:ExecSort
   fun:ExecProcNode
   fun:fetch_input_tuple
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:gather_merge_readnext
   fun:gather_merge_init
   fun:gather_merge_getnext
   fun:ExecGatherMerge
}

Is this something that needs further investigation (and fix) or it just
looks like a false positive (and should be added to valgrind.supp)?


Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

From
Peter Geoghegan
Date:
On Tue, May 29, 2018 at 9:02 PM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> Is this something that needs further investigation (and fix) or it just
> looks like a false positive (and should be added to valgrind.supp)?

This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

-- 
Peter Geoghegan


Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

From
Alexander Lakhin
Date:
30.05.2018 08:54, Peter Geoghegan wrote:
This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

Thanks for the answer!

The work_mem setting is 4MB. To get EXPLAIN ANALYZE is rather difficult as I get these errors while running pg_tcpds for hours, but I'll try to catch the query. I will also try to play with work_mem and to apply the aforementioned patches to REL_10_STABLE.

Best regards,

------
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

From
Alexander Lakhin
Date:
30.05.2018 09:35, Alexander Lakhin wrote:
30.05.2018 08:54, Peter Geoghegan wrote:
This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

Thanks for the answer!

The work_mem setting is 4MB. To get EXPLAIN ANALYZE is rather difficult as I get these errors while running pg_tcpds for hours, but I'll try to catch the query. I will also try to play with work_mem and to apply the aforementioned patches to REL_10_STABLE.

I've managed to reproduce this warning with pg_tpch.
The "problematic" query: https://github.com/tvondra/pg_tpch/blob/master/dss/templates/10.sql
EXPLAIN ANALYZE:
                                                                                                  QUERY PLAN                                                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=178881.96..178882.01 rows=20 width=202) (actual time=41708.427..41708.994 rows=20 loops=1)
   ->  Sort  (cost=178881.96..179020.08 rows=55249 width=202) (actual time=41706.775..41706.828 rows=20 loops=1)
         Sort Key: (sum((lineitem.l_extendedprice * ('1'::numeric - lineitem.l_discount)))) DESC
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Finalize GroupAggregate  (cost=170313.58..177411.81 rows=55249 width=202) (actual time=27586.165..40109.437 rows=37948 loops=1)
               Group Key: customer.c_custkey, nation.n_name
               ->  Gather Merge  (cost=170313.58..176260.79 rows=46040 width=202) (actual time=27570.718..33645.648 rows=37948 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate  (cost=169313.56..169946.61 rows=23020 width=202) (actual time=26845.343..34327.273 rows=12649 loops=3)
                           Group Key: customer.c_custkey, nation.n_name
                           ->  Sort  (cost=169313.56..169371.11 rows=23020 width=182) (actual time=26801.917..27406.584 rows=38258 loops=3)
                                 Sort Key: customer.c_custkey, nation.n_name
                                 Sort Method: external sort  Disk: 7016kB
                                 ->  Hash Join  (cost=35074.91..165598.19 rows=23020 width=182) (actual time=8882.471..24071.174 rows=38258 loops=3)
                                       Hash Cond: (customer.c_nationkey = nation.n_nationkey)
                                       ->  Nested Loop  (cost=35073.35..165314.18 rows=23020 width=164) (actual time=8833.355..23146.172 rows=38258 loops=3)
                                             ->  Merge Join  (cost=35072.92..42628.04 rows=23461 width=156) (actual time=8823.262..13619.941 rows=19073 loops=3)
                                                   Merge Cond: (customer.c_custkey = orders.o_custkey)
                                                   ->  Parallel Index Scan using customer_pkey on customer  (cost=0.42..6742.42 rows=62500 width=152) (actual time=13.463..1101.789 rows=50000 loops=3)
                                                   ->  Materialize  (cost=35072.48..35354.01 rows=56306 width=12) (actual time=8795.265..10510.505 rows=57105 loops=3)
                                                         ->  Sort  (cost=35072.48..35213.24 rows=56306 width=12) (actual time=8790.529..9667.322 rows=57105 loops=3)
                                                               Sort Key: orders.o_custkey
                                                               Sort Method: external merge  Disk: 1256kB
                                                               ->  Bitmap Heap Scan on orders  (cost=1197.56..29667.15 rows=56306 width=12) (actual time=273.899..6733.885 rows=57218 loops=3)
                                                                     Recheck Cond: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate < '1993-10-01 00:00:00'::timestamp without time zone))
                                                                     Rows Removed by Index Recheck: 430095
                                                                     Heap Blocks: exact=15937 lossy=8327
                                                                     ->  Bitmap Index Scan on idx_orders_orderdate  (cost=0.00..1183.49 rows=56306 width=0) (actual time=225.114..225.114 rows=57218 loops=3)
                                                                           Index Cond: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate < '1993-10-01 00:00:00'::timestamp without time zone))
                                             ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.43..5.19 rows=4 width=20) (actual time=0.372..0.448 rows=2 loops=57218)
                                                   Index Cond: (l_orderkey = orders.o_orderkey)
                                                   Filter: (l_returnflag = 'R'::bpchar)
                                                   Rows Removed by Filter: 2
                                       ->  Hash  (cost=1.25..1.25 rows=25 width=30) (actual time=31.890..31.890 rows=25 loops=3)
                                             Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                             ->  Seq Scan on nation  (cost=0.00..1.25 rows=25 width=30) (actual time=6.582..8.616 rows=25 loops=3)
 Planning time: 689.897 ms
 Execution time: 41835.301 ms
(39 rows)

(The work_mem setting was decreased to 1MB.)
Valgrind logs of the three backends affected are attached.

After applying the changes from 9fafa413ac6 and de6428afe13 I don't see this warning.

Best regards,

------
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment