Thread: DBT-3 with SF=20 got failed

DBT-3 with SF=20 got failed

From
Kouhei Kaigai
Date:
Hello,

I got the following error during DBT-3 benchmark with SF=20.
  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824

It looks to me like the Hash node tries to allocate a 1GB area using palloc0(),
but that exceeds the limit of the non-huge allocation interface.

(gdb) bt
#0  0x00007f669d29a989 in raise () from /lib64/libc.so.6
#1  0x00007f669d29c098 in abort () from /lib64/libc.so.6
#2  0x000000000090ccfd in ExceptionalCondition (conditionName=0xb18130 "!(((Size) (size) <= ((Size) 0x3fffffff)))",
    errorType=0xb17efd "FailedAssertion", fileName=0xb17e40 "mcxt.c", lineNumber=856) at assert.c:54
#3  0x000000000093ad53 in palloc0 (size=1073741824) at mcxt.c:856
#4  0x0000000000673045 in ExecHashTableCreate (node=0x7f669de951f0, hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
#5  0x00000000006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
#6  0x000000000065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
#7  0x0000000000681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
#8  0x000000000065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
#9  0x0000000000681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
#10 0x000000000065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
#11 0x0000000000681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
#12 0x000000000065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
#13 0x0000000000681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
#14 0x000000000065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469

The attached patch replaces this palloc0() with MemoryContextAllocHuge() + memset().
Indeed, this hash table is built over a relation with nrows=119994544, so it is
not surprising that the bucket array itself grows larger than 1GB.

Another allocation request, the one that potentially resets or expands the
bucket array later, may also need the "Huge" version of memory allocation, I think.
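
For reference, the change is essentially of the following shape (just a sketch
of the idea, written against the palloc0() call in ExecHashTableCreate() of
nodeHash.c; the attached patch is the authoritative version and may differ in
detail):

    /* before: palloc0() is capped at MaxAllocSize (1GB - 1) */
    hashtable->buckets = (HashJoinTuple *)
        palloc0(nbuckets * sizeof(HashJoinTuple));

    /* after: use the huge allocator, then zero the array explicitly */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(CurrentMemoryContext,
                               nbuckets * sizeof(HashJoinTuple));
    memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));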

Thanks,

Below are the query itself and its EXPLAIN output.
--------------------------------------------------------------------
dbt3c=# EXPLAIN VERBOSE
dbt3c-# select
dbt3c-#         s_name,
dbt3c-#         count(*) as numwait
dbt3c-# from
dbt3c-#         supplier,
dbt3c-#         lineitem l1,
dbt3c-#         orders,
dbt3c-#         nation
dbt3c-# where
dbt3c-#         s_suppkey = l1.l_suppkey
dbt3c-#         and o_orderkey = l1.l_orderkey
dbt3c-#         and o_orderstatus = 'F'
dbt3c-#         and l1.l_receiptdate > l1.l_commitdate
dbt3c-#         and exists (
dbt3c(#                 select
dbt3c(#                         *
dbt3c(#                 from
dbt3c(#                         lineitem l2
dbt3c(#                 where
dbt3c(#                         l2.l_orderkey = l1.l_orderkey
dbt3c(#                         and l2.l_suppkey <> l1.l_suppkey
dbt3c(#         )
dbt3c-#         and not exists (
dbt3c(#                 select
dbt3c(#                         *
dbt3c(#                 from
dbt3c(#                         lineitem l3
dbt3c(#                 where
dbt3c(#                         l3.l_orderkey = l1.l_orderkey
dbt3c(#                         and l3.l_suppkey <> l1.l_suppkey
dbt3c(#                         and l3.l_receiptdate > l3.l_commitdate
dbt3c(#         )
dbt3c-#         and s_nationkey = n_nationkey
dbt3c-#         and n_name = 'UNITED KINGDOM'
dbt3c-# group by
dbt3c-#         s_name
dbt3c-# order by
dbt3c-#         numwait desc,
dbt3c-#         s_name
dbt3c-# LIMIT 100;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
   Output: supplier.s_name, (count(*))
   ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
         Output: supplier.s_name, (count(*))
         Sort Key: (count(*)) DESC, supplier.s_name
         ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
               Output: supplier.s_name, count(*)
               Group Key: supplier.s_name
               ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 width=26)
                     Output: supplier.s_name
                     ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 width=34)
                           Output: supplier.s_name, l1.l_suppkey, l1.l_orderkey
                           Join Filter: (supplier.s_nationkey = nation.n_nationkey)
                           ->  Nested Loop  (cost=4831094.37..6792736.19 rows=1 width=38)
                                 Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey
                                 ->  Nested Loop  (cost=4831093.81..6792728.20 rows=1 width=42)
                                       Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
                                       Join Filter: (l1.l_suppkey = supplier.s_suppkey)
                                       ->  Hash Semi Join  (cost=4831093.81..6783870.20 rows=1 width=12)
                                             Output: l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
                                             Hash Cond: (l1.l_orderkey = l2.l_orderkey)
                                             Join Filter: (l2.l_suppkey <> l1.l_suppkey)
                                             ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l1  (cost=0.57..1847781.73 rows=39998181 width=8)
                                                   Output: l1.l_orderkey, l1.l_partkey, l1.l_suppkey, l1.l_linenumber, l1.l_quantity, l1.l_extendedprice, l1.l_discount, l1.l_tax, l1.l_returnflag, l1.l_linestatus, l1.l_shipdate, l1.l_commitdate, l1.l_receiptdate, l1.l_shipinstruct, l1.l_shipmode, l1.l_comment
                                             ->  Hash  (cost=3331161.44..3331161.44 rows=119994544 width=8)
                                                   Output: l2.l_orderkey, l2.l_suppkey
                                                   ->  Seq Scan on public.lineitem l2  (cost=0.00..3331161.44 rows=119994544 width=8)
                                                         Output: l2.l_orderkey, l2.l_suppkey
                                       ->  Seq Scan on public.supplier  (cost=0.00..6358.00 rows=200000 width=34)
                                             Output: supplier.s_suppkey, supplier.s_name, supplier.s_address, supplier.s_nationkey, supplier.s_phone, supplier.s_acctbal, supplier.s_comment
                                 ->  Index Scan using orders_o_orderkey_o_orderdate_idx on public.orders  (cost=0.56..7.98 rows=1 width=4)
                                       Output: orders.o_orderkey, orders.o_custkey, orders.o_orderstatus, orders.o_totalprice, orders.o_orderdate, orders.o_orderpriority, orders.o_clerk, orders.o_shippriority, orders.o_comment
                                       Index Cond: (orders.o_orderkey = l1.l_orderkey)
                                       Filter: (orders.o_orderstatus = 'F'::bpchar)
                           ->  Seq Scan on public.nation  (cost=0.00..1.31 rows=1 width=4)
                                 Output: nation.n_nationkey, nation.n_name, nation.n_regionkey, nation.n_comment
                                 Filter: (nation.n_name = 'UNITED KINGDOM'::bpchar)
                     ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l3  (cost=0.57..13.69 rows=89 width=8)
                           Output: l3.l_orderkey, l3.l_partkey, l3.l_suppkey, l3.l_linenumber, l3.l_quantity, l3.l_extendedprice, l3.l_discount, l3.l_tax, l3.l_returnflag, l3.l_linestatus, l3.l_shipdate, l3.l_commitdate, l3.l_receiptdate, l3.l_shipinstruct, l3.l_shipmode, l3.l_comment
                           Index Cond: (l3.l_orderkey = l1.l_orderkey)
                           Filter: (l3.l_suppkey <> l1.l_suppkey)
(41 rows)

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: DBT-3 with SF=20 got failed

From
Merlin Moncure
Date:
On Wed, Jun 10, 2015 at 9:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Hello,
>
> I got the following error during DBT-3 benchmark with SF=20.
>
>   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
>   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
>
> It looks to me Hash node tries to 1GB area using palloc0(), but it exceeds
> the limitation of none-huge interface.
>
> (gdb) bt
> #0  0x00007f669d29a989 in raise () from /lib64/libc.so.6
> #1  0x00007f669d29c098 in abort () from /lib64/libc.so.6
> #2  0x000000000090ccfd in ExceptionalCondition (conditionName=0xb18130 "!(((Size) (size) <= ((Size) 0x3fffffff)))",
>     errorType=0xb17efd "FailedAssertion", fileName=0xb17e40 "mcxt.c", lineNumber=856) at assert.c:54
> #3  0x000000000093ad53 in palloc0 (size=1073741824) at mcxt.c:856
> #4  0x0000000000673045 in ExecHashTableCreate (node=0x7f669de951f0, hashOperators=0x24dbf90, keepNulls=0 '\000') at
nodeHash.c:391
> #5  0x00000000006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
> #6  0x000000000065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
> #7  0x0000000000681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
> #8  0x000000000065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
> #9  0x0000000000681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
> #10 0x000000000065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
> #11 0x0000000000681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
> #12 0x000000000065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
> #13 0x0000000000681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
> #14 0x000000000065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469
>
> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
> Indeed, this hash table is constructed towards the relation with nrows=119994544,
> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> Another allocation request potentially reset of expand hash-slot may also needs
> to be "Huge" version of memory allocation, I think.
>
> Thanks,
>
> Below is the query itself and EXPLAIN result.
> --------------------------------------------------------------------
> dbt3c=# EXPLAIN VERBOSE
> dbt3c-# select
> dbt3c-#         s_name,
> dbt3c-#         count(*) as numwait
> dbt3c-# from
> dbt3c-#         supplier,
> dbt3c-#         lineitem l1,
> dbt3c-#         orders,
> dbt3c-#         nation
> dbt3c-# where
> dbt3c-#         s_suppkey = l1.l_suppkey
> dbt3c-#         and o_orderkey = l1.l_orderkey
> dbt3c-#         and o_orderstatus = 'F'
> dbt3c-#         and l1.l_receiptdate > l1.l_commitdate
> dbt3c-#         and exists (
> dbt3c(#                 select
> dbt3c(#                         *
> dbt3c(#                 from
> dbt3c(#                         lineitem l2
> dbt3c(#                 where
> dbt3c(#                         l2.l_orderkey = l1.l_orderkey
> dbt3c(#                         and l2.l_suppkey <> l1.l_suppkey
> dbt3c(#         )
> dbt3c-#         and not exists (
> dbt3c(#                 select
> dbt3c(#                         *
> dbt3c(#                 from
> dbt3c(#                         lineitem l3
> dbt3c(#                 where
> dbt3c(#                         l3.l_orderkey = l1.l_orderkey
> dbt3c(#                         and l3.l_suppkey <> l1.l_suppkey
> dbt3c(#                         and l3.l_receiptdate > l3.l_commitdate
> dbt3c(#         )
> dbt3c-#         and s_nationkey = n_nationkey
> dbt3c-#         and n_name = 'UNITED KINGDOM'
> dbt3c-# group by
> dbt3c-#         s_name
> dbt3c-# order by
> dbt3c-#         numwait desc,
> dbt3c-#         s_name
> dbt3c-# LIMIT 100;
>
>     QUERY PLAN
>
>
--------------------------------------------------------------------------------------------------------------------------------------------------
>
--------------------------------------------------------------------------------------------------------------------------------------------------
> ------------------
>  Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
>    Output: supplier.s_name, (count(*))
>    ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
>          Output: supplier.s_name, (count(*))
>          Sort Key: (count(*)) DESC, supplier.s_name
>          ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
>                Output: supplier.s_name, count(*)
>                Group Key: supplier.s_name
>                ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 width=26)
>                      Output: supplier.s_name
>                      ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 width=34)
>                            Output: supplier.s_name, l1.l_suppkey, l1.l_orderkey
>                            Join Filter: (supplier.s_nationkey = nation.n_nationkey)
>                            ->  Nested Loop  (cost=4831094.37..6792736.19 rows=1 width=38)
>                                  Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey
>                                  ->  Nested Loop  (cost=4831093.81..6792728.20 rows=1 width=42)
>                                        Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey,
l2.l_orderkey
>                                        Join Filter: (l1.l_suppkey = supplier.s_suppkey)
>                                        ->  Hash Semi Join  (cost=4831093.81..6783870.20 rows=1 width=12)
>                                              Output: l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
>                                              Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>                                              Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>                                              ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l1
(cost=0.57..1847781.73 rows
 
> =39998181 width=8)
>                                                    Output: l1.l_orderkey, l1.l_partkey, l1.l_suppkey,
l1.l_linenumber,l1.l_quantity, l1.l_extende
 
> dprice, l1.l_discount, l1.l_tax, l1.l_returnflag, l1.l_linestatus, l1.l_shipdate, l1.l_commitdate, l1.l_receiptdate,
l1.l_shipinstruct,l1.l_shipm
 
> ode, l1.l_comment
>                                              ->  Hash  (cost=3331161.44..3331161.44 rows=119994544 width=8)
>                                                    Output: l2.l_orderkey, l2.l_suppkey
>                                                    ->  Seq Scan on public.lineitem l2  (cost=0.00..3331161.44
rows=119994544width=8)
 
>                                                          Output: l2.l_orderkey, l2.l_suppkey
>                                        ->  Seq Scan on public.supplier  (cost=0.00..6358.00 rows=200000 width=34)
>                                              Output: supplier.s_suppkey, supplier.s_name, supplier.s_address,
supplier.s_nationkey,supplier.s_pho
 
> ne, supplier.s_acctbal, supplier.s_comment
>                                  ->  Index Scan using orders_o_orderkey_o_orderdate_idx on public.orders
(cost=0.56..7.98rows=1 width=4)
 
>                                        Output: orders.o_orderkey, orders.o_custkey, orders.o_orderstatus,
orders.o_totalprice,orders.o_orderdate,
 
>  orders.o_orderpriority, orders.o_clerk, orders.o_shippriority, orders.o_comment
>                                        Index Cond: (orders.o_orderkey = l1.l_orderkey)
>                                        Filter: (orders.o_orderstatus = 'F'::bpchar)
>                            ->  Seq Scan on public.nation  (cost=0.00..1.31 rows=1 width=4)
>                                  Output: nation.n_nationkey, nation.n_name, nation.n_regionkey, nation.n_comment
>                                  Filter: (nation.n_name = 'UNITED KINGDOM'::bpchar)
>                      ->  Index Scan using lineitem_l_orderkey_idx_part1 on public.lineitem l3  (cost=0.57..13.69
rows=89width=8)
 
>                            Output: l3.l_orderkey, l3.l_partkey, l3.l_suppkey, l3.l_linenumber, l3.l_quantity,
l3.l_extendedprice,l3.l_discount, l 
> 3.l_tax, l3.l_returnflag, l3.l_linestatus, l3.l_shipdate, l3.l_commitdate, l3.l_receiptdate, l3.l_shipinstruct,
l3.l_shipmode,l3.l_comment
 
>                            Index Cond: (l3.l_orderkey = l1.l_orderkey)
>                            Filter: (l3.l_suppkey <> l1.l_suppkey)

curious: what was work_mem set to?

merlin



Re: DBT-3 with SF=20 got failed

From
Kouhei Kaigai
Date:
> curious: what was work_mem set to?
>
work_mem=48GB

My machine mounts 256GB physical RAM.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: Merlin Moncure [mailto:mmoncure@gmail.com]
> Sent: Thursday, June 11, 2015 10:52 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] DBT-3 with SF=20 got failed
> 
> On Wed, Jun 10, 2015 at 9:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > Hello,
> >
> > I got the following error during DBT-3 benchmark with SF=20.
> >
> >   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
> >   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
> >
> > It looks to me Hash node tries to 1GB area using palloc0(), but it exceeds
> > the limitation of none-huge interface.
> >
> > (gdb) bt
> > #0  0x00007f669d29a989 in raise () from /lib64/libc.so.6
> > #1  0x00007f669d29c098 in abort () from /lib64/libc.so.6
> > #2  0x000000000090ccfd in ExceptionalCondition (conditionName=0xb18130
> "!(((Size) (size) <= ((Size) 0x3fffffff)))",
> >     errorType=0xb17efd "FailedAssertion", fileName=0xb17e40 "mcxt.c",
> lineNumber=856) at assert.c:54
> > #3  0x000000000093ad53 in palloc0 (size=1073741824) at mcxt.c:856
> > #4  0x0000000000673045 in ExecHashTableCreate (node=0x7f669de951f0,
> hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
> > #5  0x00000000006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
> > #6  0x000000000065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
> > #7  0x0000000000681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
> > #8  0x000000000065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
> > #9  0x0000000000681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
> > #10 0x000000000065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
> > #11 0x0000000000681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
> > #12 0x000000000065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
> > #13 0x0000000000681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
> > #14 0x000000000065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469
> >
> > The attached patch replaces this palloc0() by MemoryContextAllocHuge() +
> memset().
> > Indeed, this hash table is constructed towards the relation with
> nrows=119994544,
> > so, it is not strange even if hash-slot itself is larger than 1GB.
> >
> > Another allocation request potentially reset of expand hash-slot may also needs
> > to be "Huge" version of memory allocation, I think.
> >
> > Thanks,
> >
> > Below is the query itself and EXPLAIN result.
> > --------------------------------------------------------------------
> > dbt3c=# EXPLAIN VERBOSE
> > dbt3c-# select
> > dbt3c-#         s_name,
> > dbt3c-#         count(*) as numwait
> > dbt3c-# from
> > dbt3c-#         supplier,
> > dbt3c-#         lineitem l1,
> > dbt3c-#         orders,
> > dbt3c-#         nation
> > dbt3c-# where
> > dbt3c-#         s_suppkey = l1.l_suppkey
> > dbt3c-#         and o_orderkey = l1.l_orderkey
> > dbt3c-#         and o_orderstatus = 'F'
> > dbt3c-#         and l1.l_receiptdate > l1.l_commitdate
> > dbt3c-#         and exists (
> > dbt3c(#                 select
> > dbt3c(#                         *
> > dbt3c(#                 from
> > dbt3c(#                         lineitem l2
> > dbt3c(#                 where
> > dbt3c(#                         l2.l_orderkey = l1.l_orderkey
> > dbt3c(#                         and l2.l_suppkey <> l1.l_suppkey
> > dbt3c(#         )
> > dbt3c-#         and not exists (
> > dbt3c(#                 select
> > dbt3c(#                         *
> > dbt3c(#                 from
> > dbt3c(#                         lineitem l3
> > dbt3c(#                 where
> > dbt3c(#                         l3.l_orderkey = l1.l_orderkey
> > dbt3c(#                         and l3.l_suppkey <> l1.l_suppkey
> > dbt3c(#                         and l3.l_receiptdate > l3.l_commitdate
> > dbt3c(#         )
> > dbt3c-#         and s_nationkey = n_nationkey
> > dbt3c-#         and n_name = 'UNITED KINGDOM'
> > dbt3c-# group by
> > dbt3c-#         s_name
> > dbt3c-# order by
> > dbt3c-#         numwait desc,
> > dbt3c-#         s_name
> > dbt3c-# LIMIT 100;
> >
> >     QUERY PLAN
> >
> >
> ----------------------------------------------------------------------------
> ----------------------------------------------------------------------
> >
> ----------------------------------------------------------------------------
> ----------------------------------------------------------------------
> > ------------------
> >  Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
> >    Output: supplier.s_name, (count(*))
> >    ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
> >          Output: supplier.s_name, (count(*))
> >          Sort Key: (count(*)) DESC, supplier.s_name
> >          ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
> >                Output: supplier.s_name, count(*)
> >                Group Key: supplier.s_name
> >                ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21
> rows=1 width=26)
> >                      Output: supplier.s_name
> >                      ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1
> width=34)
> >                            Output: supplier.s_name, l1.l_suppkey,
> l1.l_orderkey
> >                            Join Filter: (supplier.s_nationkey =
> nation.n_nationkey)
> >                            ->  Nested Loop  (cost=4831094.37..6792736.19
> rows=1 width=38)
> >                                  Output: supplier.s_name,
> supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey
> >                                  ->  Nested Loop
> (cost=4831093.81..6792728.20 rows=1 width=42)
> >                                        Output: supplier.s_name,
> supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey, l2.l_orderkey
> >                                        Join Filter: (l1.l_suppkey =
> supplier.s_suppkey)
> >                                        ->  Hash Semi Join
> (cost=4831093.81..6783870.20 rows=1 width=12)
> >                                              Output: l1.l_suppkey,
> l1.l_orderkey, l2.l_orderkey
> >                                              Hash Cond: (l1.l_orderkey =
> l2.l_orderkey)
> >                                              Join Filter: (l2.l_suppkey <>
> l1.l_suppkey)
> >                                              ->  Index Scan using
> lineitem_l_orderkey_idx_part1 on public.lineitem l1  (cost=0.57..1847781.73
> rows
> > =39998181 width=8)
> >                                                    Output: l1.l_orderkey,
> l1.l_partkey, l1.l_suppkey, l1.l_linenumber, l1.l_quantity, l1.l_extende
> > dprice, l1.l_discount, l1.l_tax, l1.l_returnflag, l1.l_linestatus,
> l1.l_shipdate, l1.l_commitdate, l1.l_receiptdate, l1.l_shipinstruct,
> l1.l_shipm
> > ode, l1.l_comment
> >                                              ->  Hash
> (cost=3331161.44..3331161.44 rows=119994544 width=8)
> >                                                    Output: l2.l_orderkey,
> l2.l_suppkey
> >                                                    ->  Seq Scan on
> public.lineitem l2  (cost=0.00..3331161.44 rows=119994544 width=8)
> >                                                          Output:
> l2.l_orderkey, l2.l_suppkey
> >                                        ->  Seq Scan on public.supplier
> (cost=0.00..6358.00 rows=200000 width=34)
> >                                              Output: supplier.s_suppkey,
> supplier.s_name, supplier.s_address, supplier.s_nationkey, supplier.s_pho
> > ne, supplier.s_acctbal, supplier.s_comment
> >                                  ->  Index Scan using
> orders_o_orderkey_o_orderdate_idx on public.orders  (cost=0.56..7.98 rows=1
> width=4)
> >                                        Output: orders.o_orderkey,
> orders.o_custkey, orders.o_orderstatus, orders.o_totalprice,
> orders.o_orderdate,
> >  orders.o_orderpriority, orders.o_clerk, orders.o_shippriority,
> orders.o_comment
> >                                        Index Cond: (orders.o_orderkey =
> l1.l_orderkey)
> >                                        Filter: (orders.o_orderstatus =
> 'F'::bpchar)
> >                            ->  Seq Scan on public.nation  (cost=0.00..1.31
> rows=1 width=4)
> >                                  Output: nation.n_nationkey, nation.n_name,
> nation.n_regionkey, nation.n_comment
> >                                  Filter: (nation.n_name = 'UNITED
> KINGDOM'::bpchar)
> >                      ->  Index Scan using lineitem_l_orderkey_idx_part1 on
> public.lineitem l3  (cost=0.57..13.69 rows=89 width=8)
> >                            Output: l3.l_orderkey, l3.l_partkey,
> l3.l_suppkey, l3.l_linenumber, l3.l_quantity, l3.l_extendedprice,
> l3.l_discount, l
> > 3.l_tax, l3.l_returnflag, l3.l_linestatus, l3.l_shipdate, l3.l_commitdate,
> l3.l_receiptdate, l3.l_shipinstruct, l3.l_shipmode, l3.l_comment
> >                            Index Cond: (l3.l_orderkey = l1.l_orderkey)
> >                            Filter: (l3.l_suppkey <> l1.l_suppkey)
> 
> curious: what was work_mem set to?
> 
> merlin

Re: DBT-3 with SF=20 got failed

From
Jan Wieck
Date:
On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>> curious: what was work_mem set to?
>>
> work_mem=48GB
>
> My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort and 
hash_aggregate may each allocate that much. You should set work_mem to a 
fraction of physical-RAM / concurrent-connections depending on the 
complexity of your queries. 48GB does not sound reasonable.


Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
> Indeed, this hash table is constructed towards the relation with nrows=119994544,
> so, it is not strange even if hash-slot itself is larger than 1GB.

You forgot to attach the patch, I think.  It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Kohei KaiGai
Date:
2015-06-11 23:20 GMT+09:00 Jan Wieck <jan@wi3ck.info>:
> On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>>>
>>> curious: what was work_mem set to?
>>>
>> work_mem=48GB
>>
>> My machine mounts 256GB physical RAM.
>
>
> work_mem can be allocated several times per backend. Nodes like sort and
> hash_aggregate may each allocate that much. You should set work_mem to a
> fraction of physical-RAM / concurrent-connections depending on the
> complexity of your queries. 48GB does not sound reasonable.
>
A small max_connections combined with a large work_mem is a usual
configuration for typical OLAP workloads.

Even if the configuration is not reasonable, this is not the right error
message; people cannot tell from it how to fix the problem.

psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 06/11/15 16:20, Jan Wieck wrote:
> On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>>> curious: what was work_mem set to?
>>>
>> work_mem=48GB
>>
>> My machine mounts 256GB physical RAM.
>
> work_mem can be allocated several times per backend. Nodes like sort
> and hash_aggregate may each allocate that much. You should set
> work_mem to a fraction of physical-RAM / concurrent-connections
> depending on the complexity of your queries. 48GB does not sound
> reasonable.

That's true, but there are cases where values like this may be useful 
(e.g. for a particular query). We do allow such work_mem values, so I 
consider this failure to be a bug.

It probably existed in the past, but was amplified by the hash join 
improvements I did for 9.5, because they use NTUP_PER_BUCKET=1 instead 
of NTUP_PER_BUCKET=10. So the bucket arrays are much larger, and we also 
use much more memory than we did in the past.

Interestingly, the hash code checks for INT_MAX overflows in a number of 
places, but does not check for this ...

regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Kohei KaiGai
Date:
2015-06-11 23:28 GMT+09:00 Robert Haas <robertmhaas@gmail.com>:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
>> Indeed, this hash table is constructed towards the relation with nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.
>
Oops, I forgot to attach indeed.

>  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.
>
EXPLAIN says this Hash node is built from an underlying SeqScan with
119994544 (~119 million) rows, but the resulting hash table is still
much smaller than my work_mem setting.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>

Attachment

Re: DBT-3 with SF=20 got failed

From
Kohei KaiGai
Date:
2015-06-11 23:33 GMT+09:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
> Hi,
>
> On 06/11/15 16:20, Jan Wieck wrote:
>>
>> On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>>>>
>>>> curious: what was work_mem set to?
>>>>
>>> work_mem=48GB
>>>
>>> My machine mounts 256GB physical RAM.
>>
>>
>> work_mem can be allocated several times per backend. Nodes like sort
>> and hash_aggregate may each allocate that much. You should set
>> work_mem to a fraction of physical-RAM / concurrent-connections
>> depending on the complexity of your queries. 48GB does not sound
>> reasonable.
>
>
> That's true, but there are cases where values like this may be useful (e.g.
> for a particular query). We do allow such work_mem values, so I consider
> this failure to be a bug.
>
> It probably existed in the past, but was amplified by the hash join
> improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead of
> NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we also
> much more memory than we had in the past.
>
> Interestingly, the hash code checks for INT_MAX overflows on a number of
> places, but does not check for this ...
>
Which number should be changed in this case?

Indeed, nbuckets is declared as int, so INT_MAX is the hard limit on the
number of hash buckets. However, some extreme usage can easily create a
situation where we hit this restriction.

Should we make nbuckets a long int?

-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Interestingly, the hash code checks for INT_MAX overflows on a number of 
> places, but does not check for this ...

Yeah, and at least at one time there were checks to prevent the hash table
request from exceeding MaxAllocSize.  Did those get removed by somebody?
        regards, tom lane



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 06/11/15 16:54, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Interestingly, the hash code checks for INT_MAX overflows on a number of
>> places, but does not check for this ...
>
> Yeah, and at least at one time there were checks to prevent the hash
> table request from exceeding MaxAllocSize. Did those get removed by
> somebody?

I think the problem is in this piece of code:
    if ((hashtable->nbatch == 1) &&
        (hashtable->nbuckets_optimal <= INT_MAX / 2) &&    /* overflow protection */
        (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
    {
        hashtable->nbuckets_optimal *= 2;
        hashtable->log2_nbuckets_optimal += 1;
    }

ISTM it does not check against the max_pointers (that's only done in 
ExecChooseHashTableSize).
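
For illustration, one possible shape of such a guard (a sketch only, reusing
the names from the snippet above; where exactly the cap belongs is the open
question here):

    if ((hashtable->nbatch == 1) &&
        (hashtable->nbuckets_optimal <= INT_MAX / 2) &&    /* overflow protection */
        /* sketch: also respect the 1GB palloc() limit on the bucket array */
        ((Size) hashtable->nbuckets_optimal * 2 <= MaxAllocSize / sizeof(HashJoinTuple)) &&
        (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
    {
        hashtable->nbuckets_optimal *= 2;
        hashtable->log2_nbuckets_optimal += 1;
    }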



--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 06/11/15 16:28, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
>> Indeed, this hash table is constructed towards the relation with nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.

Actually, HashJoinTuple is just a pointer, so each bucket is 8 bytes, and 1GB 
is enough for 134217728 buckets (~134 million), which is more than the 
119994544 rows from the plan.
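
For reference, the relevant declarations look roughly like this (paraphrased
from hashjoin.h rather than copied verbatim; the ~16 bytes Robert counted
presumably correspond to the per-tuple HashJoinTupleData header, not to the
bucket array entries):

    typedef struct HashJoinTupleData
    {
        struct HashJoinTupleData *next; /* link to next tuple in same bucket */
        uint32      hashvalue;          /* tuple's hash code */
        /* tuple data, in MinimalTuple format, follows on a MAXALIGN boundary */
    } HashJoinTupleData;

    typedef HashJoinTupleData *HashJoinTuple;

    /* in ExecHashTableCreate(): the bucket array is one pointer per bucket */
    hashtable->buckets = (HashJoinTuple *)
        palloc0(nbuckets * sizeof(HashJoinTuple));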

Also, looking at the error message again:
    ERROR:  invalid memory alloc request size 1073741824

but this time with beer goggles, I noticed that the amount reported is 
exactly 1GB. The backtrace also shows the error happens right inside 
ExecHashTableCreate (and not in the resize which may happen later), 
which means it gets the nbuckets from ExecChooseHashTableSize directly.

The resize is probably still broken, as I mentioned before, but this query 
crashes before reaching that code, because the estimates are high enough to 
trigger the issue already at creation time. But ExecChooseHashTableSize is 
supposed to keep all the checks from previous versions, and I think it 
actually does.

But I don't see any check there regarding the 1GB boundary. What I see 
is this:
    max_pointers = (work_mem * 1024L) / sizeof(void *);
    max_pointers = Min(max_pointers, INT_MAX / 2);
    ...
    dbuckets = Min(dbuckets, max_pointers);

That has nothing to do with 1GB, and it has been in the code since the time 
work_mem was limited to 2GB, so perhaps the reasoning was that it's 
sufficient (because the tuples stored in the hash table will need more than 
half of the memory, or something like that).

But today this issue is more likely, because people have more RAM and use 
higher work_mem values, so the max_pointers value gets much higher. In the 
extreme it may reach INT_MAX/2, i.e. ~1 billion, so the bucket array would 
require ~8GB on 64-bit machines (on 32-bit machines we'd get twice the 
number of pointers compared to 64 bits, but that's mostly irrelevant 
because of the memory size limits).

It's also true that the hash-join improvements in 9.5 - namely the decrease 
of NTUP_PER_BUCKET from 10 to 1 - made this error more likely. With 9.4 we'd 
use only 16777216 buckets (128MB), because that keeps us below 10 tuples per 
bucket. But now we're shooting for 1 tuple per bucket, so we end up with 
~134M buckets, and that's exactly 1GB.
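
To spell out the arithmetic (using the row estimate from the plan above and
8-byte bucket pointers):

    row estimate            = 119,994,544
    NTUP_PER_BUCKET = 1     -> nbuckets rounded up to 2^27 = 134,217,728
    bucket array            = 2^27 * sizeof(HashJoinTuple) = 2^27 * 8 B = 1,073,741,824 B
    MaxAllocSize            = 1GB - 1, so palloc0() rejects exactly the size
                              reported in the error message.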

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
    necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit 
stupid not to be able to use fast hash table because there's some 
artificial limit. Are there any fundamental reasons not to use the 
MemoryContextAllocHuge fix, proposed by KaiGai-san?


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
David Rowley
Date:
On 12 June 2015 at 02:40, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
2015-06-11 23:28 GMT+09:00 Robert Haas <robertmhaas@gmail.com>:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
>> Indeed, this hash table is constructed towards the relation with nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.
>
Oops, I forgot to attach indeed.

>  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.
>
EXPLAIN says, this Hash node takes underlying SeqScan with
119994544 (~119 million) rows, but it is much smaller than my
work_mem setting.


I've just run into this problem while running a TPC-H benchmark of 100GB, on a machine with 64GB of RAM.
When attempting to run Q21 with a work_mem of 10GB I'm getting:
 ERROR:  invalid memory alloc request size 1073741824

Setting work_mem to 1GB or less gets the query running.

I've patched the code with your patch Kohei, and it's now working.

Thought I'd better post this just in case this gets forgotten about.

Thanks

David

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

Re: DBT-3 with SF=20 got failed

From
Simon Riggs
Date:
On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
    necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit stupid not to be able to use fast hash table because there's some artificial limit. Are there any fundamental reasons not to use the MemoryContextAllocHuge fix, proposed by KaiGai-san?

If there are no objections, I will apply the patch for 2) to HEAD and backpatch to 9.5.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: DBT-3 with SF=20 got failed

From
Kohei KaiGai
Date:
2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:
> On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>>
>> I see two ways to fix this:
>>
>> (1) enforce the 1GB limit (probably better for back-patching, if that's
>>     necessary)
>>
>> (2) make it work with hash tables over 1GB
>>
>> I'm in favor of (2) if there's a good way to do that. It seems a bit
>> stupid not to be able to use fast hash table because there's some artificial
>> limit. Are there any fundamental reasons not to use the
>> MemoryContextAllocHuge fix, proposed by KaiGai-san?
>
>
> If there are no objections, I will apply the patch for 2) to HEAD and
> backpatch to 9.5.
>
Please don't be rush. :-)

It is not difficult to replace palloc() by palloc_huge(), however, it may lead
another problem once planner gives us a crazy estimation.
Below is my comment on the another thread.

==========
Also, we may need to pay attention to the reliability of the planner's
row estimates. Even though the plan below says the Join generates
60521928028 rows, it actually generates 776157676 rows (about 1.3% of
the estimate).


tpcds100=# EXPLAIN ANALYZE
           select ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
             from web_sales ws1, web_sales ws2
            where ws1.ws_order_number = ws2.ws_order_number
              and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028 width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=73252.300..79017.420 rows=72001237 loops=1)
         Sort Key: ws1.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237 loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=65095.655..128885.811 rows=904010978 loops=1)
         Sort Key: ws2.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237 loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably, 1GB is a good starting point even if the table is expanded later.

-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: DBT-3 with SF=20 got failed

From
Simon Riggs
Date:
On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:
> On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>>
>> I see two ways to fix this:
>>
>> (1) enforce the 1GB limit (probably better for back-patching, if that's
>>     necessary)
>>
>> (2) make it work with hash tables over 1GB
>>
>> I'm in favor of (2) if there's a good way to do that. It seems a bit
>> stupid not to be able to use fast hash table because there's some artificial
>> limit. Are there any fundamental reasons not to use the
>> MemoryContextAllocHuge fix, proposed by KaiGai-san?
>
>
> If there are no objections, I will apply the patch for 2) to HEAD and
> backpatch to 9.5.
>
Please don't be rush. :-)

Please explain what rush you see?
 
It is not difficult to replace palloc() by palloc_huge(), however, it may lead
another problem once planner gives us a crazy estimation.
Below is my comment on the another thread.

 Yes, I can read both threads and would apply patches for each problem.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: DBT-3 with SF=20 got failed

From
Kohei KaiGai
Date:
2015-08-19 21:29 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:
> On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>>
>> 2015-08-19 20:12 GMT+09:00 Simon Riggs <simon@2ndquadrant.com>:
>> > On 12 June 2015 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> > wrote:
>> >
>> >>
>> >> I see two ways to fix this:
>> >>
>> >> (1) enforce the 1GB limit (probably better for back-patching, if that's
>> >>     necessary)
>> >>
>> >> (2) make it work with hash tables over 1GB
>> >>
>> >> I'm in favor of (2) if there's a good way to do that. It seems a bit
>> >> stupid not to be able to use fast hash table because there's some
>> >> artificial
>> >> limit. Are there any fundamental reasons not to use the
>> >> MemoryContextAllocHuge fix, proposed by KaiGai-san?
>> >
>> >
>> > If there are no objections, I will apply the patch for 2) to HEAD and
>> > backpatch to 9.5.
>> >
>> Please don't be rush. :-)
>
>
> Please explain what rush you see?
>
Unless we have a fail-safe mechanism for the case where the planner
estimates a much larger number of tuples than are actually needed, a
strange estimation will consume a massive amount of RAM. That's a bad
side effect. My previous patch didn't pay attention to that scenario,
so it needs to be revised.

Thanks,

>> It is not difficult to replace palloc() by palloc_huge(), however, it may
>> lead
>> another problem once planner gives us a crazy estimation.
>> Below is my comment on the another thread.
>
>
>  Yes, I can read both threads and would apply patches for each problem.
>
> --
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>> Please don't be rush. :-)

> Please explain what rush you see?

Yours.  You appear to be in a hurry to apply patches that there's no
consensus on.

>> It is not difficult to replace palloc() by palloc_huge(), however, it
>> may lead another problem once planner gives us a crazy estimation.
>> Below is my comment on the another thread.

>  Yes, I can read both threads and would apply patches for each problem.

I don't see anything very wrong with constraining the initial allocation
to 1GB, or even less.  That will prevent consuming insane amounts of
work_mem when the planner's rows estimate is too high rather than too low.
And we do have the ability to increase the hash table size on the fly.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize.  Now that we know there is one, that should be
fixed (and not only in HEAD/9.5).  But I believe Kaigai-san withdrew his
initial proposed patch, and we don't have a replacement as far as I saw.
        regards, tom lane



Re: DBT-3 with SF=20 got failed

From
Simon Riggs
Date:
On 19 August 2015 at 14:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 19 August 2015 at 12:55, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>> Please don't be rush. :-)

> Please explain what rush you see?

Yours.  You appear to be in a hurry to apply patches that there's no
consensus on.

I think that comment is unreasonable.

The problem was reported 2 months ago; following independent confirmation of the proposed patch, I suggested committing it, with these words:

"If there are no objections, I will apply the patch for 2) to HEAD and backpatch to 9.5."

I was clearly waiting for objections before acting, to test whether there was consensus or not.

Please explain what procedure you would like committers to follow?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 08/19/2015 01:55 PM, Kohei KaiGai wrote:
>   Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
> width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
>     Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
>     Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
>     Rows Removed by Join Filter: 127853313
>     ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
> (actual time=73252.300..79017.420 rows=72001237 loops=1)
>           Sort Key: ws1.ws_order_number
>           Sort Method: quicksort  Memory: 7083296kB
>           ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
> rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
> loops=1)
>     ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
> (actual time=65095.655..128885.811 rows=904010978 loops=1)
>           Sort Key: ws2.ws_order_number
>           Sort Method: quicksort  Memory: 7083296kB
>           ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
> rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
> loops=1)
>   Planning time: 0.232 ms
>   Execution time: 530176.521 ms
> (14 rows)
>
>
> So, even if we allows nodeHash.c to allocate hash buckets larger than
> 1GB, its initial size may be determined carefully.
> Probably, 1GB is a good starting point even if expanded later.

I'm not sure I understand what is the problem here? Could you elaborate?

The initial size of the hash table is determined using the estimate, and 
if we overestimate it will create more buckets (i.e. consuming more 
memory) and/or start batching (which might be unnecessary).

But I don't really see any "more careful" way to do this, without 
penalizing the cases where the estimate is actually correct - e.g. by 
starting with much smaller buckets (and then resizing the hash table, 
which is not free). Or by starting without batching, betting that we 
won't actually need it.

I think it'll be very difficult to get those working without causing 
real trouble to cases where we actually do have good estimates (and 
those are vast majority of queries).

But both of those are features, and we're dealing with a bug fix here.


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hello KaiGain-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:
> Unless we have no fail-safe mechanism when planner estimated too
> large number of tuples than actual needs, a strange estimation will
> consume massive amount of RAMs. It's a bad side effect.
> My previous patch didn't pay attention to the scenario, so needs to
> revise the patch.

I agree we need to put a few more safeguards there (e.g. make sure we 
don't overflow INT when counting the buckets, which may happen with the 
amounts of work_mem we'll see in the wild soon).

But I think we should not do any extensive changes to how we size the 
hashtable - that's not something we should do in a bugfix I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 08/20/2015 04:15 AM, Tomas Vondra wrote:
> Hello KaiGain-san,
>
> On 08/19/2015 03:19 PM, Kohei KaiGai wrote:
>> Unless we have no fail-safe mechanism when planner estimated too
>> large number of tuples than actual needs, a strange estimation will
>> consume massive amount of RAMs. It's a bad side effect.
>> My previous patch didn't pay attention to the scenario, so needs to
>> revise the patch.
>
> I agree we need to put a few more safeguards there (e.g. make sure we
> don't overflow INT when counting the buckets, which may happen with the
> amounts of work_mem we'll see in the wild soon).
>
> But I think we should not do any extensive changes to how we size the
> hashtable - that's not something we should do in a bugfix I think.

Attached are two alternative versions of a backpatch. Both limit
nbuckets so that it never exceeds MaxAllocSize - effectively 512MB due
to the doubling (so ~67M buckets on 64-bit architectures).

The only difference is that the 'alternative' patch limits max_pointers

+       /* ensure we don't exceed the maximum allocation size */
+       max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));

so it affects both nbuckets and nbatches. That seems a bit more correct,
but I guess whoever gets this many batches would be grateful even for
the quick crash.


For master, I think the right patch is what KaiGai-san posted in June. I
don't think we should really try making it smarter about handling
overestimates at this point - that's 9.6 stuff IMNSHO.

I find it a bit awkward that we only have MemoryContextAllocHuge and
repalloc_huge, especially as nodeHash.c needs MemoryContextAllocHuge +
memset to zero the chunk.

So I think we should extend the memutils API by adding palloc0_huge (and
possibly palloc_huge, although that's not needed by nodeHash.c).
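
Something along these lines, as a minimal sketch (assuming it simply wraps the
existing huge allocator; the final name and signature may of course differ):

    /* hypothetical wrapper, mirroring palloc0() for huge requests */
    void *
    palloc0_huge(Size size)
    {
        void   *ptr = MemoryContextAllocHuge(CurrentMemoryContext, size);

        memset(ptr, 0, size);
        return ptr;
    }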


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 08/19/2015 03:53 PM, Tom Lane wrote:
>
> I don't see anything very wrong with constraining the initial
> allocation to 1GB, or even less. That will prevent consuming insane
> amounts of work_mem when the planner's rows estimate is too high
> rather than too low. And we do have the ability to increase the hash
> table size on the fly.

Perhaps. Limiting the initial allocation to 1GB might help with lowering 
memory consumed in case of over-estimation, and it's high enough so that 
regular queries don't run into that.

My initial thought was "if the memory consumption is a problem, lower 
the work_mem" but after thinking about it for a while I don't see a 
reason to limit the memory consumption this way, if possible.

But what is the impact on queries that actually need more than 1GB of 
buckets? I assume we'd only limit the initial allocation and still allow 
the resize based on the actual data (i.e. the 9.5 improvement), so the 
queries would start with 1GB and then resize once finding out the 
optimal size (as done in 9.5). The resize is not very expensive, but 
it's not free either, and with so many tuples (requiring more than 1GB 
of buckets, i.e. ~130M tuples) it's probably just a noise in the total 
query runtime. But it'd be nice to see some proof of that ...

Also, why 1GB and not 512MB (which is effectively the current limit on 
9.4, because we can't allocate 1GB there so we end up with 1/2 of that)? 
Why not some small percentage of work_mem, e.g. 5%?

Most importantly, this is mostly orthogonal to the original bug report. 
Even if we do this, we still have to fix the palloc after the resize.

So I'd say let's do a minimal bugfix of the actual issue, and then 
perhaps improve the behavior in case of significant overestimates. The 
bugfix should happen ASAP so that it gets into 9.5.0 (and backported). 
We already have patches for that.

Limiting the initial allocation deserves more thorough discussion and 
testing, and I'd argue it's 9.6 material at this point.

> The real problem here is the potential integer overflow in
> ExecChooseHashTableSize. Now that we know there is one, that should
> be fixed (and not only in HEAD/9.5).

I don't think there's any integer overflow. The problem is that we end 
up with nbuckets so high that (nbuckets * 8) exceeds the MaxAllocSize 
limit of 1GB - 1.

There's a protection against integer overflow in place (it has been there 
for a long time), but there never was any protection against the 1GB limit, 
because with the much smaller work_mem values and NTUP_PER_BUCKET=10 we 
used in the past, we could not actually reach it.

> But I believe Kaigai-san withdrew his initial proposed patch, and we
> don't have a replacement as far as I saw.

I believe the patch proposed by KaiGai-san is the right one to fix the 
bug discussed in this thread. My understanding is KaiGai-san withdrew 
the patch as he wants to extend it to address the over-estimation issue.

I don't think we should do that - IMHO that's an unrelated improvement 
and should be addressed in a separate patch.


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Kouhei Kaigai
Date:
> On 08/19/2015 03:53 PM, Tom Lane wrote:
> >
> > I don't see anything very wrong with constraining the initial
> > allocation to 1GB, or even less. That will prevent consuming insane
> > amounts of work_mem when the planner's rows estimate is too high
> > rather than too low. And we do have the ability to increase the hash
> > table size on the fly.
>
> Perhaps. Limiting the initial allocation to 1GB might help with lowering
> memory consumed in case of over-estimation, and it's high enough so that
> regular queries don't run into that.
>
> My initial thought was "if the memory consumption is a problem, lower
> the work_mem" but after thinking about it for a while I don't see a
> reason to limit the memory consumption this way, if possible.
>
> But what is the impact on queries that actually need more than 1GB of
> buckets? I assume we'd only limit the initial allocation and still allow
> the resize based on the actual data (i.e. the 9.5 improvement), so the
> queries would start with 1GB and then resize once finding out the
> optimal size (as done in 9.5). The resize is not very expensive, but
> it's not free either, and with so many tuples (requiring more than 1GB
> of buckets, i.e. ~130M tuples) it's probably just a noise in the total
> query runtime. But I'd be nice to see some proofs of that ...
>
The problem here is that we cannot know the exact size unless the Hash
node reads the entire inner relation first. All we can do is rely on the
planner's estimation; however, it often computes a crazy number of rows.
I think resizing of hash buckets is a reasonable compromise.

> Also, why 1GB and not 512MB (which is effectively the current limit on
> 9.4, because we can't allocate 1GB there so we end up with 1/2 of that)?
> Why not some small percentage of work_mem, e.g. 5%?
>
No clear reason at this moment.

> Most importantly, this is mostly orthogonal to the original bug report.
> Even if we do this, we still have to fix the palloc after the resize.
>
> So I'd say let's do a minimal bugfix of the actual issue, and then
> perhaps improve the behavior in case of significant overestimates. The
> bugfix should happen ASAP so that it gets into 9.5.0 (and backported).
> We already have patches for that.
>
> Limiting the initial allocation deserves more thorough discussion and
> testing, and I'd argue it's 9.6 material at this point.
>
Indeed, the original bug was caused by the normal MemoryContextAlloc().
The issue around memory over-consumption is a problem newly caused
by this change; I didn't notice it two months ago.

> > The real problem here is the potential integer overflow in
> > ExecChooseHashTableSize. Now that we know there is one, that should
> > be fixed (and not only in HEAD/9.5).
>
> I don't think there's any integer overflow. The problem is that we end
> up with nbuckets so high that (nbuckets * 8) exceeds the 1GB-1 allocation
> limit.
>
> There's a protection against integer overflow in place (it was there for
> a long time), but there never was any protection against the 1GB limit.
> We've been using much smaller work_mem values and
> NTUP_PER_BUCKET=10, so we could not actually reach it.
>
> > But I believe Kaigai-san withdrew his initial proposed patch, and we
> > don't have a replacement as far as I saw.
>
> I believe the patch proposed by KaiGai-san is the right one to fix the
> bug discussed in this thread. My understanding is KaiGai-san withdrew
> the patch as he wants to extend it to address the over-estimation issue.
>
> I don't think we should do that - IMHO that's an unrelated improvement
> and should be addressed in a separate patch.
>
OK, it might not be a problem we should conclude within a few days, just
before the beta release.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hello KaiGai-san,

On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
...
>>
>> But what is the impact on queries that actually need more than 1GB
>> of buckets? I assume we'd only limit the initial allocation and
>> still allow the resize based on the actual data (i.e. the 9.5
>> improvement), so the queries would start with 1GB and then resize
>> once finding out the optimal size (as done in 9.5). The resize is
>> not very expensive, but it's not free either, and with so many
>> tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
>> probably just noise in the total query runtime. But it'd be nice
>> to see some proof of that ...
>>
> The problem here is that we cannot know the exact size unless the Hash
> node reads the entire inner relation. All we can do is rely on the
> planner's estimation, but it often computes a crazy number of rows.
> I think resizing the hash buckets is a reasonable compromise.

I understand the estimation problem. The question I think we need to 
answer is how to balance the behavior for well- and poorly-estimated 
cases. It'd be unfortunate if we lower the memory consumption in the 
over-estimated case while significantly slowing down the well-estimated 
ones.

I don't think we have a clear answer at this point - maybe it's not a 
problem at all and it'll be a win no matter what threshold we choose. 
But it's a separate problem from the bugfix.

>> I believe the patch proposed by KaiGai-san is the right one to fix
>> the bug discussed in this thread. My understanding is KaiGai-san
>> withdrew the patch as he wants to extend it to address the
>> over-estimation issue.
>>
>> I don't think we should do that - IMHO that's an unrelated
>> improvement and should be addressed in a separate patch.
>>
> OK, it might not be a problem we should conclude within a few days,
> just before the beta release.

I don't quite see a reason to wait for the over-estimation patch. We 
probably should backpatch the bugfix anyway (although it's much less 
likely to run into that before 9.5), and we can't really backpatch the 
behavior change there (as there's no hash resize).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Kouhei Kaigai
Date:
> Hello KaiGai-san,
> 
> On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
> ...
> >>
> >> But what is the impact on queries that actually need more than 1GB
> >> of buckets? I assume we'd only limit the initial allocation and
> >> still allow the resize based on the actual data (i.e. the 9.5
> >> improvement), so the queries would start with 1GB and then resize
> >> once finding out the optimal size (as done in 9.5). The resize is
> >> not very expensive, but it's not free either, and with so many
> >> tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
> >> probably just noise in the total query runtime. But it'd be nice
> >> to see some proof of that ...
> >>
> > The problem here is that we cannot know the exact size unless the Hash
> > node reads the entire inner relation. All we can do is rely on the
> > planner's estimation, but it often computes a crazy number of rows.
> > I think resizing the hash buckets is a reasonable compromise.
> 
> I understand the estimation problem. The question I think we need to
> answer is how to balance the behavior for well- and poorly-estimated
> cases. It'd be unfortunate if we lower the memory consumption in the
> over-estimated case while significantly slowing down the well-estimated
> ones.
> 
> I don't think we have a clear answer at this point - maybe it's not a
> problem at all and it'll be a win no matter what threshold we choose.
> But it's a separate problem from the bugfix.
>
I agree that this is a separate (and maybe not easy) problem.

If somebody knows of previous academic research in this area, please share it with us.

> >> I believe the patch proposed by KaiGai-san is the right one to fix
> >> the bug discussed in this thread. My understanding is KaiGai-san
> >> withdrew the patch as he wants to extend it to address the
> >> over-estimation issue.
> >>
> >> I don't think we should do that - IMHO that's an unrelated
> >> improvement and should be addressed in a separate patch.
> >>
> > OK, it might not be a problem we should conclude within a few days,
> > just before the beta release.
> 
> I don't quite see a reason to wait for the over-estimation patch. We
> probably should backpatch the bugfix anyway (although it's much less
> likely to run into that before 9.5), and we can't really backpatch the
> behavior change there (as there's no hash resize).
>
I won't argue against this bugfix anymore.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).
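
For the record, the reason it is exactly 8 bytes: sizeof applied to the
whole expression yields the size of the expression's type (size_t), not
the computed byte count. A standalone illustration (toy code, with
HashJoinTuple stubbed as a plain pointer):

    #include <stdio.h>

    typedef void *HashJoinTuple;    /* stand-in for the real typedef */

    int main(void)
    {
        int nbuckets = 1024;

        /* sizeof(expr) is the size of the expression's type, i.e. size_t */
        printf("%zu\n", sizeof(nbuckets * sizeof(HashJoinTuple)));  /* 8    */
        /* the intended byte count */
        printf("%zu\n", nbuckets * sizeof(HashJoinTuple));          /* 8192 */
        return 0;
    }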

I fixed it to

   memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: DBT-3 with SF=20 got failed

From
Kouhei Kaigai
Date:
> Hello KaiGai-san,
> 
> I've discovered a bug in the proposed patch - when resetting the hash
> table after the first batch, it does this:
> 
> memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));
> 
> The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
> the array (usually resulting in crashes due to invalid pointers).
> 
> I fixed it to
> 
>    memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
> 
> Fixed patch attached (marked as v2).
>
Thanks, it was my bug, an oversight on my part.

I'd like a committer to push this fix.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> Hello KaiGai-san,
>>
>> I've discovered a bug in the proposed patch - when resetting the hash
>> table after the first batch, it does this:
>>
>> memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));
>>
>> The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
>> the array (usually resulting in crashes due to invalid pointers).
>>
>> I fixed it to
>>
>>    memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
>>
>> Fixed patch attached (marked as v2).
>>
> Thanks, it was my bug, an oversight on my part.
>
> I'd like a committer to push this fix.

I'm not in agreement with this fix, and would prefer to instead
proceed by limiting the initial allocation to 1GB.  Otherwise, as
KaiGai has mentioned, we might end up trying to allocate completely
unreasonable amounts of memory if the planner gives a bad estimate.
Of course, it's true (as Tomas points out) that this issue already
exists today to some degree, and it's also true (as he also points
out) that 1GB is an arbitrary limit.  But it's also true that we use
that same arbitrary 1GB limit in a lot of places, so it's hardly
without precedent.

More importantly, removing the cap on the allocation size makes the
problem a lot worse.  You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem.  If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably.  Therefore, I don't really see the over-estimation bug
fix as being separate from this one.
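
To be concrete about what I mean by clamping, here is a rough standalone
sketch (toy code, not an actual patch; MAX_ALLOC_SIZE stands in for
palloc's 1GB-1 cap and the bucket array is assumed to hold plain
pointers):

    #include <stdio.h>

    #define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)    /* mirrors palloc's limit */

    /* clamp a planner-driven bucket count so the initial array stays pallocable */
    static long clamp_nbuckets(double ntuples)
    {
        long    nbuckets    = (long) ntuples;                       /* 1 bucket per tuple */
        long    max_buckets = (long) (MAX_ALLOC_SIZE / sizeof(void *));

        if (nbuckets > max_buckets)
            nbuckets = max_buckets;     /* real code would also round down to a power of 2 */
        return nbuckets;
    }

    int main(void)
    {
        printf("estimate 120M rows -> %ld buckets\n", clamp_nbuckets(120e6));
        printf("estimate 2B rows   -> %ld buckets\n", clamp_nbuckets(2e9));
        return 0;
    }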

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/08/2015 08:44 PM, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>>> Hello KaiGai-san,
>>>
>>> I've discovered a bug in the proposed patch - when resetting the hash
>>> table after the first batch, it does this:
>>>
>>> memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));
>>>
>>> The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
>>> the array (usually resulting in crashes due to invalid pointers).
>>>
>>> I fixed it to
>>>
>>>     memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
>>>
>>> Fixed patch attached (marked as v2).
>>>
>> Thanks, it was my bug, an oversight on my part.
>>
>> I'd like a committer to push this fix.
>
> I'm not in agreement with this fix, and would prefer to instead
> proceed by limiting the initial allocation to 1GB.  Otherwise, as
> KaiGai has mentioned, we might end up trying to allocate completely
> unreasonable amounts of memory if the planner gives a bad estimate.
> Of course, it's true (as Tomas points out) that this issue already
> exists today to some degree, and it's also true (as he also points
> out) that 1GB is an arbitrary limit.  But it's also true that we use
> that same arbitrary 1GB limit in a lot of places, so it's hardly
> without precedent.

AFAIK the limit is not 1GB, but 512MB (because we use 2^N and the actual 
limit is 1GB-1B). But that's a minor detail.

Also, I'm not sure what other places do you have in mind (could you list 
some examples?) but I'd bet we limit the allocation to 1GB because of 
the palloc() limit and not because of fear of over-estimates.

> More importantly, removing the cap on the allocation size makes the
> problem a lot worse.  You might be sad if a bad planner estimate
> causes the planner to allocate 1GB when 64MB would have been enough,
> but on modern systems it is not likely to be an enormous problem.  If
> a similar mis-estimation causes the planner to allocate 16GB rather
> than 1GB, the opportunity for you to be sad is magnified pretty
> considerably.  Therefore, I don't really see the over-estimation bug
> fix as being separate from this one.

Perhaps. But if you want to absolutely prevent such sadness then maybe 
you should not set work_mem that high?

Anyway, I do see this as a rather orthogonal problem - an independent 
improvement, mostly unrelated to the bugfix. Even if we decide to 
redesign it like this (and I'm not particularly opposed to that, 
assuming someone takes the time to measure how expensive the additional 
resize actually is), we'll still have to fix the repalloc().

So I still fail to see why we shouldn't apply this fix.

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Also, I'm not sure what other places do you have in mind (could you list
> some examples?) but I'd bet we limit the allocation to 1GB because of the
> palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that  palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that
tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun.  But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.

>> More importantly, removing the cap on the allocation size makes the
>> problem a lot worse.  You might be sad if a bad planner estimate
>> causes the planner to allocate 1GB when 64MB would have been enough,
>> but on modern systems it is not likely to be an enormous problem.  If
>> a similar mis-estimation causes the planner to allocate 16GB rather
>> than 1GB, the opportunity for you to be sad is magnified pretty
>> considerably.  Therefore, I don't really see the over-estimation bug
>> fix as being separate from this one.
>
> Perhaps. But if you want to absolutely prevent such sadness then maybe you
> should not set work_mem that high?

I think that's a red herring for a number of reasons.  One, the
allocation for the hash buckets is only a small portion of the total
memory.  Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.

> Anyway, I do see this as a rather orthogonal problem - an independent
> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
> it like this (and I'm not particularly opposed to that, assuming someone
> takes the time to measure how expensive the additional resize actually is),
> we'll still have to fix the repalloc().
>
> So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine.  I respect your opinion; I'm just
telling you mine, which happens to be different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
On 09/09/2015 03:55 PM, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Also, I'm not sure what other places do you have in mind (could you list
>> some examples?) but I'd bet we limit the allocation to 1GB because of the
>> palloc() limit and not because of fear of over-estimates.
>
> I don't really think those two things are different from each other.
> The palloc() limit is a means of enforcing a general policy of
> limiting all allocations to 1GB except in places where we've made a
> very conscious decision to allow a specific exception.  This limit
> happens to dovetail nicely with the varlena size limit, so in many
> cases it is the exactly correct limit just for that reason.  But even
> when, as here, that's not at issue, it's still a useful limit, because
> there are many ways that some garbage value can get passed to palloc
> -- bad planner estimates, corrupted tuples, bugs in other parts of our
> code.  And at least on my old MacBook Pro (I haven't tested the
> current one), passing a sufficiently-large value to malloc() causes a
> kernel panic.  That's probably a particularly bad bug, but there are
> lots of systems where "accidentally" allocating an unreasonable amount
> of space will have all kinds of unpleasant consequences.  So, I
> believe that  palloc()'s limit improves the overall stability of the
> system considerably even if it causes some occasional annoyance.

I'm not really buying this. The 1GB has nothing to do with platform 
limits, it's there exactly to make it varlena-like (which has exactly 
the same limit), and because it allows using 32-bit int to track all the 
bytes. Neither of these is relevant here.

It has nothing to do with malloc() limits on various platforms, and if 
there really are such limits that we think we should worry about, we 
should probably address those properly. Not by band-aiding all the 
various places independently.

And most importantly, these platform limits would apply to both the 
initial allocation and to the subsequent resize. It's utterly useless to 
just "fix" the initial allocation and then allow failure when we try to 
resize the hash table.

> Most of the time, you can just palloc() and not worry too much about
> whether you're going to blow up the machine: you won't, because you
> aren't going to allocate more than 1GB.  Any place that wants to
> allocate more than that needs to be someplace where we can be pretty
> sure that we're not going to accidentally allocate some completely
> unreasonable amount of memory, like say 1TB.  Nothing in this
> discussion convinces me that this is such a place.  Note that

We're not going to allocate a completely unreasonable amount of memory, 
because there already are some checks in place.

Firstly, you can't really get a bucket array larger than ~25% of work_mem, 
because the pointer is only 8B, while the HJ tuple has a 16B header plus 
the data (IIRC). For wider tuples the share taken by buckets decreases 
further.

Secondly, we limit the number of buckets to INT_MAX, so at most about 16GB 
(because buckets are just pointers). No matter how awful an estimate you 
get (or how insanely high you set work_mem), you can't exceed this.
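
(Spelled out as a quick standalone check, assuming 8-byte pointers:)

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        /* INT_MAX buckets, 8 bytes per bucket pointer */
        unsigned long long bytes = (unsigned long long) INT_MAX * 8;

        printf("%llu bytes = ~%.0f GB\n", bytes, bytes / (1024.0 * 1024 * 1024));
        return 0;
    }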

> tuplesort.c and tuplestore.c, the only existing callers of
> repalloc_huge, only allocate such large amounts of memory when they
> actually have enough tuples to justify it - it is always based on the
> actual number of tuples, never an estimate.  I think that would be a
> sound principle here, too.  Resizing the hash table to such a large
> size based on the actual load factor is very reasonable; starting with
> such a large size seems less so.  Admittedly, 512MB is an arbitrary
> point: and if it so happened that the limit was 256MB or 1GB or 128MB
> or even 2GB I wouldn't advocate for changing it just for fun. But
> you're saying we should just remove that limit altogether, and I think
> that's clearly unreasonable.  Do you really want to start out with a
> TB or even PB-sized hash table when the actual number of tuples is,
> say, one?  That may sound crazy, but I've seen enough bad query plans
> to know that, yes, we are sometimes off by nine orders of magnitude.
> This is not a hypothetical problem.

No, I'm not saying anything like that - I actually explicitly stated 
that I'm not against such change (further restricting the initial hash 
table size), if someone takes the time to do a bit of testing and 
provide some numbers.

Moreover as I explained there already are limits in place (25% of 
work_mem or 16GB, whichever is lower), so I don't really see the bugfix 
as unreasonable.

Maybe if we decide to lift this restriction (using int64 to address the 
buckets, which removes the 16GB limit) this issue will get much more 
pressing. But I guess hash tables handling 2B buckets will be enough for 
the near future.

>>> More importantly, removing the cap on the allocation size makes the
>>> problem a lot worse.  You might be sad if a bad planner estimate
>>> causes the planner to allocate 1GB when 64MB would have been enough,
>>> but on modern systems it is not likely to be an enormous problem.  If
>>> a similar mis-estimation causes the planner to allocate 16GB rather
>>> than 1GB, the opportunity for you to be sad is magnified pretty
>>> considerably.  Therefore, I don't really see the over-estimation bug
>>> fix as being separate from this one.
>>
>> Perhaps. But if you want to absolutely prevent such sadness then maybe you
>> should not set work_mem that high?
>
> I think that's a red herring for a number of reasons.  One, the
> allocation for the hash buckets is only a small portion of the total
> memory.  Two, the fact that you are OK with the hash table growing to
> a certain size does not mean that you want it to start out that big on
> the strength of a frequently-flawed planner estimate.

True, although I don't think the herring is entirely red. It might just 
as well be a mackerel ;-)

>> Anyway, I do see this as a rather orthogonal problem - an independent
>> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
>> it like this (and I'm not particularly opposed to that, assuming someone
>> takes the time to measure how expensive the additional resize actually is),
>> we'll still have to fix the repalloc().
>>
>> So I still fail to see why we shouldn't apply this fix.
>
> In all seriousness, that is fine.  I respect your opinion; I'm just
> telling you mine, which happens to be different.

Likewise.

Let me repeat my previous proposal:
 1) Let's apply the proposed bugfix (and also backpatch it), because
    the current code *is* broken.

 2) Do a bunch of experiments with limiting the initial hash size,
    decide whether the impact on well-estimated cases is acceptable.

I'm strongly opposed to just limiting the initial size without actually 
measuring how expensive the resize is, as that simply adds cost to the 
well-estimated cases (which may easily be the vast majority of all the 
plans, as we tend to notice just the poorly estimated ones).


regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
> buckets are just pointers). No matter how awful estimate you get (or how
> insanely high you set work_mem) you can't exceed this.

OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit.  Right now, if we would have allocated more than 512MB,
we instead fail.  There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.

My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate.  Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.

I guess we'll need to wait for some other opinions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/11/2015 06:55 PM, Robert Haas wrote:
> On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
>> buckets are just pointers). No matter how awful estimate you get (or how
>> insanely high you set work_mem) you can't exceed this.
>
> OK, so this is an interesting point, and I think it clarifies things.
> Essentially, we're arguing about whether a 16GB limit is as good as a
> 512MB limit.  Right now, if we would have allocated more than 512MB,
> we instead fail.  There are two possible solutions:
>
> 1. I'm arguing for maintaining the 512MB limit, but by clamping the
> allocation to 512MB (and the number of buckets accordingly) so that it
> works with fewer buckets instead of failing.
>
> 2. You're arguing for removing the 512MB limit, allowing an initial
> allocation of up to 16GB.

I'm arguing for fixing the existing bug, and then addressing the case of 
over-estimation separately, with proper analysis.

>
> My judgement is that #2 could give some people a nasty surprise, in
> that such a large initial allocation might cause problems, especially
> if driven by a bad estimate.  Your judgement is that this is unlikely
> to be a problem, and that the performance consequences of limiting a
> hash join to an initial allocation of 64 million buckets rather than 2
> billion buckets are the thing to worry about.

Not quite, my judgment is that

- We shouldn't address this in this particular bugfix, because it's a
  separate problem (even if we limit the initial allocation, we still
  have to fix the repalloc after we build the Hash).

- I assume the "might cause problems" refers to malloc() issues on some
  platforms. In that case we still have to apply it to both places, not
  just to the initial allocation. I don't know if this is a problem (I
  haven't heard any such reports until now), but if it is we'd better
  address this consistently everywhere, not just this one place.

- I'm not really sure about the impact of the additional resize. I
  surely don't want to significantly penalize the well-estimated cases,
  so I'd like to see some numbers first.

>
> I guess we'll need to wait for some other opinions.
>

OK

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I'm arguing for fixing the existing bug, and then addressing the case of
> over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/11/2015 07:16 PM, Robert Haas wrote:
> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm arguing for fixing the existing bug, and then addressing the case of
>> over-estimation separately, with proper analysis.
>
> Well, this is part of how we're looking at it differently.  I think the
> bug is "we're passing a value to palloc that is too large, so
> sometimes it fails" and the way to fix that is to properly limit the
> value.  You are clearly defining the bug a bit differently.

Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than 
using a function that can handle the actual value.

The proposed bugfix addresses the issue in the most straightforward way, 
without introducing additional considerations about possible 
over-estimations (which the current code completely ignores, so this is 
a new thing). I think bugfixes should not introduce such changes to 
behavior (albeit internal), especially not without any numbers.

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 09/11/2015 07:16 PM, Robert Haas wrote:
>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>> over-estimation separately, with proper analysis.

>> Well, this is part of how we're looking at it differently.  I think the
>> bug is "we're passing a value to palloc that is too large, so
>> sometimes it fails" and the way to fix that is to properly limit the
>> value.  You are clearly defining the bug a bit differently.

> Yes, I see it differently.

> I don't quite understand why limiting the value is more "proper" than 
> using a function that can handle the actual value.

> The proposed bugfix addresses the issue in the most straightforward way, 
> without introducing additional considerations about possible 
> over-estimations (which the current code completely ignores, so this is 
> a new thing). I think bugfixes should not introduce such changes to 
> behavior (albeit internal), especially not without any numbers.

This thread seems to have stalled out...

After re-reading it, I'm going to agree with Robert that we should clamp
the initial pointer-array size to work with palloc (ie, 512MB worth of
pointers, which please recall is going to represent at least 10X that much
in hashtable entries, probably more).  The argument that allowing it to be
larger would be a performance win seems entirely unbased on any evidence,
while the risks of allowing arbitrarily large allocations based on planner
estimates seem pretty obvious to me.  And the argument that such
overestimates are a bug that we can easily fix is based on even less
evidence; in fact, I would dismiss that out of hand as hubris.

Now there is a separate issue about whether we should allow hashtable
resizes to exceed that limit.  There I would vote yes, as long as the
resize is based on arrival of actual data and not estimates (following
Robert's point that the existing uses of repalloc_huge are driven by
actual need).

So I don't like any of the proposed patches exactly as-is.

BTW, just looking at the code in question, it seems to be desperately
in need of editorial review.  A couple of examples:

* Some of the code goes to the trouble of maintaining a
log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
know about that and recomputes the value.

* ExecHashIncreaseNumBuckets starts out with a run-time test on something
that its sole caller has just Assert()'d to not be the case, and which
really ought to be impossible with or without that Assert.

* ExecHashTableInsert believes it can only increase nbuckets_optimal
if we're still in the first batch, but MultiExecHash() will call
ExecHashIncreaseNumBuckets at the end of input-acquisition whether we've
created more than one batch or not.  Meanwhile, ExecHashIncreaseNumBatches
thinks it can change the number of buckets in any case.  Maybe that's all
okay, but at the very least those tests ought to look like they'd heard of
each other.  And of those three places, having the safety check where it
is seems like the least reasonable place.  Tracking an "optimal" number
of buckets seems like an activity that should not be constrained by
whether we have any hope of being able to apply the number.

So I'm not having a lot of faith that there aren't other bugs in the
immediate area.
        regards, tom lane



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:
Hi,

On 09/23/2015 11:25 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 09/11/2015 07:16 PM, Robert Haas wrote:
>>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>>> over-estimation separately, with proper analysis.
>
>>> Well, this is part of how we're looking at it differently.  I think the
>>> bug is "we're passing a value to palloc that is too large, so
>>> sometimes it fails" and the way to fix that is to properly limit the
>>> value.  You are clearly defining the bug a bit differently.
>
>> Yes, I see it differently.
>
>> I don't quite understand why limiting the value is more "proper" than
>> using a function that can handle the actual value.
>
>> The proposed bugfix addresses the issue in the most straightforward way,
>> without introducing additional considerations about possible
>> over-estimations (which the current code completely ignores, so this is
>> a new thing). I think bugfixes should not introduce such changes to
>> behavior (albeit internal), especially not without any numbers.
>
> This thread seems to have stalled out...
>
> After re-reading it, I'm going to agree with Robert that we should
> clamp the initial pointer-array size to work with palloc (ie, 512MB
> worth of pointers, which please recall is going to represent at
> least 10X that much in hashtable entries, probably more).

10x that much in entries? Do you mean NTUP_PER_BUCKET? Because that was 
reduced to 1 last year as part of the hashjoin improvements. So we do 
have more buckets than tuples (to keep load factor below 1.0). It's 
still true the entries will need more space than buckets (because of 
headers and such), but it may easily get well below 10x.

In the example reported by KaiGai-san, the entries are 8B wide (~24B 
with header), while buckets are 8B. That's 1:3 ratio. It is a bit 
extreme example because in other cases the tuples will be wider.
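
(The arithmetic behind that ratio, as a standalone check - an 8B bucket
pointer vs. a ~24B entry, per the example above:)

    #include <stdio.h>

    int main(void)
    {
        double bucket = 8.0;            /* one pointer per bucket */
        double entry  = 16.0 + 8.0;     /* ~16B header + 8B of data */

        printf("buckets : entries = %.0f : %.0f\n", bucket / 8.0, entry / 8.0);
        printf("buckets are ~%.0f%% of the per-tuple memory\n",
               100.0 * bucket / (bucket + entry));
        return 0;
    }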

It also seems to me that the higher the ratio, the lower the need to 
actually impose such a limit, because it increases the natural pressure 
keeping the bucket array small (as both buckets and entries need to share 
the limited work_mem).

> The argument that allowing it to be larger would be a performance win
> seems entirely unbased on any evidence, while the risks of allowing
> arbitrarily large allocations based on planner estimates seem pretty
> obvious to me.

Do we agree that resizing the hash table is not free? Because my
argument was that we're forcing the well-estimated cases to do
additional resize, so maybe we should measure the impact first.

Now, maybe it does not really matter in this case - we probably get 
slightly inaccurate estimates all the time. Not terribly wrong, but 
enough to make the initial number of buckets too low, so we may end up 
doing the resize anyway. Also, if we're dealing with hash tables of 
this size, we're probably dealing with a much larger outer relation and 
the additional resize is going to be just noise ...

I however quite dislike the dismissal of the possible impact. It should 
be the responsibility of the person introducing the change to show that 
no such impact actually exists, not just waving it off as "unbased on 
any evidence" when there's no evidence presented.

Had I been adding such a limit, I'd have done at least some testing and 
presented the results here. Perhaps I'm a bit over-protective of this piece of 
code as I've spent quite a bit of time getting it faster, but I believe 
the principle that the person proposing a change should demonstrate the 
(lack of) performance impact is sound.

> And the argument that such overestimates are a bug that we can easily
> fix is based on even less evidence; in fact, I would dismiss that out
> of hand as hubris.

I don't think anyone was suggesting the overestimates are easy to fix 
(or even possible to fix in general). If I ever claimed that, I was 
clearly wrong.

>
> Now there is a separate issue about whether we should allow hashtable
> resizes to exceed that limit.  There I would vote yes, as long as the
> resize is based on arrival of actual data and not estimates (following
> Robert's point that the existing uses of repalloc_huge are driven by
> actual need).
>
> So I don't like any of the proposed patches exactly as-is.
>
> BTW, just looking at the code in question, it seems to be desperately
> in need of editorial review.  A couple of examples:
>
> * Some of the code goes to the trouble of maintaining a
> log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
> know about that and recomputes the value.

Yeah, that's a stupid bug.

>
> * ExecHashIncreaseNumBuckets starts out with a run-time test on
> something that its sole caller has just Assert()'d to not be the
> case, and which really ought to be impossible with or without that
> Assert.

Yeah, right. Excessively defensive coding on my side (I think I've added 
the Assert later, or something).

>
> * ExecHashTableInsert believes it can only increase nbuckets_optimal
> if we're still in the first batch, but MultiExecHash() will call
> ExecHashIncreaseNumBuckets at the end of input-acquisition whether
> we've created more than one batch or not.  Meanwhile,
> ExecHashIncreaseNumBatches thinks it can change the number of buckets
> in any case.  Maybe that's all okay, but at the very least those
> tests ought to look like they'd heard of each other.  And of those
> three places, having the safety check where it is seems like the
> least reasonable place.

I don't quite understand where you see the problem. I believe the logic 
is sound, although it could use a comment explaining it. Let me explain.

The only place where we actually mess with the number of buckets is 
ExecHashTableInsert() around line 880. All the other places are resizes 
of the hash table, rebuilding it to the optimal number of buckets.

The rebuild can happen in two situations - either at the end of the 
input (which is what happens in MultiExecHash), or when we start 
batching (ExecHashIncreaseNumBatches).

Once we're batching, there's no point in further messing with buckets 
because we assume using more buckets would overflow work_mem.

The fact that we're doing the resize within ExecHashIncreaseNumBatches 
is merely because of efficiency - we are going to walk the hash table 
anyway (because we need to get rid of ~1/2 the tuples), so we simply 
rebuild the buckets too.

> Tracking an "optimal" number of buckets seems like an activity that
> should not be constrained by whether we have any hope of being able
> to apply the number.

Not sure I understand?

>
> So I'm not having a lot of faith that there aren't other bugs in the
> immediate area.

Well, I surely am not infallible, but so far I haven't seen any proof of 
an actual bug, except for needlessly recomputing a value (which can only 
happen once), and an excessive check on an already asserted value.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I however quite dislike the dismissal of the possible impact. It should be
> the responsibility of the person introducing the change to show that no such
> impact actually exists, not just waving it off as "unbased on any evidence"
> when there's no evidence presented.

So, we're talking about determining the behavior in a case that
currently fails.  Making it behave like a case that currently works
can't but be an improvement.  Making it do something that currently
never happens might be better still, or it might be equivalent, or it
might be worse.  I just don't buy the argument that somebody's got to
justify on performance grounds a decision not to allocate more memory
than we currently ever allocate.  That seems 100% backwards to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/24/2015 01:51 PM, Robert Haas wrote:
> On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I however quite dislike the dismissal of the possible impact. It should be
>> the responsibility of the person introducing the change to show that no such
>> impact actually exists, not just waving it off as "unbased on any evidence"
>> when there's no evidence presented.
>
> So, we're talking about determining the behavior in a case that
> currently fails. Making it behave like a case that currently works
> can't but be an improvement. Making it do something that currently
> never happens might be better still, or it might be equivalent, or
> it might be worse. I just don't buy the argument that somebody's got
> to justify on performance grounds a decision not to allocate more
> memory than we currently ever allocate. That seems 100% backwards to
> me.

Yes, it's true that if you hit the issue it fails, so I understand your 
view that it's a win to fix this by introducing the (arbitrary) limit. I 
disagree with this view because the behavior changes at the limit - if you 
get a good estimate just below the limit, you get no resize; if you get a 
slightly higher estimate, you get a resize.

So while it does not introduce behavior change in this particular case 
(because it fails, as you point out), it introduces a behavior change in 
general - it simply triggers behavior that does not happen below the 
limit. Would we accept the change if the proposed limit was 256MB, for 
example?

It also seems to me that we don't really need the hash table until after 
MultiExecHash(), so maybe building the hash table incrementally is just 
unnecessary and we could simply track the optimal number of buckets and 
build the buckets at the end of MultiExecHash (essentially at the place 
where we do the resize now). We'd have to walk the tuples and insert 
them into the buckets, but that seems more efficient than the 
incremental build (no data to support that at this point).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So while it does not introduce behavior change in this particular case
> (because it fails, as you point out), it introduces a behavior change in
> general - it simply triggers behavior that does not happen below the limit.
> Would we accept the change if the proposed limit was 256MB, for example?

So, I'm a huge fan of arbitrary limits.

That's probably the single thing I'll say this year that sounds most
like a troll, but it isn't.  I really, honestly believe that.
Doubling things is very sensible when they are small, but at some
point it ceases to be sensible.  The fact that we can't set a
black-and-white threshold as to when we've crossed over that line
doesn't mean that there is no line.  We can't imagine that the
occasional 32GB allocation when 4GB would have been optimal is no more
problematic than the occasional 32MB allocation when 4MB would have
been optimal.  Where exactly to put the divider is subjective, but
"what palloc will take" is not an obviously unreasonable barometer.

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Of course, if we can postpone sizing the hash table until after the
> input size is known, as you suggest, then that would be better still
> (but not back-patch material).

AFAICS, it works that way today as long as the hash fits in memory
(ie, single-batch).  We load into a possibly seriously undersized hash
table, but that won't matter for performance until we start probing it.
At the conclusion of loading, MultiExecHash will call
ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
table.  I doubt this can be improved on much.

It would be good if we could adjust the numbuckets choice at the
conclusion of the input phase for the multi-batch case as well.
The code appears to believe that wouldn't work, but I'm not sure if
it's right about that, or how hard it'd be to fix if so.
        regards, tom lane



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/24/2015 05:09 PM, Robert Haas wrote:
> On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So while it does not introduce behavior change in this particular
>> case (because it fails, as you point out), it introduces a behavior
>> change in general - it simply triggers behavior that does not
>> happen below the limit. Would we accept the change if the proposed
>> limit was 256MB, for example?
>
> So, I'm a huge fan of arbitrary limits.
>
> That's probably the single thing I'll say this year that sounds most
>  like a troll, but it isn't. I really, honestly believe that.
> Doubling things is very sensible when they are small, but at some
> point it ceases to be sensible. The fact that we can't set a
> black-and-white threshold as to when we've crossed over that line
> doesn't mean that there is no line. We can't imagine that the
> occasional 32GB allocation when 4GB would have been optimal is no
> more problematic than the occasional 32MB allocation when 4MB would
> have been optimal. Where exactly to put the divider is subjective,
> but "what palloc will take" is not an obviously unreasonable
> barometer.

There are two machines - one with 32GB of RAM and work_mem=2GB, the 
other one with 256GB of RAM and work_mem=16GB. The machines are hosting 
about the same data, just scaled accordingly (~8x more data on the large 
machine).

Let's assume there's a significant over-estimate - we expect to get 
about 10x the actual number of tuples, and the hash table is expected to 
almost exactly fill work_mem. Using the 1:3 ratio (as in the query at 
the beginning of this thread) we'll use ~512MB and ~4GB for the buckets, 
and the rest is for entries.

Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the 
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and 
~3.5GB (~1.3%) on the large machine.

How does it make any sense to address the 1.3% and not the 13%?


> Of course, if we can postpone sizing the hash table until after the
> input size is known, as you suggest, then that would be better still
>  (but not back-patch material).

This dynamic resize is 9.5-only anyway.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/24/2015 05:18 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Of course, if we can postpone sizing the hash table until after the
>> input size is known, as you suggest, then that would be better still
>> (but not back-patch material).
>
> AFAICS, it works that way today as long as the hash fits in memory
> (ie, single-batch).  We load into a possibly seriously undersized hash
> table, but that won't matter for performance until we start probing it.
> At the conclusion of loading, MultiExecHash will call
> ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
> table.  I doubt this can be improved on much.
>
> It would be good if we could adjust the numbuckets choice at the
> conclusion of the input phase for the multi-batch case as well.
> The code appears to believe that wouldn't work, but I'm not sure if
> it's right about that, or how hard it'd be to fix if so.

So you suggest using a small hash table even when we expect batching?

That would be rather difficult to do because of the way we derive 
buckets and batches from the hash value - they must not overlap. The 
current code simply assumes that once we start batching the number of 
bits needed for buckets does not change anymore.

It's possible to rework of course - the initial version of the patch 
actually did just that (although it was broken in other ways).

But I think the real problem here is the batching itself - if we 
overestimate and start batching (while we could actually run with a 
single batch), we've already lost.

But what about computing the number of expected batches, but always 
start executing assuming no batching? And only if we actually fill 
work_mem, we start batching and use the expected number of batches?

I.e.

1) estimate nbatches, but use nbatches=1

2) run until exhausting work_mem

3) start batching, with the initially estimated number of batches
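
A toy simulation of that control flow (standalone, made-up numbers, not
backend code - just to show that a well-fitting input never pays for the
over-estimated batch count):

    #include <stdio.h>

    int main(void)
    {
        long    work_mem        = 64L * 1024 * 1024;    /* pretend work_mem = 64MB */
        long    tuple_size      = 64;                   /* pretend per-tuple footprint */
        long    ntuples_actual  = 500000;               /* what actually arrives */
        int     nbatch_estimate = 32;                   /* planner's (over)estimate */

        int     nbatch     = 1;                         /* 1) start unbatched */
        long    space_used = 0;

        for (long i = 0; i < ntuples_actual; i++)
        {
            space_used += tuple_size;
            if (nbatch == 1 && space_used > work_mem)   /* 2) work_mem exhausted */
            {
                nbatch = nbatch_estimate;               /* 3) only now start batching */
                printf("switched to %d batches after %ld tuples\n", nbatch, i + 1);
            }
        }

        if (nbatch == 1)
            printf("stayed single-batch: %ld bytes used, never spilled\n", space_used);
        return 0;
    }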


regards


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> But what about computing the number of expected batches, but always 
> start executing assuming no batching? And only if we actually fill 
> work_mem, we start batching and use the expected number of batches?

Hmm.  You would likely be doing the initial data load with a "too small"
numbuckets for single-batch behavior, but if you successfully loaded all
the data then you could resize the table at little penalty.  So yeah,
that sounds like a promising approach for cases where the initial rowcount
estimate is far above reality.

But I kinda thought we did this already, actually.
        regards, tom lane



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/24/2015 07:04 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> But what about computing the number of expected batches, but always
>> start executing assuming no batching? And only if we actually fill
>> work_mem, we start batching and use the expected number of batches?
>
> Hmm. You would likely be doing the initial data load with a "too
> small" numbuckets for single-batch behavior, but if you successfully
> loaded all the data then you could resize the table at little
> penalty. So yeah, that sounds like a promising approach for cases
> where the initial rowcount estimate is far above reality.

I don't understand the comment about "too small" numbuckets - isn't 
doing that the whole point of using the proposed limit? The batching is 
merely a consequence of how bad the over-estimate is.

> But I kinda thought we did this already, actually.

I don't think so - I believe we haven't modified this aspect at all. It 
may not have been as pressing thanks to NTUP_PER_BUCKET=10 in the past.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> There are two machines - one with 32GB of RAM and work_mem=2GB, the other
> one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
> same data, just scaled accordingly (~8x more data on the large machine).
>
> Let's assume there's a significant over-estimate - we expect to get about
> 10x the actual number of tuples, and the hash table is expected to almost
> exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
> of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
> for entries.
>
> Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
> buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
> ~3.5GB (~1.3%) on the large machine.
>
> How does it make any sense to address the 1.3% and not the 13%?

One of us is confused, because from here it seems like 448MB is 1.3%
of 32GB, not 13%.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/24/2015 07:42 PM, Robert Haas wrote:
> On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> There are two machines - one with 32GB of RAM and work_mem=2GB, the other
>> one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
>> same data, just scaled accordingly (~8x more data on the large machine).
>>
>> Let's assume there's a significant over-estimate - we expect to get about
>> 10x the actual number of tuples, and the hash table is expected to almost
>> exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
>> of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
>> for entries.
>>
>> Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
>> buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
>> ~3.5GB (~1.3%) on the large machine.
>>
>> How does it make any sense to address the 1.3% and not the 13%?
>
> One of us is confused, because from here it seems like 448MB is 1.3%
> of 32GB, not 13%.

Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the 
over-estimate in one case and not the other? We're wasting the same 
fraction of memory in both cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Robert Haas
Date:
On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Meh, you're right - I got the math wrong. It's 1.3% in both cases.
>
> However the question still stands - why should we handle the over-estimate
> in one case and not the other? We're wasting the same fraction of memory in
> both cases.

Well, I think we're going around in circles here.  It doesn't seem
likely that either of us will convince the other.

But for the record, I agree with you that in the scenario you lay out,
it's about the same problem in both cases.  I could argue that
it's slightly different because of [ tedious and somewhat tenuous
argument omitted ], but I'll spare you that.  However, consider the
alternative scenario where, on the same machine, perhaps even in the
same query, we perform two hash joins, one of which involves hashing a
small table (say, 2MB) and one of which involves hashing a big table
(say, 2GB).  If the small query uses twice the intended amount of
memory, probably nothing bad will happen.  If the big query does the
same thing, a bad outcome is much more likely.  Say the machine has
16GB of RAM.  Well, a 2MB over-allocation is not going to break the
world.  A 2GB over-allocation very well might.

I really don't see why this is a controversial proposition.  It seems
clear as daylight from here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: DBT-3 with SF=20 got failed

From
Tomas Vondra
Date:

On 09/25/2015 02:54 AM, Robert Haas wrote:
> On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Meh, you're right - I got the math wrong. It's 1.3% in both cases.
>>
>> However the question still stands - why should we handle the
>> over-estimate in one case and not the other? We're wasting the
>> same fraction of memory in both cases.
>
> Well, I think we're going around in circles here. It doesn't seem
> likely that either of us will convince the other.

Let's agree to disagree ;-) That's perfectly OK, no hard feelings.

> But for the record, I agree with you that in the scenario you lay
> out, it's about the same problem in both cases. I could argue
> that it's slightly different because of [ tedious and somewhat
> tenuous argument omitted ], but I'll spare you that.

OK, although that kind of prevents further discussion.

> However, consider the alternative scenario where, on the same
> machine, perhaps even in the same query, we perform two hash joins,
> one of which involves hashing a small table (say, 2MB) and one of
> which involves hashing a big table (say, 2GB). If the small query
> uses twice the intended amount of memory, probably nothing bad will
> happen. If the big query does the same thing, a bad outcome is much
> more likely. Say the machine has 16GB of RAM. Well, a 2MB
> over-allocation is not going to break the world. A 2GB
> over-allocation very well might.

I asked about case A. You've presented case B and shown that the limit 
indeed seems to help there. But I don't see how that makes any 
difference in case A, which is what I asked about.

> I really don't see why this is a controversial proposition. It seems
> clear as daylight from here.

I wouldn't say controversial, but I do see the proposed solution as 
misguided, because we're fixing A and claiming to also fix B. Not only 
are we not really fixing B, we may actually make things needlessly 
slower for people who don't have problems with B at all.

We've run into a problem with allocating more than MaxAllocSize. The 
proposed fix (imposing an arbitrary limit) is also supposedly fixing 
the over-estimation problems, but actually it is not (IMNSHO).

And I think my view is supported by the fact that solutions that seem 
to actually fix the over-estimation properly have emerged - I mean the 
"let's not build the buckets at all, until the very end" and "let's 
start with nbatches=0" ideas discussed yesterday. (And I'm not saying 
that just because I proposed those two things.)
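
To sketch what I mean by the first of those two ideas, here's a rough 
standalone illustration (invented names, plain C, not an actual patch) 
- the bucket array gets sized only once the real tuple count is known:

#include <stdlib.h>

/*
 * Hypothetical sketch of "don't build the buckets until the very end".
 * Build-side tuples are first collected into a plain linked list; the
 * bucket array is sized from the actual tuple count afterwards, so a
 * 10x planner over-estimate cannot inflate it.  (Error handling and
 * the tuple payload are omitted.)
 */
typedef struct Tuple
{
    struct Tuple *next;         /* also used to chain tuples in a bucket */
    unsigned int  hashvalue;
    /* ... tuple payload would follow ... */
} Tuple;

/* smallest power of two >= ntuples, with a floor and a sanity cap */
static long
choose_nbuckets(long ntuples)
{
    long    nbuckets = 1024;

    while (nbuckets < ntuples && nbuckets < (1L << 28))
        nbuckets <<= 1;
    return nbuckets;
}

static Tuple **
build_buckets_at_the_end(Tuple *collected, long ntuples_total)
{
    long    nbuckets = choose_nbuckets(ntuples_total);
    Tuple **buckets = calloc(nbuckets, sizeof(Tuple *));

    while (collected != NULL)
    {
        Tuple  *tup = collected;
        long    b = tup->hashvalue & (nbuckets - 1);

        collected = tup->next;
        tup->next = buckets[b];
        buckets[b] = tup;
    }
    return buckets;
}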

Anyway, I think you're right that we're going in circles here. I think 
we've both presented all the arguments we had and we still disagree. 
I'm not going to continue with this - I'm unlikely to win an argument 
against two committers if it hasn't happened by now. Thanks for the 
discussion, though.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: DBT-3 with SF=20 got failed

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Anyway, I think you're right we're going in circles here. I think we 
> both presented all the arguments we had and we still disagree. I'm not 
> going to continue with this - I'm unlikely to win an argument against 
> two committers if that didn't happen until now. Thanks for the 
> discussion though.

Since we're hard up against the 9.5beta1 deadline, I've made an executive
decision to commit just the minimal change, which I view as being to
constrain the array size to MaxAllocSize where it has been all along.

I found a second rather serious bug in the new hash code while doing that
--- it failed to ensure nbuckets was actually a power of 2 --- which did
not improve my opinion of it one bit.
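
In outline, that minimal change amounts to something like the following
(a simplified standalone sketch with made-up names, not the actual
nodeHash.c code):

#include <stddef.h>

/* Mirrors palloc's MaxAllocSize limit of 1GB - 1 bytes. */
#define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff)

typedef void *HashJoinTupleRef;     /* each bucket slot is one pointer */

static int
clamp_nbuckets(double dbuckets)
{
    size_t  max_pointers = MAX_ALLOC_SIZE / sizeof(HashJoinTupleRef);
    size_t  nbuckets;
    size_t  pow2;

    /* keep the bucket array allocatable with plain palloc() */
    if (dbuckets > (double) max_pointers)
        dbuckets = (double) max_pointers;
    nbuckets = (size_t) dbuckets;
    if (nbuckets < 1024)
        nbuckets = 1024;

    /* force a power of two so "hashvalue & (nbuckets - 1)" stays valid;
     * rounding down keeps the array under the cap */
    for (pow2 = 1; pow2 * 2 <= nbuckets; pow2 *= 2)
        ;
    return (int) pow2;
}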

It's clear from this discussion that there's room for further improvement
in the hashtable sizing behavior, but I think that's 9.6 material at
this point.
        regards, tom lane