Thread: Odd out of memory problem.

Odd out of memory problem.

From
Andrew Dunstan
Date:
I'm not sure if this is a bug, but I have been wrestling with this problem 
for a client.

Platform is Windows Server 2003 64-bit, PostgreSQL 8.4.8, 4GB RAM, 
running on an Amazon VM.

Shared buffers: 512MB, work_mem: 25MB. There are only a handful of 
connections to the database, and no other activity.

We are seeing the error shown below. The table in question has two 
columns (Oid, int) and roughly 43m rows. The only other thing remarkable 
about the settings is that effective_cache_size is set to 5GB, which is 
clearly too high, but surely that shouldn't cause a memory error.

I'm really perplexed as to why this fairly simple query should cause an 
out of memory error:
   select loid, max(pageno) from ldata group by loid order by 2 desc
   limit 10;

I can't see what I might be missing.


cheers

andrew
   TopMemoryContext: 49816 total in 6 blocks; 5384 free (7 chunks); 44432 used
      TopTransactionContext: 8192 total in 1 blocks; 7696 free (0 chunks); 496 used
      Record information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
      Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
      Operator class cache: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
      Operator lookup cache: 24576 total in 2 blocks; 14072 free (6 chunks); 10504 used
      MessageContext: 40960 total in 3 blocks; 29920 free (6 chunks); 11040 used
      smgr relation table: 8192 total in 1 blocks; 2816 free (0 chunks); 5376 used
      TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
      Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
      PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
        PortalHeapMemory: 1024 total in 1 blocks; 920 free (0 chunks); 104 used
          ExecutorState: 8192 total in 1 blocks; 2144 free (1 chunks); 6048 used
            TupleSort: 40984 total in 3 blocks; 24208 free (10 chunks); 16776 used
            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
            AggContext: 864018432 total in 127 blocks; 3400 free (110 chunks); 864015032 used
              TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used
            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
      Relcache by OID: 8192 total in 1 blocks; 3376 free (0 chunks); 4816 used
      CacheMemoryContext: 667696 total in 20 blocks; 169960 free (2 chunks); 497736 used
        pg_shdepend_reference_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_depend_depender_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
        pg_depend_reference_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
        pg_largeobject_loid_pn_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_database_datname_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_index_indrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
        pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_enum_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_class_relname_nsp_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_foreign_server_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_statistic_relid_att_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_cast_source_target_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_language_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_authid_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_amop_fam_strat_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
        pg_index_indexrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_ts_template_tmplname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_map_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
        pg_opclass_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_auth_members_member_role_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_dict_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_conversion_default_index: 1024 total in 1 blocks; 128 free (0 chunks); 896 used
        pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
        pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_enum_typid_label_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_user_mapping_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
        pg_type_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_aggregate_fnoid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_constraint_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_parser_prsname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_cfgname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_parser_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_operator_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_namespace_nspname_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_ts_template_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_amop_opr_fam_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_ts_dict_dictname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_auth_members_role_member_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_type_typname_nsp_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_opfamily_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_class_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
        pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_proc_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_language_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_namespace_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_database_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_amproc_fam_proc_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
        pg_authid_rolname_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_foreign_server_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_conversion_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_user_mapping_user_server_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_conversion_name_nsp_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
      MdSmgr: 8192 total in 1 blocks; 4240 free (0 chunks); 3952 used
      LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
      Timezones: 79320 total in 2 blocks; 5968 free (0 chunks); 73352 used
      ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
   2012-03-25 17:18:01 EDT ERROR:  out of memory
   2012-03-25 17:18:01 EDT DETAIL:  Failed on request of size 20.
   2012-03-25 17:18:01 EDT STATEMENT:  select loid, max(pageno) from ldata group by loid order by 2 desc limit 10;
 



Re: Odd out of memory problem.

From
Hans-Jürgen Schönig
Date:
hello,

Does the problem show up on about 2% of all boxes after 2 weeks or so?
We had a similar problem on UNIX as well. It even materialized on 100 identical boxes (on 2% of them). It pops up
randomly and never stops …
I checked some code paths. Some of those messages are direct output via stderr (not even elog).
Unfortunately I did not manage to find a box where I could use GDB to attack the problem.
It was 8.4.8 as well.

Do you see a certain workload which would make the problem reproducible?

regards,
    hans



On Mar 26, 2012, at 5:03 PM, Andrew Dunstan wrote:

>
> I'm not sure if this is a bug, but I have been wrestling with this problem for a client.
>
> Platform is Windows Servers 2003 64 bit, PostgreSQL 8.4.8., 4Gb RAM, running on an Amazon VM.
>
> Shared buffers: 512Mb, work_mem: 25Mb. There are only a handful of connections to the database, and no other
activity.
>
> We are seeing the error shown below. The table in question has two columns (Oid, int) and roughly 43m rows. The only
> other thing remarkable about the settings is that effective_cache_size is set to 5Gb, which is clearly too high, but
> surely that shouldn't cause a memory error.
>
> I'm really perplexed as to why this fairly simple query should cause an out of memory error:
>
>   select loid, max(pageno) from ldata group by loid order by 2 desc
>   limit 10;
>
> I can't see what I might be missing.
>
>
> cheers
>
> andrew
>
>   TopMemoryContext: 49816 total in 6 blocks; 5384 free (7 chunks);
>   44432 used
>      TopTransactionContext: 8192 total in 1 blocks; 7696 free (0
>   chunks); 496 used
>      Record information cache: 8192 total in 1 blocks; 1800 free (0
>   chunks); 6392 used
>      Type information cache: 8192 total in 1 blocks; 1800 free (0
>   chunks); 6392 used
>      Operator class cache: 8192 total in 1 blocks; 3848 free (0
>   chunks); 4344 used
>      Operator lookup cache: 24576 total in 2 blocks; 14072 free (6
>   chunks); 10504 used
>      MessageContext: 40960 total in 3 blocks; 29920 free (6 chunks);
>   11040 used
>      smgr relation table: 8192 total in 1 blocks; 2816 free (0
>   chunks); 5376 used
>      TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0
>   chunks); 16 used
>      Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
>      PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
>        PortalHeapMemory: 1024 total in 1 blocks; 920 free (0 chunks);
>   104 used
>          ExecutorState: 8192 total in 1 blocks; 2144 free (1 chunks);
>   6048 used
>            TupleSort: 40984 total in 3 blocks; 24208 free (10 chunks);
>   16776 used
>            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
>            AggContext: 864018432 total in 127 blocks; 3400 free (110
>   chunks); 864015032 used
>              TupleHashTable: 619175960 total in 95 blocks; 821528 free
>   (331 chunks); 618354432 used
>            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
>            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
>            ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
>      Relcache by OID: 8192 total in 1 blocks; 3376 free (0 chunks);
>   4816 used
>      CacheMemoryContext: 667696 total in 20 blocks; 169960 free (2
>   chunks); 497736 used
>        pg_shdepend_reference_index: 1024 total in 1 blocks; 240 free
>   (0 chunks); 784 used
>        pg_depend_depender_index: 1024 total in 1 blocks; 152 free (0
>   chunks); 872 used
>        pg_depend_reference_index: 1024 total in 1 blocks; 152 free (0
>   chunks); 872 used
>        pg_largeobject_loid_pn_index: 1024 total in 1 blocks; 280 free
>   (0 chunks); 744 used
>        pg_database_datname_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_index_indrelid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 152 free
>   (0 chunks); 872 used
>        pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 344
>   free (0 chunks); 680 used
>        pg_enum_oid_index: 1024 total in 1 blocks; 344 free (0 chunks);
>   680 used
>        pg_class_relname_nsp_index: 1024 total in 1 blocks; 240 free (0
>   chunks); 784 used
>        pg_foreign_server_oid_index: 1024 total in 1 blocks; 344 free
>   (0 chunks); 680 used
>        pg_statistic_relid_att_index: 1024 total in 1 blocks; 240 free
>   (0 chunks); 784 used
>        pg_cast_source_target_index: 1024 total in 1 blocks; 240 free
>   (0 chunks); 784 used
>        pg_language_name_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_authid_oid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_amop_fam_strat_index: 1024 total in 1 blocks; 88 free (0
>   chunks); 936 used
>        pg_index_indexrelid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_ts_template_tmplname_index: 1024 total in 1 blocks; 280 free
>   (0 chunks); 744 used
>        pg_ts_config_map_index: 1024 total in 1 blocks; 192 free (0
>   chunks); 832 used
>        pg_opclass_oid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 344
>   free (0 chunks); 680 used
>        pg_auth_members_member_role_index: 1024 total in 1 blocks; 280
>   free (0 chunks); 744 used
>        pg_ts_dict_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_conversion_default_index: 1024 total in 1 blocks; 128 free
>   (0 chunks); 896 used
>        pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 88
>   free (0 chunks); 936 used
>        pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 240
>   free (0 chunks); 784 used
>        pg_enum_typid_label_index: 1024 total in 1 blocks; 280 free (0
>   chunks); 744 used
>        pg_ts_config_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_user_mapping_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 192 free
>   (0 chunks); 832 used
>        pg_type_oid_index: 1024 total in 1 blocks; 304 free (0 chunks);
>   720 used
>        pg_aggregate_fnoid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_constraint_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 280 free
>   (0 chunks); 744 used
>        pg_ts_parser_prsname_index: 1024 total in 1 blocks; 280 free (0
>   chunks); 744 used
>        pg_ts_config_cfgname_index: 1024 total in 1 blocks; 280 free (0
>   chunks); 744 used
>        pg_ts_parser_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_operator_oid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_namespace_nspname_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_ts_template_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_amop_opr_fam_index: 1024 total in 1 blocks; 240 free (0
>   chunks); 784 used
>        pg_ts_dict_dictname_index: 1024 total in 1 blocks; 280 free (0
>   chunks); 744 used
>        pg_auth_members_role_member_index: 1024 total in 1 blocks; 280
>   free (0 chunks); 744 used
>        pg_type_typname_nsp_index: 1024 total in 1 blocks; 240 free (0
>   chunks); 784 used
>        pg_opfamily_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_class_oid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 152
>   free (0 chunks); 872 used
>        pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 240
>   free (0 chunks); 784 used
>        pg_proc_oid_index: 1024 total in 1 blocks; 304 free (0 chunks);
>   720 used
>        pg_language_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_namespace_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_database_oid_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_amproc_fam_proc_index: 1024 total in 1 blocks; 88 free (0
>   chunks); 936 used
>        pg_authid_rolname_index: 1024 total in 1 blocks; 304 free (0
>   chunks); 720 used
>        pg_foreign_server_name_index: 1024 total in 1 blocks; 344 free
>   (0 chunks); 680 used
>        pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 240
>   free (0 chunks); 784 used
>        pg_conversion_oid_index: 1024 total in 1 blocks; 344 free (0
>   chunks); 680 used
>        pg_user_mapping_user_server_index: 1024 total in 1 blocks; 280
>   free (0 chunks); 744 used
>        pg_conversion_name_nsp_index: 1024 total in 1 blocks; 280 free
>   (0 chunks); 744 used
>      MdSmgr: 8192 total in 1 blocks; 4240 free (0 chunks); 3952 used
>      LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks);
>   4280 used
>      Timezones: 79320 total in 2 blocks; 5968 free (0 chunks); 73352 used
>      ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
>   2012-03-25 17:18:01 EDT ERROR:  out of memory
>   2012-03-25 17:18:01 EDT DETAIL:  Failed on request of size 20.
>   2012-03-25 17:18:01 EDT STATEMENT:  select loid, max(pageno) from
>   ldata group by loid order by 2 desc limit 10;
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Odd out of memory problem.

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I'm really perplexed as to why this fairly simple query should cause an 
> out of memory error:
>     select loid, max(pageno) from ldata group by loid order by 2 desc
>     limit 10;

Looks like the group by/aggregate step is eating lots of memory:

>              AggContext: 864018432 total in 127 blocks; 3400 free (110
>     chunks); 864015032 used
>                TupleHashTable: 619175960 total in 95 blocks; 821528 free
>     (331 chunks); 618354432 used

A guess is that there are a huge number of distinct values of "loid" but
the planner fails to realize that and tries to use a hash aggregation.
Could we see EXPLAIN output for this query?
        regards, tom lane
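Tom's guess is directly checkable by comparing the planner's distinct-value estimate with reality. A hypothetical session against the ldata table described upthread (note the count must seq-scan all ~114M rows, so it is slow):

```sql
-- What the planner believes about loid; with no ANALYZE data this
-- returns no row, and the planner falls back to a default guess.
SELECT n_distinct FROM pg_stats
WHERE tablename = 'ldata' AND attname = 'loid';

-- What is actually true: the number of groups a hash aggregate
-- would have to hold in memory at once.
SELECT count(DISTINCT loid) FROM ldata;
```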


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 11:18 AM, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> I'm really perplexed as to why this fairly simple query should cause an
>> out of memory error:
>>      select loid, max(pageno) from ldata group by loid order by 2 desc
>>      limit 10;
> Looks like the group by/aggregate step is eating lots of memory:
>
>>               AggContext: 864018432 total in 127 blocks; 3400 free (110
>>      chunks); 864015032 used
>>                 TupleHashTable: 619175960 total in 95 blocks; 821528 free
>>      (331 chunks); 618354432 used
> A guess is that there are a huge number of distinct values of "loid" but
> the planner fails to realize that and tries to use a hash aggregation.
> Could we see EXPLAIN output for this query?

Currently it shows:

Limit  (cost=19443025.87..19443025.89 rows=10 width=8)
  ->  Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
        Sort Key: (max(pageno))
        ->  GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168 width=8)
              ->  Sort  (cost=18537785.99..18823953.97 rows=114467192 width=8)
                    Sort Key: loid
                    ->  Seq Scan on ldata  (cost=0.00..1651163.92 rows=114467192 width=8)



The table might have been analysed since I ran the query, though.

To answer Hans' question, we have seen the problem in other contexts. We 
first noticed this problem in a failure to restore large objects when 
running pg_restore.  The database has 43,998,486 LOs on 114,467,137 
pages. The largest of these is 2160 pages. We're currently running a 
test to see if we can successfully restore LOs by doing them in smaller 
batches rather than in a single transaction. However, this one seemed 
even odder than the LO problem.


cheers

andrew





Re: Odd out of memory problem.

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 03/26/2012 11:18 AM, Tom Lane wrote:
>> Could we see EXPLAIN output for this query?

> Currently it shows:

> Limit  (cost=19443025.87..19443025.89 rows=10 width=8
>    ->  Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
>          Sort Key: (max(pageno))
>          ->  GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168 
> width=8)
>                ->  Sort  (cost=18537785.99..18823953.97 rows=114467192 
> width=8)
>                      Sort Key: loid
>                      ->  Seq Scan on ldata  (cost=0.00..1651163.92 
> rows=114467192 width=8)

> The table might have been analysed since I ran the query, though.

That plan should not create a tuple hash table, so I think it's almost
certain that the plan changed.  It might be interesting to remove the
pg_statistic rows for the table and then see what plan you get.
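Tom's suggestion, sketched as a hypothetical session (superuser access is required to modify pg_statistic, and a plain ANALYZE afterwards restores the rows):

```sql
-- Remove the optimizer statistics for the table ...
DELETE FROM pg_statistic WHERE starelid = 'ldata'::regclass;

-- ... then see which plan comes back without them.
EXPLAIN SELECT loid, max(pageno) FROM ldata
GROUP BY loid ORDER BY 2 DESC LIMIT 10;
```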

> To answer Hans' question, we have seen the problem in other contexts. We 
> first noticed this problem in a failure to restore large objects when 
> running pg_restore.

[ scratches head... ]  I don't understand how or why pg_restore would be
executing such a query.
        regards, tom lane


Re: Odd out of memory problem.

From
Greg Stark
Date:
On Mon, Mar 26, 2012 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>              TupleHashTable: 619175960 total in 95 blocks; 821528 free
>   (331 chunks); 618354432 used

I think the plan you showed isn't the plan that's running out of
memory. I think it's running out of memory because it's using a Hash
Aggregate and underestimating the number of distinct elements. I would
set enable_hash_aggregate=false for this query (not for the whole
server, hash aggregates are quite useful in general).
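(The GUC is actually spelled enable_hashagg.) A sketch of the session-local workaround, scoped to a single transaction so the rest of the server keeps using hash aggregation:

```sql
BEGIN;
SET LOCAL enable_hashagg = off;  -- reverts automatically at COMMIT/ROLLBACK
SELECT loid, max(pageno) FROM ldata
GROUP BY loid ORDER BY 2 DESC LIMIT 10;
COMMIT;
```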

This is a long-standing problem with hash aggregates. The problem is
that we don't have a clever way to fall back to a non-hash-aggregate
if the original estimated memory usage turns out to be way off. Just
spilling to disk the way hash joins do would be horrendously
expensive, we think. If we had an approach to doing so it would be
very useful because simply running out of memory is kind of rude.


--
greg


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 12:11 PM, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> On 03/26/2012 11:18 AM, Tom Lane wrote:
>>> Could we see EXPLAIN output for this query?
>> Currently it shows:
>> Limit  (cost=19443025.87..19443025.89 rows=10 width=8
>>     ->   Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
>>           Sort Key: (max(pageno))
>>           ->   GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168
>> width=8)
>>                 ->   Sort  (cost=18537785.99..18823953.97 rows=114467192
>> width=8)
>>                       Sort Key: loid
>>                       ->   Seq Scan on ldata  (cost=0.00..1651163.92
>> rows=114467192 width=8)
>> The table might have been analysed since I ran the query, though.
> That plan should not create a tuple hash table, so I think it's almost
> certain that the plan changed.  It might be interesting to remove the
> pg_statistic rows for the table and then see what plan you get.


Yeah, that gets us:


Limit  (cost=2223492.78..2223492.81 rows=10 width=8)
  ->  Sort  (cost=2223492.78..2223493.28 rows=200 width=8)
        Sort Key: (max(pageno))
        ->  HashAggregate  (cost=2223485.96..2223488.46 rows=200 width=8)
              ->  Seq Scan on ldata  (cost=0.00..1651154.64 rows=114466264 width=8)


>
>> To answer Hans' question, we have seen the problem in other contexts. We
>> first noticed this problem in a failure to restore large objects when
>> running pg_restore.
> [ scratches head... ]  I don't understand how or why pg_restore would be
> executing such a query.
>
>             


It's not. I was explaining that we have seen memory failures in *other* 
contexts, not just this query. The restore fails after many hours on a 
call to lo_write().

cheers

andrew


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 12:20 PM, Greg Stark wrote:
> On Mon, Mar 26, 2012 at 4:03 PM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>>               TupleHashTable: 619175960 total in 95 blocks; 821528 free
>>    (331 chunks); 618354432 used
> I think the plan you showed isn't the plan that's running out of
> memory. I think it's running out of memory because it's using a Hash
> Aggregate and underestimating the number of distinct elements. I would
> set enable_hash_aggregate=false for this query (not for the whole
> server, hash aggregates are quite useful in general).
>
> This is a long-standing problem with hash aggregates. The problem is
> that we don't have a clever way to fall back to a non-hash-aggregate
> if the original estimated memory usage turns out to be way off. Just
> spilling to disk the way hash joins do would be horrendously
> expensive, we think. If we had an approach to doing so it would be
> very useful because simply running out of memory is kind of rude.


Ugh, Ok, thanks for the explanation.

So it looks like I'm running into two separate memory problems. Here's a 
trace from the other one:

cheers

andrew
   TopMemoryContext: 49816 total in 6 blocks; 6840 free (9 chunks); 42976 used
      Operator class cache: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
      Filesystem: 8192 total in 1 blocks; 7872 free (0 chunks); 320 used
      TopTransactionContext: 213901336 total in 10 blocks; 1796344 free (25 chunks); 212104992 used
        Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used
      MessageContext: 73728 total in 4 blocks; 44920 free (15 chunks); 28808 used
      smgr relation table: 8192 total in 1 blocks; 2816 free (0 chunks); 5376 used
      TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
      Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
      PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
        PortalHeapMemory: 1024 total in 1 blocks; 920 free (0 chunks); 104 used
          ExecutorState: 8192 total in 1 blocks; 5880 free (1 chunks); 2312 used
            ExprContext: 24576 total in 2 blocks; 13664 free (8 chunks); 10912 used
      Relcache by OID: 8192 total in 1 blocks; 3376 free (0 chunks); 4816 used
      CacheMemoryContext: 405552 total in 19 blocks; 14296 free (5 chunks); 391256 used
        pg_largeobject_loid_pn_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_index_indrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
        pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_enum_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_class_relname_nsp_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_foreign_server_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_statistic_relid_att_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_cast_source_target_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_language_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_authid_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_amop_fam_strat_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
        pg_index_indexrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_ts_template_tmplname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_map_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
        pg_opclass_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_auth_members_member_role_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_dict_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_conversion_default_index: 1024 total in 1 blocks; 128 free (0 chunks); 896 used
        pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 128 free (0 chunks); 896 used
        pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_enum_typid_label_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_user_mapping_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
        pg_type_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_aggregate_fnoid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_constraint_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_parser_prsname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_config_cfgname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_parser_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_operator_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_namespace_nspname_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_ts_template_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_amop_opr_fam_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_ts_dict_dictname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_auth_members_role_member_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_type_typname_nsp_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_opfamily_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_class_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
        pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
        pg_proc_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_language_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_namespace_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_database_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_amproc_fam_proc_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
        pg_authid_rolname_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
        pg_foreign_server_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_conversion_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
        pg_user_mapping_user_server_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
        pg_conversion_name_nsp_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
      MdSmgr: 8192 total in 1 blocks; 6160 free (0 chunks); 2032 used
      LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
      Timezones: 79320 total in 2 blocks; 5968 free (0 chunks); 73352 used
      ErrorContext: 8192 total in 1 blocks; 8176 free (4 chunks); 16 used
   2012-03-24 10:59:39 EDT ERROR:  out of memory
   2012-03-24 10:59:39 EDT DETAIL:  Failed on request of size 419430400.
   2012-03-24 10:59:39 EDT STATEMENT:  SELECT pg_catalog.lowrite(0,....
 



Re: Odd out of memory problem.

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 03/26/2012 12:11 PM, Tom Lane wrote:
>> That plan should not create a tuple hash table, so I think it's almost
>> certain that the plan changed.  It might be interesting to remove the
>> pg_statistic rows for the table and then see what plan you get.

> Yeah, that gets us:

> Limit  (cost=2223492.78..2223492.81 rows=10 width=8)
>    ->  Sort  (cost=2223492.78..2223493.28 rows=200 width=8)
>          Sort Key: (max(pageno))
>          ->  HashAggregate  (cost=2223485.96..2223488.46 rows=200 width=8)
>                ->  Seq Scan on ldata  (cost=0.00..1651154.64 
> rows=114466264 width=8)

Hm.  This illustrates that it's not too prudent to rely on a default
numdistinct estimate to decide that a hash aggregation is safe :-(.
We had probably better tweak the cost estimation rules to not trust
that.  Maybe, if we have a default estimate, we should take the worst
case estimate that the column might be unique?  That could still burn
us if the rowcount estimate was horribly wrong, but those are not nearly
as shaky as numdistinct estimates ...
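The worst-case rule being suggested here can be sketched in a few lines of Python (a hypothetical helper, not actual planner code; the 48-byte per-entry figure is an assumed hash-entry size, and the function name is made up):

```python
def hashagg_is_safe(ndistinct_est, is_default_est, rowcount_est,
                    entry_bytes, work_mem_bytes):
    # With only a default ndistinct estimate (e.g. 200), assume the
    # worst case: every input row forms its own group.
    groups = rowcount_est if is_default_est else ndistinct_est
    return groups * entry_bytes <= work_mem_bytes

# With this thread's numbers (~114M rows, work_mem = 25MB), the default
# estimate of 200 groups looks trivially safe, but the worst case does not:
assert hashagg_is_safe(200, False, 114466264, 48, 25 * 1024 * 1024)
assert not hashagg_is_safe(200, True, 114466264, 48, 25 * 1024 * 1024)
```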

>> [ scratches head... ]  I don't understand how or why pg_restore would be
>> executing such a query.

> It's not. I was explaining that we have seen memory failures in *other* 
> contexts, not just this query. The restore fails after many hours on a 
> call to lo_write().

Seems probably unrelated then.  Have you got a memory-usage dump for
that case?
        regards, tom lane


Re: Odd out of memory problem.

From
Greg Stark
Date:
On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>        Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381
>   chunks); 750329768 used

I think you'll have to catch Heikki's attention to get a good answer to this.

Is it possible this job is inserting and then updating (or deleting)
the row it just inserted and doing a large number of such
insert/update operations all within the same transaction? Or perhaps
it's updating the same row over and over again?

--
greg


Re: Odd out of memory problem.

From
Simon Riggs
Date:
On Mon, Mar 26, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hm.  This illustrates that it's not too prudent to rely on a default
> numdistinct estimate to decide that a hash aggregation is safe :-(.
> We had probably better tweak the cost estimation rules to not trust
> that.  Maybe, if we have a default estimate, we should take the worst
> case estimate that the column might be unique?  That could still burn
> us if the rowcount estimate was horribly wrong, but those are not nearly
> as shaky as numdistinct estimates ...

The selectivity API should include some way of indicating the accuracy
of the answer, as well as the answer itself.

That way we could respond better in a wide range of circumstances.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Odd out of memory problem.

From
Heikki Linnakangas
Date:
On 26.03.2012 19:51, Greg Stark wrote:
> On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>>         Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381
>>    chunks); 750329768 used
>
> I think you'll have to catch Heikki's attention to get a good answer to this.
>
> Is it possible this job is inserting and then updating (or deleting)
> the row it just inserted and doing a large number of such
> insert/update operations all within the same transaction? Or perhaps
> it's updating the same row over and over again?

.. and all that in different subtransactions.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Odd out of memory problem.

From
Greg Stark
Date:
On Mon, Mar 26, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  This illustrates that it's not too prudent to rely on a default
> numdistinct estimate to decide that a hash aggregation is safe :-(.
> We had probably better tweak the cost estimation rules to not trust
> that.  Maybe, if we have a default estimate, we should take the worst
> case estimate that the column might be unique?  That could still burn
> us if the rowcount estimate was horribly wrong, but those are not nearly
> as shaky as numdistinct estimates ...

Perhaps we should have two work_mem settings -- one for the target to
aim for and one for a hard(er) limit that we should ensure the worst
case falls under?

I have a sketch for how to handle spilling hash aggregates to disk in
my head. I'm not sure if it's worth the amount of complexity it would
require but I'll poke around a bit and see if it works out well.

--
greg


Re: Odd out of memory problem.

From
Heikki Linnakangas
Date:
On 26.03.2012 19:59, Heikki Linnakangas wrote:
> On 26.03.2012 19:51, Greg Stark wrote:
>> On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan<andrew@dunslane.net>
>> wrote:
>>> Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381
>>> chunks); 750329768 used
>>
>> I think you'll have to catch Heikki's attention to get a good answer
>> to this.
>>
>> Is it possible this job is inserting and then updating (or deleting)
>> the row it just inserted and doing a large number of such
>> insert/update operations all within the same transaction? Or perhaps
>> it's updating the same row over and over again?
>
> .. and all that in different subtransactions.

sorry, scratch that, they don't need to be in different subtransactions.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Odd out of memory problem.

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I have a sketch for how to handle spilling hash aggregates to disk in
> my head. I'm not sure if it's worth the amount of complexity it would
> require but I'll poke around a bit and see if it works out well.

It'd be awfully nice if those could spill to disk.  I think that
currently that's the only plan type where a misestimate can lead to
hard failure rather than just slower-than-you'd-like.  Which is not
nice considering that the estimates are necessarily just estimates.

Could you give us a brain dump on the sketch?  I've never seen how to
do it without unreasonable overhead.
        regards, tom lane


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 01:06 PM, Heikki Linnakangas wrote:
> On 26.03.2012 19:59, Heikki Linnakangas wrote:
>> On 26.03.2012 19:51, Greg Stark wrote:
>>> On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan<andrew@dunslane.net>
>>> wrote:
>>>> Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381
>>>> chunks); 750329768 used
>>>
>>> I think you'll have to catch Heikki's attention to get a good answer
>>> to this.
>>>
>>> Is it possible this job is inserting and then updating (or deleting)
>>> the row it just inserted and doing a large number of such
>>> insert/update operations all within the same transaction? Or perhaps
>>> it's updating the same row over and over again?
>>
>> .. and all that in different subtransactions.
>
> sorry, scratch that, they don't need to be in different subtransactions.


It's all in a single transaction. In fact the solution I'm currently 
testing, which seems to be working, involves breaking it up into batches 
of a few thousand LOs restored per batch.

cheers

andrew


Re: Odd out of memory problem.

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 03/26/2012 01:06 PM, Heikki Linnakangas wrote:
>>> Is it possible this job is inserting and then updating (or deleting)
>>> the row it just inserted and doing a large number of such
>>> insert/update operations all within the same transaction? Or perhaps
>>> it's updating the same row over and over again?

> It's all in a single transaction. In fact the solution I'm currently 
> testing, which seems to be working, involves breaking it up into batches 
> of a few thousand LOs restored per batch.

Hm.  The test case is just a straight pg_restore of lots and lots of LOs?
What pg_dump version was the dump made with?
        regards, tom lane


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 01:34 PM, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> On 03/26/2012 01:06 PM, Heikki Linnakangas wrote:
>>>> Is it possible this job is inserting and then updating (or deleting)
>>>> the row it just inserted and doing a large number of such
>>>> insert/update operations all within the same transaction? Or perhaps
>>>> it's updating the same row over and over again?
>> It's all in a single transaction. In fact the solution I'm currently
>> testing, which seems to be working, involves breaking it up into batches
>> of a few thousand LOs restored per batch.
> Hm.  The test case is just a straight pg_restore of lots and lots of LOs?
> What pg_dump version was the dump made with?
>
>             


8.4.8, same as the target. We get the same issue whether we restore 
direct to the database from pg_restore or via a text dump.

cheers

andrew



Re: Odd out of memory problem.

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 03/26/2012 01:34 PM, Tom Lane wrote:
>> Hm.  The test case is just a straight pg_restore of lots and lots of LOs?
>> What pg_dump version was the dump made with?

> 8.4.8, same as the target. We get the same issue whether we restore 
> direct to the database from pg_restore or via a text dump.

I believe I see the issue: when creating/loading LOs, we first do a
lo_create (which in 8.4 makes a "page zero" tuple in pg_largeobject
containing zero bytes of data) and then lo_write, which will do a
heap_update to overwrite that tuple with data.  This is at the next
command in the same transaction, so the original tuple has to receive a
combo CID.  Net result: we accumulate one new combo CID per large object
loaded in the same transaction.  You can reproduce this without any
pg_dump involvement at all, using something like

create table mylos (id oid);
insert into mylos select lo_import('/tmp/junk') from generate_series(1,1000000);

The problem is gone in 9.0 and up because now we use a
pg_largeobject_metadata entry instead of a pg_largeobject row to flag
the existence of an empty large object.  I don't see any very practical
backend fix for the problem in 8.x.
        regards, tom lane


Re: Odd out of memory problem.

From
Greg Stark
Date:
On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> I have a sketch for how to handle spilling hash aggregates to disk in
>> my head. I'm not sure if it's worth the amount of complexity it would
>> require but I'll poke around a bit and see if it works out well.
>
> It'd be awfully nice if those could spill to disk.  I think that
> currently that's the only plan type where a misestimate can lead to
> hard failure rather than just slower-than-you'd-like.  Which is not
> nice considering that the estimates are necessarily just estimates.
>
> Could you give us a brain dump on the sketch?  I've never seen how to
> do it without unreasonable overhead.

Hm. So my original plan was dependent on adding the state-merge
function we've talked about in the past. Not all aggregate functions
necessarily can support such a function but I think all or nearly all
the builtin aggregates can. Certainly min,max, count, sum, avg,
stddev, array_agg can which are most of what people do. That would be
a function which can take two state variables and produce a new state
variable.

If you have this then you can sort and spill the hash table and start
a new hash table and keep going. When you're done you merge the hash
tables using something like heap merge but applying the state merge
function.
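That spill-and-merge scheme can be sketched in plain Python (an illustration only, with invented names; a real executor would spill runs to tapes rather than keeping lists in memory, and the state-merge function here is just a callback):

```python
import heapq

def hash_aggregate_with_spill(tuples, init, transfn, mergefn, limit):
    # Build a hash table of per-group transition states until it holds
    # `limit` groups; when a new group would exceed that, sort the table
    # by key and spill it as a run, then start a fresh table.
    runs = []
    table = {}
    for key, value in tuples:
        if key not in table and len(table) >= limit:
            runs.append(sorted(table.items()))  # spill a sorted run
            table = {}
        table[key] = transfn(table.get(key, init()), value)
    if table:
        runs.append(sorted(table.items()))
    # Heap-merge the sorted runs; equal keys arrive adjacently, so their
    # partial states can be combined with the state-merge function.
    result = {}
    for key, state in heapq.merge(*runs):
        result[key] = mergefn(result[key], state) if key in result else state
    return result
```

For example, summing per key with a table limited to two groups still yields the same totals as a single in-memory pass.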

However, now that I've started thinking about it further, I think you
could solve it with less complexity by cheating in various ways. For
example, if you limit the hash size to 1/2 of work_mem, then when you
reach that limit you could just stuff any tuple that doesn't match a
hash entry into a tuplesort with the other 1/2 of work_mem and do the
regular level-break logic on the output of that.

Or if you can count on tuplesort to be a stable sort (and I think it
might be already) then you could just spill the sorted hash entries
and then switch to doing a tapesort for the rest. When you finish
merging you can read them back and do level break logic like normal.
If there is a partially computed state it will be the first one in the
equal group. If you can't count on tuplesort to be stable you could
dump the partially computed states to a different tape and do a merge
between it and the sorted output of the main data set.

The holy grail of this kind of merging of the two algorithms would be
something that keeps the hash table going and maintains an lru of hash
entries. When it grows too large it would spill the oldest partial
state. Then it would sort those states and merge them (possibly in a
single step).  That might be too complex to pull its weight given some
of the above sketches are probably simple enough.

--
greg


Re: Odd out of memory problem.

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could you give us a brain dump on the sketch?  I've never seen how to
>> do it without unreasonable overhead.

> Hm. So my original plan was dependent on adding the state-merge
> function we've talked about in the past. Not all aggregate functions
> necessarily can support such a function but I think all or nearly all
> the builtin aggregates can. Certainly min,max, count, sum, avg,
> stddev, array_agg can which are most of what people do. That would be
> a function which can take two state variables and produce a new state
> variable.

I'd rather not invent new requirements for aggregate implementations
if we can avoid it.

> However, now that I've started thinking about it further, I think you
> could solve it with less complexity by cheating in various ways. For
> example, if you limit the hash size to 1/2 of work_mem, then when you
> reach that limit you could just stuff any tuple that doesn't match a
> hash entry into a tuplesort with the other 1/2 of work_mem and do the
> regular level-break logic on the output of that.

Or just start dumping such tuples into a tuplestore, while continuing to
process tuples that match the hashagg entries that are already in
existence.  Once the input is exhausted, read out the hashagg entries we
have, flush the hashagg table, start reading from the tuplestore.
Repeat as needed.
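The multi-pass scheme described above can be sketched in Python (illustrative only; the names are invented and a plain list stands in for the tuplestore). Note that each group still receives its tuples in arrival order, since the spill preserves input order:

```python
def hash_aggregate_multipass(tuples, init, transfn, max_groups):
    # Aggregate with at most `max_groups` hash entries per pass.  Once the
    # table is full, tuples belonging to groups not already in the table
    # are dumped to a "tuplestore" (a plain list here) and are aggregated
    # on a later pass over the spilled tuples.
    results = {}
    pending = list(tuples)
    while pending:
        table, spilled = {}, []
        for key, value in pending:
            if key in table or len(table) < max_groups:
                table[key] = transfn(table.get(key, init()), value)
            else:
                spilled.append((key, value))
        results.update(table)  # these groups have seen all their tuples
        pending = spilled
    return results
```

With a two-entry table, three distinct keys take two passes but produce the same result as an unconstrained hash aggregation.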

I like this idea because the only thing you give up is predictability of
the order of output of aggregated entries, which is something that a
hashagg isn't guaranteeing anyway.  In particular, we would still have a
guarantee that any one aggregate evaluation processes the matching
tuples in arrival order, which is critical for some aggregates.

The main problem I can see is that if we start to flush after work_mem
is X% full, we're essentially hoping that the state values for the
existing aggregates won't grow by more than 1-X%, which is safe for many
common aggregates but fails for some like array_agg().  Ultimately, for
ones like that, it'd probably be best to never consider hashing at all.
I guess we could invent an "unsafe for hash aggregation" flag for
aggregates that have unbounded state-size requirements.
        regards, tom lane


Re: Odd out of memory problem.

From
Hitoshi Harada
Date:
On Mon, Mar 26, 2012 at 5:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Could you give us a brain dump on the sketch?  I've never seen how to
>>> do it without unreasonable overhead.
>
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function but I think all or nearly all
>> the builtin aggregates can. Certainly min,max, count, sum, avg,
>> stddev, array_agg can which are most of what people do. That would be
>> a function which can take two state variables and produce a new state
>> variable.
>
> I'd rather not invent new requirements for aggregate implementations
> if we can avoid it.
>
>> However, now that I've started thinking about it further, I think you
>> could solve it with less complexity by cheating in various ways. For
>> example, if you limit the hash size to 1/2 of work_mem, then when you
>> reach that limit you could just stuff any tuple that doesn't match a
>> hash entry into a tuplesort with the other 1/2 of work_mem and do the
>> regular level-break logic on the output of that.
>
> Or just start dumping such tuples into a tuplestore, while continuing to
> process tuples that match the hashagg entries that are already in
> existence.  Once the input is exhausted, read out the hashagg entries we
> have, flush the hashagg table, start reading from the tuplestore.
> Repeat as needed.
>
> I like this idea because the only thing you give up is predictability of
> the order of output of aggregated entries, which is something that a
> hashagg isn't guaranteeing anyway.  In particular, we would still have a
> guarantee that any one aggregate evaluation processes the matching
> tuples in arrival order, which is critical for some aggregates.
>
> The main problem I can see is that if we start to flush after work_mem
> is X% full, we're essentially hoping that the state values for the
> existing aggregates won't grow by more than 1-X%, which is safe for many
> common aggregates but fails for some like array_agg().  Ultimately, for
> ones like that, it'd probably be best to never consider hashing at all.
> I guess we could invent an "unsafe for hash aggregation" flag for
> aggregates that have unbounded state-size requirements.
>

From what I've learned in the last couple of months, array_agg is not
ready for fallback strategies like dumping to a tuplestore.  Even a
state-merge function won't work for it.  The problem is that the
executor doesn't know how to serialize/deserialize its transition
value, which is of type internal.  So in one implementation, the
existence of a merge function serves as a flag to switch back to sort
grouping rather than hash aggregation, and array_agg is one such
aggregate function.  That said, if you invent a new flag to mark an
aggregate as not dump-ready, it'd be worth adding a state-merge
function to the aggregate infrastructure anyway.

So a way I can imagine without a state-merge function or dumping to a
tuplestore would be to sort the hash table contents and the rest of
the input so that we can switch to sort grouping.  Since we have the
hash table in memory, we can certainly sort its entries there, and we
can put everything that arrives after the fallback point on disk and
read it back after the scan finishes.  With sorted state values and
sorted input, we can continue the rest of the work.
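A minimal sketch of that fallback in Python (hypothetical names; a list stands in for the disk spill, and Python's stable sort stands in for a tape sort that preserves per-group arrival order):

```python
def hashagg_fallback_to_sort(tuples, init, transfn, max_groups):
    # Phase 1: hash aggregation until the table reaches `max_groups`.
    table = {}
    overflow = []
    it = iter(tuples)
    for key, value in it:
        if key in table or len(table) < max_groups:
            table[key] = transfn(table.get(key, init()), value)
        else:
            # Fallback point: this tuple and everything after it would
            # go to disk; here they simply go to a list.
            overflow.append((key, value))
            overflow.extend(it)
    # Phase 2: sort the partial states in memory, sort the spilled input
    # by key only (stable, so arrival order within a group is kept), and
    # finish with sort-style grouping, seeding each group with its
    # partial state where one exists.
    result = dict(sorted(table.items()))
    for key, value in sorted(overflow, key=lambda t: t[0]):
        result[key] = transfn(result.get(key, init()), value)
    return result
```

Sorting the overflow by key alone matters: sorting whole (key, value) pairs could reorder a group's tuples, which would break order-sensitive aggregates.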

Anyway, the memory usage problem is not limited to array_agg and hash
aggregation.  Even if you can tell that the hash table exceeds X% of
work_mem, how can you tell that other operators, such as Sort, are not
using the rest of the memory?  One approach I could see would be to
assign an arbitrary share of work_mem to each operator and account for
it locally, but that is going to skew occasionally and is not perfect
either.


Thanks,
--
Hitoshi Harada


Re: Odd out of memory problem.

From
Robert Haas
Date:
On Tue, Mar 27, 2012 at 3:22 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> From what I've learned in the last couple of months, array_agg is not
> ready for fallback strategies like dumping to a tuplestore.  Even a
> state-merge function won't work for it.  The problem is that the
> executor doesn't know how to serialize/deserialize its transition
> value, which is of type internal.  So in one implementation, the
> existence of a merge function serves as a flag to switch back to sort
> grouping rather than hash aggregation, and array_agg is one such
> aggregate function.  That said, if you invent a new flag to mark an
> aggregate as not dump-ready, it'd be worth adding a state-merge
> function to the aggregate infrastructure anyway.
>
> So a way I can imagine without a state-merge function or dumping to a
> tuplestore would be to sort the hash table contents and the rest of
> the input so that we can switch to sort grouping.  Since we have the
> hash table in memory, we can certainly sort its entries there, and we
> can put everything that arrives after the fallback point on disk and
> read it back after the scan finishes.  With sorted state values and
> sorted input, we can continue the rest of the work.

It's a little bit tricky to make this work - you have to get all of
the values out of the hash-table you've built and stick them into a
Tuplesort object - but I think it can be made to work, and it seems
more elegant than anything else proposed so far.

I also agree with you and with Greg Stark that it would be good to
invent a state-merge function.  Although it wouldn't apply to every
case, it would make some very common cases a lot more efficient, both
in run time and in memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Odd out of memory problem.

From
Andrew Dunstan
Date:

On 03/26/2012 01:54 PM, Andrew Dunstan wrote:
>
>
> On 03/26/2012 01:34 PM, Tom Lane wrote:
>> Andrew Dunstan<andrew@dunslane.net>  writes:
>>> On 03/26/2012 01:06 PM, Heikki Linnakangas wrote:
>>>>> Is it possible this job is inserting and then updating (or deleting)
>>>>> the row it just inserted and doing a large number of such
>>>>> insert/update operations all within the same transaction? Or perhaps
>>>>> it's updating the same row over and over again?
>>> It's all in a single transaction. In fact the solution I'm currently
>>> testing, which seems to be working, involves breaking it up into batches
>>> of a few thousand LOs restored per batch.
>> Hm.  The test case is just a straight pg_restore of lots and lots of 
>> LOs?
>> What pg_dump version was the dump made with?
>>
>>
>
>
> 8.4.8, same as the target. We get the same issue whether we restore 
> direct to the database from pg_restore or via a text dump.



Following this up, the workaround of making small batches of LOs did 
solve that memory issue. Here's what I did:
   pg_restore --list full_export.dmp | grep BLOB > bloblist
   pg_restore --use-list=bloblist full_export.dmp | \
       perl -n -e ' $n++ if /lo_open/; ' \
            -e ' print " end; begin;\n" if (/lo_open/ && ($n % 1000 == 0)); ' \
            -e ' print ; ' | \
       psql -t -q -v ON_ERROR_STOP "dbname=adtest" > /dev/null


That's a fairly ugly hack to have to use.


cheers

andrew


Re: Odd out of memory problem.

From
Peter Eisentraut
Date:
On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote:
> Hm. So my original plan was dependent on adding the state-merge
> function we've talked about in the past. Not all aggregate functions
> necessarily can support such a function but I think all or nearly all
> the builtin aggregates can. Certainly min,max, count, sum, avg,
> stddev, array_agg can which are most of what people do. That would be
> a function which can take two state variables and produce a new state
> variable.

This information could also be useful to have in PL/Proxy (or similar
FDWs) to be able to integrate aggregate computation into the language.
Currently, you always have to do the state merging yourself.




Re: Odd out of memory problem.

From
Hitoshi Harada
Date:
On Thu, Mar 29, 2012 at 7:38 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote:
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function but I think all or nearly all
>> the builtin aggregates can. Certainly min,max, count, sum, avg,
>> stddev, array_agg can which are most of what people do. That would be
>> a function which can take two state variables and produce a new state
>> variable.
>
> This information could also be useful to have in PL/Proxy (or similar
> FDWs) to be able to integrate aggregate computation into the language.
> Currently, you always have to do the state merging yourself.
>

I don't know exactly how PL/Proxy or pgpool accomplishes multi-phase
aggregation, but in theory the proposal above is a state-merge
function, so it doesn't apply to aggregate results that have already
passed through the final function.  Of course, for aggregates that
have no final function, it is fine to call the state-merge function
on the results.


Thanks,
-- 
Hitoshi Harada


Re: Odd out of memory problem.

From
Peter Eisentraut
Date:
On fre, 2012-03-30 at 22:59 -0700, Hitoshi Harada wrote:
> I don't know exactly how PL/Proxy or pgpool accomplishes multi-phase
> aggregation,

They don't.

> but in theory the proposal above is a state-merge function, so it
> doesn't apply to aggregate results that have already passed through
> the final function.  Of course, for aggregates that have no final
> function, it is fine to call the state-merge function on the results.

You're right, it's not quite the same thing.  But perhaps it could be
kept in mind if someone wants to develop things in this area.