Both $SUBJECT functions pass to hash_create() an expected hash table size of
10000 * attstattarget. Based on header comments, this represents a near-worst
case. These typanalyze functions scan the hash tables sequentially, thereby
visiting the entire allocation. Per the recommendation in comments at
hash_create(), we should be more conservative. On my system, this tiny test
case runs in 2.8s and dirties 1.0 GiB of local memory:
CREATE TEMP TABLE t AS SELECT '{}'::int[];
SET default_statistics_target = 10000;
ANALYZE t;
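
Roughly, the pattern looks like this (a simplified sketch, not the actual
ts_typanalyze()/array_typanalyze() code; the TrackItem layout, the table name,
and the pass-by-value key are assumptions for illustration). The oversized
nelem hint makes hash_create() lay out the table for tens of millions of
entries, and the final hash_seq_search() pass then touches every bucket,
occupied or not:

#include "postgres.h"

#include "utils/hsearch.h"

/* hypothetical per-element counter entry */
typedef struct TrackItem
{
	Datum		key;			/* element value, assumed pass-by-value */
	int			frequency;
} TrackItem;

static void
sketch_current_behavior(int attstattarget)
{
	HASHCTL		hash_ctl;
	HTAB	   *elements_tab;
	HASH_SEQ_STATUS scan_status;
	TrackItem  *item;
	long		nvisited = 0;

	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
	hash_ctl.keysize = sizeof(Datum);
	hash_ctl.entrysize = sizeof(TrackItem);
	hash_ctl.hash = tag_hash;

	/* near-worst-case hint: dynahash sizes itself for all of these up front */
	elements_tab = hash_create("Analyzed elements table",
							   10000 * attstattarget,
							   &hash_ctl,
							   HASH_ELEM | HASH_FUNCTION);

	/* ... the per-row loop would hash_search(..., HASH_ENTER) elements here ... */

	/* final pass: visits every bucket, whether or not it ever held an entry */
	hash_seq_init(&scan_status, elements_tab);
	while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
		nvisited++;

	(void) nvisited;
	hash_destroy(elements_tab);
}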
Rather arbitrarily, I reduced the hash_create() size hint by 99.9%, to the
width of the histograms destined for pg_statistic. This streamlined the test
case to <20ms runtime and 2 MiB of memory.
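
If the final MCELEM list holds about 10 * attstattarget entries (which is what
a 99.9% cut from the old hint works out to), only the nelem argument in the
sketch above needs to change; dynahash still expands the table as entries
arrive when the column turns out to hold many more distinct elements, which is
the growth path the larger test below exercises:

	/* conservative hint: about as many entries as pg_statistic will keep */
	int			num_mcelem = 10 * attstattarget;	/* assumed MCELEM width */

	elements_tab = hash_create("Analyzed elements table",
							   num_mcelem,	/* dynahash grows past this on demand */
							   &hash_ctl,
							   HASH_ELEM | HASH_FUNCTION);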
To verify that nothing awful happens when the hash table sees considerable
dynamic growth, I used a subject table whose 3M rows each contribute three
distinct elements, entailing a 9M-element hash table at ANALYZE time:
CREATE UNLOGGED TABLE t AS SELECT array[n,3000000+n,6000000+n]
FROM generate_series(1,3000000) t(n);
Unpatched master takes 15s and dirties 2.1 GiB; patched takes 15s and dirties
1.2 GiB. The timing noise overlapped any systematic difference, but the
patched version might have been around 500ms slower. Based on that, I'm
comfortable trusting that improving smaller cases in this way will not greatly
harm larger cases.
The lack of field complaints about ts_typanalyze() resource usage does argue
against the need for a change here, but I think adding array_typanalyze() in
PostgreSQL 9.2 significantly increases our risk exposure. Sites may have
cranked up the statistics target on array columns to compensate for the lack
of explicit statistical support. Every cluster has several array columns in
the system catalogs.
The size hint I chose is fairly arbitrary. Any suggestions for principled
alternatives?
Thanks,
nm