Thread: O(N^2) when building multi-column MCV lists

O(N^2) when building multi-column MCV lists

From
Tomas Vondra
Date:
Hi,

When experimenting with multi-column MCV lists with statistic target set
to high value (e.g. 10k), I've realized there's an O(N^2) issue in
statext_mcv_build() when computing base frequencies.

We do this:

    for (i = 0; i < nitems; i++)
    {
        ...
        item->base_frequency = 1.0;
        for (j = 0; j < numattrs; j++)
        {
            int    count = 0;
            int    k;

            for (k = 0; k < ngroups; k++)
            {
                if (multi_sort_compare_dim(j, &groups[i], &groups[k], mss) == 0)
                    count += groups[k].count;
            }

            ...
        }
    }


That is, for each item on the MCV list, we walk through all the groups
(for each dimension independently) to determine the total frequency of
the value.

With many groups (which can easily happen for high statistics target),
this can easily get very expensive.

I think the best solution here is to pre-compute frequencies for values
in all dimensions, and then just access that instead of looping through
the groups over and over.

IMHO this is something we should fix for PG12, so I'll put that on the
open items list, and produce a fix shortly.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: O(N^2) when building multi-column MCV lists

From
Tomas Vondra
Date:
Hi,

Attached is a WIP/PoC fix addressing the O(N^2) behavior in ANALYZE with
high statistic target values. It needs more work, but it's good enough to
show some measurements.

For benchmark, I've created a simple 2-column table, with MCV list on
those two columns:

  CREATE TABLE t (a int, b int);
  CREATE STATISTICS s (mcv) ON a,b FROM t;

and then loaded data sets with different numbers of random combinations,
determined by number of values in each column. For example with 10 values
in a column, you get ~100 combinations.

  INSERT INTO t
  SELECT 10*random(), 10*random() FROM generate_series(1,3e6);

The 3M rows is picked because that's the sample size with target 10000.

The results with different statistic targets look like this:

1) master

    values        100       1000        5000       10000
    ====================================================
        10        103        586        2419        3041
       100        116        789        4778        8934
      1000        116        690        3162      499748

2) patched

    values        100       1000        5000       10000
    ====================================================
        10        113        606        2460        3716
       100        143        711        3371        5231
      1000        156        994        3836        6002

3) comparison (patched / master)

    values        100       1000        5000       10000
    ====================================================
        10       110%       103%        102%        122%
       100       123%        90%         71%         59%
      1000       134%       144%        121%          1%


So clearly, the issue for large statistic targets is resolved (duration
drops from 500s to just 6s), but there is measurable regression for the
other cases. That needs more investigation & fix before commit.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment