From: Dean Rasheed
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: CAEZATCXfk-Syby-MWLFfN-v=7HFEm-QYyX+tP=NVZ+iNw8ELUw@mail.gmail.com
In response to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Sat, 16 Mar 2019 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> > 28). I just spotted the 1MB limit on the serialised MCV list size. I
> > think this is going to be too limiting. For example, if the stats
> > target is at its maximum of 10000, that only leaves around 100 bytes
> > for each item's values, which is easily exceeded. In fact, I think
> > this approach for limiting the MCV list size isn't a good one --
> > consider what would happen if there were lots of very large values.
> > Would it run out of memory before getting to that test? Even if not,
> > it would likely take an excessive amount of time.
> >
>
> True. I don't have a very good argument for a specific value, or even
> for having an explicit limit at all. I initially added it mostly as a
> safety measure for development purposes, but I think you're right that
> we can just get rid of it. I don't think it'd run out of memory before
> hitting the limit, though I haven't tried very hard (I do recall
> running into the 1MB limit in the past).
>

I've just been playing around a little with this and found that it
isn't safely dealing with toasted values. For example, consider the
following test:

create or replace function random_string(x int) returns text
as $$
  -- Build a random string of length x by concatenating enough
  -- 32-character md5() chunks and truncating to x
  select substr(string_agg(md5(random()::text), ''), 1, x)
    from generate_series(1,(x+31)/32);
$$ language sql;

drop table if exists t;
create table t(a int, b text);
insert into t values (1, random_string(10000000));
create statistics s (mcv) on a,b from t;
analyse t;

select length(b), left(b,5), right(b,5) from t;
select length(stxmcv), length((m.values::text[])[2]),
       left((m.values::text[])[2], 5), right((m.values::text[])[2],5)
  from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
 where stxrelid = 't'::regclass;

The final query returns the following:

 length |  length  | left  | right
--------+----------+-------+-------
    250 | 10000000 | c2667 | 71492
(1 row)

suggesting that there's something odd about the stxmcv value: a
250-byte serialised MCV list that yields a 10-million-character value
looks very much like a stored toast pointer rather than the value
itself. Note, also, that it doesn't hit the 1MB limit, even though the
value is much bigger than that.

If I then delete the value from the table, without changing the stats,
and repeat the final query, it falls over:

delete from t where a=1;
select length(stxmcv), length((m.values::text[])[2]),
       left((m.values::text[])[2], 5), right((m.values::text[])[2],5)
  from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
 where stxrelid = 't'::regclass;

ERROR:  unexpected chunk number 5008 (expected 0) for toast value
16486 in pg_toast_16480

So I suspect the serialised MCV list has kept a toast pointer into
table t's toast table, rather than a detoasted copy of the value,
although I've not tried to investigate further.
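
If that's the case, presumably the fix is to fully detoast each value
before it gets packed into the serialised MCV item, so that what's
stored is always an in-line copy. A minimal sketch of the sort of
thing I mean (the helper name is invented; PG_DETOAST_DATUM is the
standard fmgr.h macro for expanding possibly-toasted varlena datums):

#include "postgres.h"
#include "fmgr.h"

/*
 * Hypothetical helper: ensure a value destined for the MCV list is a
 * plain in-line copy, never an out-of-line toast pointer. Only
 * varlena types (typlen == -1) can be toasted.
 */
static Datum
mcv_detoast_value(Datum value, int16 typlen)
{
    if (typlen == -1)
        value = PointerGetDatum(PG_DETOAST_DATUM(value));

    return value;
}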


> > I think this part of the patch needs a bit of a rethink. My first
> > thought is to do something similar to what happens for per-column
> > MCVs, and set an upper limit on the size of each value that is ever
> > considered for inclusion in the stats (cf. WIDTH_THRESHOLD and
> > toowide_cnt in analyze.c). Over-wide values should be excluded early
> > on, and it will need to track whether or not any such values were
> > excluded, because then it wouldn't be appropriate to treat the stats
> > as complete and keep the entire list, without calling
> > get_mincount_for_mcv_list().
> >
> Which part? Serialization / deserialization? Or how we handle long
> values when building the MCV list?
>

I was thinking (roughly) of something like the following:

* When building the values array for the MCV list, strip out rows with
values wider than some threshold (probably something like the
WIDTH_THRESHOLD = 1024 from analyze.c would be reasonable) -- see the
sketch after this list.

* When building the MCV list, if some over-wide values were previously
stripped out, always go into the get_mincount_for_mcv_list() block,
even if nitems == ngroups (for the same reason a similar thing happens
for per-column stats -- if some items were stripped out, we're already
saying that not all items will go in the MCV list, and it's not safe
to assume that the remaining items are common enough to give accurate
estimates).

* In the serialisation code, remove the size limit entirely. We know
that each value is now at most 1024 bytes, and there are at most 10000
items, and at most 8 columns, so the total size is already reasonably
well bounded. In the worst case, it might be around 80MB (1024 bytes x
10000 items x 8 columns), but in practice it's likely to be much, much
smaller than that.
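
To make that a bit more concrete, here's a rough sketch of the first
two checks. This is not actual patch code: the function and variable
names are guesses based on the surrounding extended-stats code, and
WIDTH_THRESHOLD just mirrors the analyze.c constant:

#include "postgres.h"
#include "access/tuptoaster.h"  /* toast_raw_datum_size() */
#include "commands/vacuum.h"    /* VacAttrStats */

/* Mirrors the per-column stats convention in analyze.c. */
#define WIDTH_THRESHOLD 1024

/*
 * Hypothetical check, applied to each sample row before the MCV
 * groups are built: does the row contain any over-wide value? Rows
 * for which this returns true would be skipped, and toowide_cnt
 * incremented.
 */
static bool
mcv_row_is_toowide(Datum *values, bool *isnull, int nattrs,
                   VacAttrStats **stats)
{
    int     i;

    for (i = 0; i < nattrs; i++)
    {
        /* only varlena values can exceed the threshold */
        if (!isnull[i] &&
            stats[i]->attrtype->typlen == -1 &&
            toast_raw_datum_size(values[i]) > WIDTH_THRESHOLD)
            return true;
    }

    return false;
}

and then, later, when trimming the group list (again, hypothetical
surrounding code):

    /*
     * The list may only be treated as complete if no rows were
     * discarded on width.
     */
    if (nitems == ngroups && toowide_cnt == 0)
    {
        /* complete list: keep everything, no frequency cutoff */
    }
    else
    {
        double      mincount;

        mincount = get_mincount_for_mcv_list(numrows, totalrows);

        /* ... discard groups seen fewer than mincount times ... */
    }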

Regards,
Dean

