Re: Statistics Import and Export - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Statistics Import and Export
Date
Msg-id CADkLM=eM7rqJpArtRrxsVbOn=3GBRUyKtQbC_TxtFe010QLR8Q@mail.gmail.com
Whole thread Raw
In response to Re: Statistics Import and Export  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Statistics Import and Export
List pgsql-hackers
Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

It's more than I can memorize, so that feels like a lot to me. Clearly it's not insurmountable.

 
> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

Ok, I'm sold on that plan.
 

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +     const char *param_names[] = {
> +             "relation",
> +             "reltuples",
> +             "relpages",
> +     };
> +
> +     Oid                             relid;
> +     Relation                rel;
> +     HeapTuple               ctup;
> +     Form_pg_class   pgcform;
> +
> +     for (int i = 0; i <= 2; i++)
> +             if (PG_ARGISNULL(i))
> +                     ereport(ERROR,
> +                                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                      errmsg("%s cannot be NULL", param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?

Strict could definitely apply here, and I'm inclined to make it so.

 
Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).

Thoughts on it emitting a WARN or NOTICE before returning false?

 
  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +     pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Yes. No idea why I didn't have that in there from the start.
 
Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

That was how it worked back in v1, more for the possibility that there was no matching JSON to set values.

Looking again at analyze.c (currently lines 1751-1780), we just check if there is a row in the way, and if so we replace it. I don't see where we compare existing values to new values.
 

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

Historical artifact - previous versions had a long explanation of the JSON format.

 

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +     /* names of columns that cannot be null */
> +     const char *required_param_names[] = {
> +             "relation",
> +             "attname",
> +             "stainherit",
> +             "stanullfrac",
> +             "stawidth",
> +             "stadistinct",
> +             "stakind1",
> +             "stakind2",
> +             "stakind3",
> +             "stakind4",
> +             "stakind5",
> +     };

Same comment here as above wrt NULL being passed in.

In this case, the last 10 params (stanumbersN and stavaluesN) can be null, and are NULL more often than not.
 

> +     for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Yes, I had in the past. Not sure why I didn't again. 

pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: btree: downlink right separator/HIKEY optimization
Next
From: Tomas Vondra
Date:
Subject: Re: speed up a logical replica setup