Thread: Re: vacuumdb changes for stats import/export

Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> [v2]

I started looking just at 0001, and it seems like a fairly
straightforward rearrangement. I found this comment quite hard to
read:

+ * 'found_objs' should be a fully qualified list of objects to process, as
+ * returned by a previous call to vacuum_one_database().  If *found_objs is
+ * NULL, it is set to the results of the catalog query discussed below.  If
+ * found_objs is NULL, the results of the catalog query are not returned.
+ *
+ * If *found_objs is NULL, this function performs a catalog query to retrieve
+ * the list of tables to process.  When 'objects' is NULL, all tables in the

I had to read it several times before I noticed the difference between
"* found_objs" and "*found_objs". Maybe some extra spacing and breaks
would help, or other reorganization.

--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:
> On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> [v2]
> 
> I started looking just at 0001, and it seems like a fairly
> straightforward rearrangement.

Thanks for taking a look.

> I found this comment quite hard to read:
> 
> + * 'found_objs' should be a fully qualified list of objects to process, as
> + * returned by a previous call to vacuum_one_database().  If *found_objs is
> + * NULL, it is set to the results of the catalog query discussed below.  If
> + * found_objs is NULL, the results of the catalog query are not returned.
> + *
> + * If *found_objs is NULL, this function performs a catalog query to retrieve
> + * the list of tables to process.  When 'objects' is NULL, all tables in the
> 
> I had to read it several times before I noticed the difference between
> "* found_objs" and "*found_objs". Maybe some extra spacing and breaks
> would help, or other reorganization.

Yeah, it's pretty atrocious.  I think the main problem is that the
interface is just too complicated, so I'll take a step back and see if I
can make it more understandable to humans.  In the meantime, here's an
attempt at adjusting the comment:

 * found_objs is a double pointer to a fully qualified list of objects to
 * process, as returned by a previous call to vacuum_one_database().  If
 * *found_objs (the once-dereferenced double pointer) is NULL, it is set to the
 * results of the catalog query discussed below.  If found_objs (the double
 * pointer) is NULL, the results of the catalog query are not returned.
 *
 * If *found_objs (the once-dereferenced double pointer) is NULL, this function
 * performs a catalog query to retrieve the list of tables to process.  When
 * "objects" is NULL, all tables in the database are processed.  Otherwise, the
 * catalog query performs a lookup for the objects listed in "objects".

-- 
nathan



Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:

> > I had to read it several times before I noticed the difference between
> > "* found_objs" and "*found_objs". Maybe some extra spacing and breaks
> > would help, or other reorganization.
>
> Yeah, it's pretty atrocious.  I think the main problem is that the
> interface is just too complicated, so I'll take a step back and see if I
> can make it more understandable to humans.

The interface is awkward, but on the other hand only a small part has
to really know about it. It's worth trying to make it more readable if
you can.

> In the meantime, here's an
> attempt at adjusting the comment:

That's better, and if we end up with this interface, we'll want quotes
around the names so the eye can tell where the "*" belong.

--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote:
> On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:
>> > I had to read it several times before I noticed the difference between
>> > "* found_objs" and "*found_objs". Maybe some extra spacing and breaks
>> > would help, or other reorganization.
>>
>> Yeah, it's pretty atrocious.  I think the main problem is that the
>> interface is just too complicated, so I'll take a step back and see if I
>> can make it more understandable to humans.
> 
> The interface is awkward, but on the other hand only a small part has
> to really know about it. It's worth trying to make it more readable if
> you can.

True.  One small thing we could do is to require "found_objs" (the double
pointer) to always be non-NULL, but that just compels some callers to
provide otherwise-unused variables.  I've left the interface alone for now.

>> In the meantime, here's an attempt at adjusting the comment:
> 
> That's better, and if we end up with this interface, we'll want quotes
> around the names so the eye can tell where the "*" belong.

I did that in v3.  I also tried to break up this comment into bullet points
for better separation and logical flow.

-- 
nathan

Attachment

Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote:
> True.  One small thing we could do is to require "found_objs" (the double
> pointer) to always be non-NULL, but that just compels some callers to
> provide otherwise-unused variables.  I've left the interface alone for now.

One thing to consider is that a pointer named "dummy" is self-documenting.

> > That's better, and if we end up with this interface, we'll want quotes
> > around the names so the eye can tell where the "*" belong.
>
> I did that in v3.  I also tried to break up this comment into bullet points
> for better separation and logical flow.

That's much easier to follow, thanks.

--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote:
> On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I did that in v3.  I also tried to break up this comment into bullet points
>> for better separation and logical flow.
> 
> That's much easier to follow, thanks.

Thanks for looking.  I'll probably commit 0001 sooner than later so that we
can focus our attention on 0002.

-- 
nathan



Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Wed, Mar 5, 2025 at 12:13 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote:
> > On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I did that in v3.  I also tried to break up this comment into bullet points
> >> for better separation and logical flow.
> >
> > That's much easier to follow, thanks.
>
> Thanks for looking.  I'll probably commit 0001 sooner than later so that we
> can focus our attention on 0002.

+1

On 0002:

+        This option can only be used in conjunction with
+        <option>--analyze-only</option> and
<option>--analyze-in-stages</option>.

+ /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */
+ if (vacopts.missing_only && !vacopts.analyze_only)
+ pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"",
+ "missing-only", "analyze-only", "analyze-in-stages");

The first is slightly ambiguous, so maybe "or" is better throughout.

+ " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n"

Looking elsewhere in this file, I think we prefer something like
"(c.relkind OPERATOR(pg_catalog.=) ANY (array["
  CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
  CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n"

+ " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
+ " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
+ " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
+ " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");

IIUC correctly, pg_statistic doesn't store stats on itself, so this
causes the query result to always contain pg_statistic -- does that
get removed elsewhere?

--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote:
> +        This option can only be used in conjunction with
> +        <option>--analyze-only</option> and
> <option>--analyze-in-stages</option>.
> 
> + /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */
> + if (vacopts.missing_only && !vacopts.analyze_only)
> + pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"",
> + "missing-only", "analyze-only", "analyze-in-stages");
> 
> The first is slightly ambiguous, so maybe "or" is better throughout.

Agreed.

> + " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n"
> 
> Looking elsewhere in this file, I think we prefer something like
> "(c.relkind OPERATOR(pg_catalog.=) ANY (array["
>   CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
>   CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n"

Fixed.

> + " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
> + " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
> + " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
> + " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
> 
> IIUC correctly, pg_statistic doesn't store stats on itself, so this
> causes the query result to always contain pg_statistic -- does that
> get removed elsewhere?

Good catch.  I think the current behavior is to call ANALYZE on
pg_statistic, too, but that should be mostly harmless (analyze_rel()
refuses to process it).  I suppose we could try to avoid returning
pg_statistic from the catalog query, but we don't bother doing that for any
other vacuumdb modes, so I'm tempted to leave it alone.

-- 
nathan

Attachment

Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Fri, Mar 7, 2025 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote:

> > IIUC correctly, pg_statistic doesn't store stats on itself, so this
> > causes the query result to always contain pg_statistic -- does that
> > get removed elsewhere?
>
> Good catch.  I think the current behavior is to call ANALYZE on
> pg_statistic, too, but that should be mostly harmless (analyze_rel()
> refuses to process it).  I suppose we could try to avoid returning
> pg_statistic from the catalog query, but we don't bother doing that for any
> other vacuumdb modes, so I'm tempted to leave it alone.

Okay, thanks for confirming. I have no further comments.

--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
> I have no further comments.

Thanks.  I'll give this a little more time for review before committing.

We'll still need to update the recommendation in pg_upgrade's
documentation.  I'm going to keep that separate because the stats
import/export work is still settling.

-- 
nathan



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote:
> On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
>> I have no further comments.
> 
> Thanks.  I'll give this a little more time for review before committing.

I realized that we could limit the catalog query reuse to only when
--missing-only is specified, so I've updated 0001 and 0002 accordingly.
This avoids changing any existing behavior.

> We'll still need to update the recommendation in pg_upgrade's
> documentation.  I'm going to keep that separate because the stats
> import/export work is still settling.

0003 is a first attempt at this.  Unless I am missing something, there's
really not much to update.

-- 
nathan

Attachment

Re: vacuumdb changes for stats import/export

From
John Naylor
Date:
On Wed, Mar 12, 2025 at 12:00 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote:
> > On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
> >> I have no further comments.
> >
> > Thanks.  I'll give this a little more time for review before committing.
>
> I realized that we could limit the catalog query reuse to only when
> --missing-only is specified, so I've updated 0001 and 0002 accordingly.
> This avoids changing any existing behavior.
>
> > We'll still need to update the recommendation in pg_upgrade's
> > documentation.  I'm going to keep that separate because the stats
> > import/export work is still settling.
>
> 0003 is a first attempt at this.  Unless I am missing something, there's
> really not much to update.

The change here seems fine. My only quibble is that this sentence now
seems out of place: "Option --analyze-in-stages can be used to
generate minimal statistics quickly." I'm thinking we should make a
clearer separation for with and without the  --no-statistics option
specified.


--
John Naylor
Amazon Web Services



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Wed, Mar 12, 2025 at 01:56:04PM +0700, John Naylor wrote:
> The change here seems fine. My only quibble is that this sentence now
> seems out of place: "Option --analyze-in-stages can be used to
> generate minimal statistics quickly." I'm thinking we should make a
> clearer separation for with and without the  --no-statistics option
> specified.

I think the recommendation stays the same regardless of whether
--no-statistics was used.  --missing-only should do the right think either
way.  But I do agree that this paragraph feels a bit haphazard.  I modified
it to call out the full command for both --analyze-only and
--analyze-in-stages, and I moved the note about --jobs to its own sentence
and made it clear that it is applicable to both of the suggested vacuumdb
commands.

-- 
nathan

Attachment

Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
Out of curiosity, I generated many relations with the following command
(stolen from [0]):

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x',
                 i);
  if i % 100 = 0 then commit; end if;
end loop;
end
$$;

And then I ran a database-wide ANALYZE.  Without --missing-only, vacuumdb's
catalog query took 65 ms.  With --missing-only, it took 735 ms.  While
that's a big jump, this query will only run once for a given vacuumdb, and
--missing-only is likely to save a lot of time elsewhere.

If no feedback or objections materialize, I'm planning to commit these
early next week.

[0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us

-- 
nathan



Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
On Fri, Mar 14, 2025 at 02:05:30PM -0500, Nathan Bossart wrote:
> If no feedback or objections materialize, I'm planning to commit these
> early next week.

While preparing this for commit, I noticed that the expression index part
of the query was disregarding attstattarget.  To fix, I've modified that
part to look at the index's pg_attribute entries.

-- 
nathan

Attachment

Re: vacuumdb changes for stats import/export

From
Corey Huinker
Date:

While preparing this for commit, I noticed that the expression index part
of the query was disregarding attstattarget.  To fix, I've modified that
part to look at the index's pg_attribute entries.

+1, should have been there all along.

Re: vacuumdb changes for stats import/export

From
Nathan Bossart
Date:
Committed with the following small changes:

* I renamed the new option to --missing-stats-only, which I felt was more
  descriptive.

* I moved the new tests to the main vacuumdb test file and interspersed
  some --analyze-only uses.

* I fixed a couple of other things in the new parts of the catalog query
  that were not fully qualified.

-- 
nathan