Thread: Re: vacuumdb changes for stats import/export
On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > [v2] I started looking just at 0001, and it seems like a fairly straightforward rearrangement. I found this comment quite hard to read: + * 'found_objs' should be a fully qualified list of objects to process, as + * returned by a previous call to vacuum_one_database(). If *found_objs is + * NULL, it is set to the results of the catalog query discussed below. If + * found_objs is NULL, the results of the catalog query are not returned. + * + * If *found_objs is NULL, this function performs a catalog query to retrieve + * the list of tables to process. When 'objects' is NULL, all tables in the I had to read it several times before I noticed the difference between "* found_objs" and "*found_objs". Maybe some extra spacing and breaks would help, or other reorganization. -- John Naylor Amazon Web Services
On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote: > On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> [v2] > > I started looking just at 0001, and it seems like a fairly > straightforward rearrangement. Thanks for taking a look. > I found this comment quite hard to read: > > + * 'found_objs' should be a fully qualified list of objects to process, as > + * returned by a previous call to vacuum_one_database(). If *found_objs is > + * NULL, it is set to the results of the catalog query discussed below. If > + * found_objs is NULL, the results of the catalog query are not returned. > + * > + * If *found_objs is NULL, this function performs a catalog query to retrieve > + * the list of tables to process. When 'objects' is NULL, all tables in the > > I had to read it several times before I noticed the difference between > "* found_objs" and "*found_objs". Maybe some extra spacing and breaks > would help, or other reorganization. Yeah, it's pretty atrocious. I think the main problem is that the interface is just too complicated, so I'll take a step back and see if I can make it more understandable to humans. In the meantime, here's an attempt at adjusting the comment: * found_objs is a double pointer to a fully qualified list of objects to * process, as returned by a previous call to vacuum_one_database(). If * *found_objs (the once-dereferenced double pointer) is NULL, it is set to the * results of the catalog query discussed below. If found_objs (the double * pointer) is NULL, the results of the catalog query are not returned. * * If *found_objs (the once-dereferenced double pointer) is NULL, this function * performs a catalog query to retrieve the list of tables to process. When * "objects" is NULL, all tables in the database are processed. Otherwise, the * catalog query performs a lookup for the objects listed in "objects". -- nathan
On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote: > > I had to read it several times before I noticed the difference between > > "* found_objs" and "*found_objs". Maybe some extra spacing and breaks > > would help, or other reorganization. > > Yeah, it's pretty atrocious. I think the main problem is that the > interface is just too complicated, so I'll take a step back and see if I > can make it more understandable to humans. The interface is awkward, but on the other hand only a small part has to really know about it. It's worth trying to make it more readable if you can. > In the meantime, here's an > attempt at adjusting the comment: That's better, and if we end up with this interface, we'll want quotes around the names so the eye can tell where the "*" belong. -- John Naylor Amazon Web Services
On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote: > On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote: >> > I had to read it several times before I noticed the difference between >> > "* found_objs" and "*found_objs". Maybe some extra spacing and breaks >> > would help, or other reorganization. >> >> Yeah, it's pretty atrocious. I think the main problem is that the >> interface is just too complicated, so I'll take a step back and see if I >> can make it more understandable to humans. > > The interface is awkward, but on the other hand only a small part has > to really know about it. It's worth trying to make it more readable if > you can. True. One small thing we could do is to require "found_objs" (the double pointer) to always be non-NULL, but that just compels some callers to provide otherwise-unused variables. I've left the interface alone for now. >> In the meantime, here's an attempt at adjusting the comment: > > That's better, and if we end up with this interface, we'll want quotes > around the names so the eye can tell where the "*" belong. I did that in v3. I also tried to break up this comment into bullet points for better separation and logical flow. -- nathan
Attachment
On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote: > True. One small thing we could do is to require "found_objs" (the double > pointer) to always be non-NULL, but that just compels some callers to > provide otherwise-unused variables. I've left the interface alone for now. One thing to consider is that a pointer named "dummy" is self-documenting. > > That's better, and if we end up with this interface, we'll want quotes > > around the names so the eye can tell where the "*" belong. > > I did that in v3. I also tried to break up this comment into bullet points > for better separation and logical flow. That's much easier to follow, thanks. -- John Naylor Amazon Web Services
On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote: > On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I did that in v3. I also tried to break up this comment into bullet points >> for better separation and logical flow. > > That's much easier to follow, thanks. Thanks for looking. I'll probably commit 0001 sooner than later so that we can focus our attention on 0002. -- nathan
On Wed, Mar 5, 2025 at 12:13 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote: > > On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> I did that in v3. I also tried to break up this comment into bullet points > >> for better separation and logical flow. > > > > That's much easier to follow, thanks. > > Thanks for looking. I'll probably commit 0001 sooner than later so that we > can focus our attention on 0002. +1 On 0002: + This option can only be used in conjunction with + <option>--analyze-only</option> and <option>--analyze-in-stages</option>. + /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */ + if (vacopts.missing_only && !vacopts.analyze_only) + pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"", + "missing-only", "analyze-only", "analyze-in-stages"); The first is slightly ambiguous, so maybe "or" is better throughout. + " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n" Looking elsewhere in this file, I think we prefer something like "(c.relkind OPERATOR(pg_catalog.=) ANY (array[" CppAsString2(RELKIND_PARTITIONED_TABLE) ", " CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n" + " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n" + " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n" + " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n" + " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n"); IIUC correctly, pg_statistic doesn't store stats on itself, so this causes the query result to always contain pg_statistic -- does that get removed elsewhere? -- John Naylor Amazon Web Services
On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote: > + This option can only be used in conjunction with > + <option>--analyze-only</option> and > <option>--analyze-in-stages</option>. > > + /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */ > + if (vacopts.missing_only && !vacopts.analyze_only) > + pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"", > + "missing-only", "analyze-only", "analyze-in-stages"); > > The first is slightly ambiguous, so maybe "or" is better throughout. Agreed. > + " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n" > > Looking elsewhere in this file, I think we prefer something like > "(c.relkind OPERATOR(pg_catalog.=) ANY (array[" > CppAsString2(RELKIND_PARTITIONED_TABLE) ", " > CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n" Fixed. > + " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n" > + " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n" > + " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n" > + " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n"); > > IIUC correctly, pg_statistic doesn't store stats on itself, so this > causes the query result to always contain pg_statistic -- does that > get removed elsewhere? Good catch. I think the current behavior is to call ANALYZE on pg_statistic, too, but that should be mostly harmless (analyze_rel() refuses to process it). I suppose we could try to avoid returning pg_statistic from the catalog query, but we don't bother doing that for any other vacuumdb modes, so I'm tempted to leave it alone. -- nathan
Attachment
On Fri, Mar 7, 2025 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote: > > IIUC correctly, pg_statistic doesn't store stats on itself, so this > > causes the query result to always contain pg_statistic -- does that > > get removed elsewhere? > > Good catch. I think the current behavior is to call ANALYZE on > pg_statistic, too, but that should be mostly harmless (analyze_rel() > refuses to process it). I suppose we could try to avoid returning > pg_statistic from the catalog query, but we don't bother doing that for any > other vacuumdb modes, so I'm tempted to leave it alone. Okay, thanks for confirming. I have no further comments. -- John Naylor Amazon Web Services
On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote: > I have no further comments. Thanks. I'll give this a little more time for review before committing. We'll still need to update the recommendation in pg_upgrade's documentation. I'm going to keep that separate because the stats import/export work is still settling. -- nathan
On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote: > On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote: >> I have no further comments. > > Thanks. I'll give this a little more time for review before committing. I realized that we could limit the catalog query reuse to only when --missing-only is specified, so I've updated 0001 and 0002 accordingly. This avoids changing any existing behavior. > We'll still need to update the recommendation in pg_upgrade's > documentation. I'm going to keep that separate because the stats > import/export work is still settling. 0003 is a first attempt at this. Unless I am missing something, there's really not much to update. -- nathan
Attachment
On Wed, Mar 12, 2025 at 12:00 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote: > > On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote: > >> I have no further comments. > > > > Thanks. I'll give this a little more time for review before committing. > > I realized that we could limit the catalog query reuse to only when > --missing-only is specified, so I've updated 0001 and 0002 accordingly. > This avoids changing any existing behavior. > > > We'll still need to update the recommendation in pg_upgrade's > > documentation. I'm going to keep that separate because the stats > > import/export work is still settling. > > 0003 is a first attempt at this. Unless I am missing something, there's > really not much to update. The change here seems fine. My only quibble is that this sentence now seems out of place: "Option --analyze-in-stages can be used to generate minimal statistics quickly." I'm thinking we should make a clearer separation for with and without the --no-statistics option specified. -- John Naylor Amazon Web Services
On Wed, Mar 12, 2025 at 01:56:04PM +0700, John Naylor wrote: > The change here seems fine. My only quibble is that this sentence now > seems out of place: "Option --analyze-in-stages can be used to > generate minimal statistics quickly." I'm thinking we should make a > clearer separation for with and without the --no-statistics option > specified. I think the recommendation stays the same regardless of whether --no-statistics was used. --missing-only should do the right think either way. But I do agree that this paragraph feels a bit haphazard. I modified it to call out the full command for both --analyze-only and --analyze-in-stages, and I moved the note about --jobs to its own sentence and made it clear that it is applicable to both of the suggested vacuumdb commands. -- nathan
Attachment
Out of curiosity, I generated many relations with the following command (stolen from [0]): do $$ begin for i in 1..100000 loop execute format('create table t%s (f1 int unique, f2 int unique);', i); execute format('insert into t%s select x, x from generate_series(1,1000) x', i); if i % 100 = 0 then commit; end if; end loop; end $$; And then I ran a database-wide ANALYZE. Without --missing-only, vacuumdb's catalog query took 65 ms. With --missing-only, it took 735 ms. While that's a big jump, this query will only run once for a given vacuumdb, and --missing-only is likely to save a lot of time elsewhere. If no feedback or objections materialize, I'm planning to commit these early next week. [0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us -- nathan
On Fri, Mar 14, 2025 at 02:05:30PM -0500, Nathan Bossart wrote: > If no feedback or objections materialize, I'm planning to commit these > early next week. While preparing this for commit, I noticed that the expression index part of the query was disregarding attstattarget. To fix, I've modified that part to look at the index's pg_attribute entries. -- nathan
Attachment
While preparing this for commit, I noticed that the expression index part
of the query was disregarding attstattarget. To fix, I've modified that
part to look at the index's pg_attribute entries.
+1, should have been there all along.
Committed with the following small changes: * I renamed the new option to --missing-stats-only, which I felt was more descriptive. * I moved the new tests to the main vacuumdb test file and interspersed some --analyze-only uses. * I fixed a couple of other things in the new parts of the catalog query that were not fully qualified. -- nathan