Re: tests against running server occasionally fail, postgres_fdw & tenk1 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: tests against running server occasionally fail, postgres_fdw & tenk1
Date
Msg-id 20230209002935.2mq47txxcpvcgyqv@awork3.anarazel.de
Whole thread Raw
In response to Re: tests against running server occasionally fail, postgres_fdw & tenk1  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: tests against running server occasionally fail, postgres_fdw & tenk1
List pgsql-hackers
Hi,

On 2023-02-08 14:03:49 -0800, Peter Geoghegan wrote:
> On Tue, Feb 7, 2023 at 6:47 PM Andres Freund <andres@anarazel.de> wrote:
> > One thing I'm not quite sure what to do about is that we atm use a hardcoded
> > DEBUG2 (not controlled by VERBOSE) in a bunch of places:
> >
> >         ereport(DEBUG2,
> >                         (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> >                                         vacrel->relname, (long long) index, vacuumed_pages)));
> >
> >     ivinfo.message_level = DEBUG2;
> >
> > I find DEBUG2 hard to use to run the entire regression tests, it results in a
> > lot of output. Lots of it far less important than these kinds of details
> > here. So I'd like to use a different log level for them - but without further
> > complications that'd mean they'd show up in VACUUUM VERBOSE.
>
> I think that these DEBUG2 traces are of limited value, even for
> experts. I personally never use them -- I just use VACUUM
> VERBOSE/autovacuum logging for everything, since it's far easier to
> read, and isn't missing something that the DEBUG2 traces have. In fact
> I wonder if we should even have them at all.

I find it useful information when debugging problems. Without it, the log
doesn't tell you which index was processed when a problem started to occur. Or
even that we were scanning indexes at all.

IOW, I care a lot more about log entries denoting the start / end of an index
vacuum than attached numbers. Although I also care about those, to some
degree.



> I generally don't care about the details when VACUUM runs out of space
> for dead_items -- these days the important thing is to just avoid it
> completely.

I wonder if it'd possibly make sense to log more verbosely if we do end up
running out of space, but not otherwise? Particularly because it's important
to avoid multi-pass vacuums? Right now there's basically no log output until
the vacuum finished, which in a bad situation could take days, with many index
scan cycles. Logging enough to be able to see such things happening IMO is
important.



> I also generally don't care about how many index tuples
> were deleted by each index's ambulkdelete() call, since VACUUM VERBOSE
> already shows me the number of LP_DEAD stubs encountered/removed in
> the heap.

Isn't it actually quite useful to see how many entries were removed from the
index relative to the number of LP_DEAD removed in the heap? And relative to
other indexes? E.g. to understand how effective killtuple style logic is?


One annoyance with the multiple logs as part of a [auto]vacuum is that we end
up logging the context / statement repeatedly, leading to pointless output
like this:

2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] LOG:  vacuuming "postgres.public.large"
2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] LOG:  scanned index "large_pkey" to remove 499994
rowversions
 
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] CONTEXT:  while vacuuming index "large_pkey" of
relation"public.large"
 
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
...
2023-02-08 15:55:03.496 PST [3904676][client backend][2/55:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG:  vacuuming
"postgres.pg_toast.pg_toast_3370138"
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] STATEMENT:  VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG:  finished vacuuming
"postgres.pg_toast.pg_toast_3370138":index scans: 0
 


I think we should emit most of the non-error/warning messages in vacuum with
errhidestmt(true), errhidecontext(true) to avoid that. The error context is
quite helpful for error messages due to corruption, cancellations and such,
but not for messages where we already are careful to include relation names.


I generally like the improved log output for [auto]vacuum, but the fact that
you can't see anymore when index scans start imo problematic. Right now you
can't even infer how long the first index scan takes, which really isn't
great.


I'd thus like to:

1) Use errhidestmt(true), errhidecontext(true) for vacuum
   ereport(non-error-or-warning)

2) Add a message to lazy_vacuum() or lazy_vacuum_all_indexes(), that includes
   - num_index_scans
   - how many indexes we'll scan
   - how many dead tids we're working on removing

3) Add a log at the end of lazy_vacuum_heap_rel() that's logged only (or more
   verbosely) when lazy_vacuum() was run due to running out of space

   If we just do the heap scan once, this can be easily inferred from the
   other messages.

4) When we run out of space for dead tids, increase the log level for the rest
   of vacuum. It's sufficiently bad if that happens that we really ought to
   include it in the log by default.

5) Remove the row versions from vac_bulkdel_one_index()'s message, it's
   already included in 2).

   Instead we could emit the content from vac_cleanup_one_index(), imo that's
   a lot more useful when separated for each scan.

6) Possibly remove the log output from vac_cleanup_one_index()?


2) and 3) together allow to figure out how long individual scan / vacuum
phases are taking. 1) should reduce log verbosity sufficiently to make it
easier to actually read the output.


FWIW, I'm not proposing to do all of that in one patch, once I understand a
bit more what we have concensus and what not I can split it into steps.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Weird failure with latches in curculio on v15
Next
From: Peter Smith
Date:
Subject: Re: [PATCH] Add pretty-printed XML output option