Re: [PATCH] pgbench log file headers - Mailing list pgsql-hackers

From Adam Hendel
Subject Re: [PATCH] pgbench log file headers
Date
Msg-id CABfuTgiou00261Mt5P3MHQ4fiw50EN793C=+T5yHJtQBpBvEVw@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] pgbench log file headers  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello


On Mon, Nov 13, 2023 at 6:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-11-13 11:55:07 -0600, Adam Hendel wrote:
> Currently, pgbench will log individual transactions to a logfile when the
> `--log` parameter flag is provided. The logfile, however, does not include
> column header. It has become a fairly standard expectation of users to have
> column headers present in flat files. Without the header in the pgbench log
> files, new users must navigate to the docs and piece together the column
> headers themselves. Most industry leading frameworks have tooling built in
> to read column headers though, for example python/pandas read_csv().

The disadvantage of doing that is that a single pgbench run with --log will
generate many files when using -j, to avoid contention. The easiest way to
deal with that after the run is to cat all the log files to a larger one. If
you do that with headers present, you obviously have a few bogus rows (the
heades from the various files).

I guess you could avoid the "worst" of that by documenting that the combined
log file should be built by
  cat {$log_prefix}.${pid} {$log_prefix}.${pid}.*
and only outputting the header in the file generated by the main thread.
 
> We can improve the experience for users by adding column headers to pgbench
> logfiles with an optional command line flag, `--log-header`. This will keep
> the API backwards compatible by making users opt-in to the column headers.
> It follows the existing pattern of having conditional flags in pgbench’s
> API; the `--log` option would have both –log-prefix and –log-header if this
> work is accepted.

> The implementation considers the column headers only when the
> `--log-header` flag is present. The values for the columns are taken
> directly from the “Per-Transaction Logging” section in
> https://www.postgresql.org/docs/current/pgbench.html and takes into account
> the conditional columns `schedule_lag` and `retries`.
>
>
> Below is an example of what that logfile will look like:
>
>
> pgbench  postgres://postgres:postgres@localhost:5432/postgres --log
> --log-header
>
> client_id transaction_no time script_no time_epoch time_us

Independent of your patch, but we imo ought to combine time_epoch time_us in
the log output. There's no point in forcing consumers to combine those fields,
and it makes logging more expensive...  And if we touch this, we should just
switch to outputting nanoseconds instead of microseconds.

It also is quite odd that we have "time" and "time_epoch", "time_us", where
time is the elapsed time of an individual "transaction" and time_epoch +
time_us together are a wall-clock timestamp.  Without differentiating between
those, the column headers seem not very useful, because one needs to look in
the documentation to understand the fields.

I don't think there's all that strong a need for backward compatibility in
pgbench. We could just change the columns as I suggest above and then always
emit the header in the "main" log file.


Do you think this should be done in separate patches?

First patch: log the column headers in the "main" log file. No --log-header flag, make it the default behavior of --log.

Next patch: collapse "time_epoch" and "time_us" in the log output and give the "time" column a name that is more clear.

Adam
 
Greetings,

Andres Freund


 

pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: trying again to get incremental backup
Next
From: Daniel Gustafsson
Date:
Subject: Re: Allow tests to pass in OpenSSL FIPS mode