Re: pg_stat_io_histogram - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pg_stat_io_histogram
Date
Msg-id 4ffe367d-77ac-488a-8826-64efb78d6f78@vondra.me
In response to Re: pg_stat_io_histogram  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: pg_stat_io_histogram
List pgsql-hackers
Hi,

I've looked at this patch today, as it's tangentially related to the
patch adding prefetch stats to EXPLAIN. An earlier WIP version of the
EXPLAIN patch included a couple of histograms (not for timings, but for
IO sizes, prefetch distances, etc.), and during development they were
quite useful. We decided to keep EXPLAIN simpler, but having that at
least in a pg_stat_io_histogram view would be nice. So +1 to this.

I went through the thread today. A lot has been discussed, most issues
seem to have been solved, but what are the outstanding ones? I think it
would be helpful if someone involved (=Jakub) could write a brief
summary, listing the various open questions.


Now let me add a couple open questions of my own ;-)

My understanding is that the performance (i.e. in terms of CPU) is fine.
I'm running some tests on my own, both to check the behavior and to
learn about it. And so far the results show no change in performance,
it's all within 1% of master. So that's fine.


memory usage
------------
AFAICS the primary outstanding issue is how to represent the
histograms in each backend, so that it's not wasteful but also not
overly complex. I'm not sure what the current situation is, and how
far from acceptable it is.


histogram range
---------------
Another question is what the range of the histogram should be, and
whether the buckets should be uint32 or uint64. It's somewhat related
to the previous question, because smaller histograms need less memory
(obviously). I think the simpler the better, which means fixed-size
histograms, with a fixed number of buckets of equal size (e.g. uint64).

But that implies limited range / precision, so I think we need to decide
whether we prefer accurate buckets for low or high latencies.

The motivation for adding the histograms was investigating performance
issues with storage, which involve high latencies. So that favors
better tracking of higher latencies (and accepting lower resolution for
buckets close to 0). In v7 the first bucket is [0,8) microsecs, which
to me seems unnecessarily detailed. What if we started with 64us?
That'd get us to ~1s in the last bucket, and I'd imagine that's enough.
We could use the last bucket for "latencies above 1s". If you have a
lot of latencies beyond 1s, you have serious problems.
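To make the bucket layout concrete, here's a minimal standalone sketch
of what I have in mind - 16 fixed buckets, the first covering [0,64us),
each later bucket doubling in width, the last one catching everything
from ~1s up. All names are made up for illustration, this is not the
patch's code:

```c
#include <stdint.h>

/* hypothetical layout: bucket 0 is [0, 64us), widths double after that */
#define IOHIST_FIRST_US		64
#define IOHIST_NBUCKETS		16

static int
iohist_bucket(uint64_t latency_us)
{
	int		bucket = 0;

	/* each doubling of the latency moves one bucket up */
	while (bucket < IOHIST_NBUCKETS - 1 &&
		   latency_us >= ((uint64_t) IOHIST_FIRST_US << bucket))
		bucket++;

	/* bucket 15 starts at 64us << 14 = ~1.05s, i.e. "above 1s" */
	return bucket;
}
```

A real implementation would of course compute the bucket directly from
the position of the leftmost one bit (pg_leftmost_one_pos64) instead of
looping, but the resulting buckets are the same.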

Yes, you can get 10us with NVMe, and so if everything works OK
everything will fall into the first bucket. So what? I think it's a
reasonable trade off. We have to compromise somewhere.

Alternatively, we could make the histograms more complex. We could learn
a thing or two from ddsketch, for example - it can dynamically change
the range of the histogram, depending on input.

We could also make the buckets variable-sized. With exponentially
growing buckets the widths differ, and assuming a uniform distribution
bucket N would get ~1/2 the matches of bucket (N+1), so it could be
represented with one fewer bit. But that adds complexity, and IO
latencies are unlikely to be uniformly distributed.

Alternatively we could use uint32 buckets, as proposed by Andres:

> I guess we could count IO as 4 byte integers, and shift all bucket
> counts down in the rare case of an on overflow. It's just a 2x
> improvement, but ...

That'd mean we start sampling the latencies, and only add 50% of them
to the histogram. And we may need to do that repeatedly, cutting the
sample rate in half every time. Which is probably fine for the purpose
of this view, but it adds complexity, and it means you have to "undo"
this when displaying the data. Otherwise it'd be impossible to combine
or compare histograms.
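Roughly like this - a standalone sketch of the overflow/sampling idea
with made-up names, not the patch's code. On overflow all counts get
halved and the sample rate is cut in half, and the reader multiplies
the counts back by 2^shift (the "undo" step):

```c
#include <stdint.h>

#define NBUCKETS	16

typedef struct IOHist
{
	uint32_t	counts[NBUCKETS];
	uint64_t	nsamples;	/* samples seen, for deterministic sampling */
	uint32_t	shift;		/* each recorded count stands for 2^shift I/Os */
} IOHist;

static void
iohist_add(IOHist *h, int bucket)
{
	/* after an overflow, only every 2^shift-th sample gets recorded */
	if ((h->nsamples++ & (((uint64_t) 1 << h->shift) - 1)) != 0)
		return;

	if (h->counts[bucket] == UINT32_MAX)
	{
		/* halve all counts, remember the lost factor of 2 */
		for (int i = 0; i < NBUCKETS; i++)
			h->counts[i] >>= 1;
		h->shift++;
	}
	h->counts[bucket]++;
}

/* the "undo" when displaying, so histograms remain comparable */
static uint64_t
iohist_read(const IOHist *h, int bucket)
{
	return (uint64_t) h->counts[bucket] << h->shift;
}
```

Even in this simple form the view would have to expose (or apply) the
shift, otherwise two histograms with different shifts can't be added
or compared.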

Anyway, what I'm trying to say is that we should keep the histograms
simple, at least for now.


other histograms
----------------
As mentioned, the EXPLAIN PoC patch had I/O histograms for I/O sizes,
in-progress I/Os, etc. I wonder if maybe it'd make sense to have
something like that in pg_stat_io_histogram too. Of course, that goes
against the effort to reduce the memory usage etc.


a couple minor review comments
------------------------------

1) There seems to be a bug in calculating the buckets in the SRF:

       backend_type    |  ...  | bucket_latency_us |  ...
    -------------------+- ... -+-------------------+- ...
     client backend    |  ...  | [0,9)             |  ...
     client backend    |  ...  | [8,17)            |  ...
     client backend    |  ...  | [16,33)           |  ...
     client backend    |  ...  | [32,65)           |  ...
     client backend    |  ...  | [64,129)          |  ...

Notice the upper boundary includes the lower boundary of the next
bucket. It should be [0,8), [8,...). Presumably the upper bound is
built as inclusive, and discrete ranges canonicalize an inclusive
upper bound of 8 into an exclusive bound of 9, hence the off-by-one.
pg_stat_io_histogram_build_tuples should probably set
"upper.inclusive = false;".
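A standalone illustration of the off-by-one (not the patch's code):
mimicking the canonicalization of discrete ranges, an inclusive upper
bound b is rewritten as an exclusive bound b+1, so building the bucket
as [0,8] comes out as "[0,9)" in the view:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* mimics how a discrete range type would print the bucket bounds */
static void
format_bucket(char *buf, size_t len,
			  uint64_t lower, uint64_t upper, bool upper_inclusive)
{
	if (upper_inclusive)
		upper++;				/* int4range-style canonicalization */
	snprintf(buf, len, "[%llu,%llu)",
			 (unsigned long long) lower, (unsigned long long) upper);
}
```

With upper_inclusive=false the same bounds print as the expected
"[0,8)".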

2) This change in monitoring.sgml seems wrong:

   <structname>pg_stat_io_histogram</structname> set of views ...

AFAICS it should still say "pg_stat_io set of views", but maybe it
should mention the pg_stat_io_histogram too.

3) pg_leftmost_one_pos64 does this:

  #if SIZEOF_LONG == 8
    return 63 - __builtin_clzl(word);
  #elif SIZEOF_LONG_LONG == 8
    return 63 - __builtin_clzll(word);
  #else

Shouldn't pg_leading_zero_bits64 do the same thing?
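Something along these lines, I'd think. A standalone sketch - in the
actual patch this would test the configure-generated SIZEOF_LONG /
SIZEOF_LONG_LONG macros like pg_leftmost_one_pos64 does; the ULONG_MAX
test below is only there to keep the sketch self-contained. As with
the __builtin_clz* family, word must not be zero:

```c
#include <limits.h>
#include <stdint.h>

/* pick the __builtin_clz* variant matching the 64-bit integer type */
static inline int
pg_leading_zero_bits64(uint64_t word)
{
#if ULONG_MAX == UINT64_MAX
	return __builtin_clzl((unsigned long) word);
#else
	return __builtin_clzll((unsigned long long) word);
#endif
}
```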

4) I'm not sure about this pattern:

    PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
    if(bktype == -1)
        continue;

Maybe it won't trigger an out-of-bounds read, if the compiler is smart
enough to delay the access until the pointer is actually needed. But it
seems confusing / wrong, and I don't think we do this elsewhere. For
example the functions in pgstat_io.c do this:

    PgStat_BktypeIO *bktype_shstats;
    if (bktype == -1)
        continue;
    bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];


regards

-- 
Tomas Vondra



