Re: Observability in Postgres - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Observability in Postgres
Date
Msg-id CABUevEyDT06CheppWoW9k5sGhpVTcXbvyys81JyD27Hdq+4RFQ@mail.gmail.com
In response to Re: Observability in Postgres  (Greg Stark <stark@mit.edu>)
Responses Re: Observability in Postgres
List pgsql-hackers
On Tue, Feb 15, 2022 at 11:24 PM Greg Stark <stark@mit.edu> wrote:
>
> On Tue, 15 Feb 2022 at 16:43, Magnus Hagander <magnus@hagander.net> wrote:
> >
> > On Tue, Feb 15, 2022 at 1:30 PM Dave Page <dpage@pgadmin.org> wrote:
> > >
> > > - Does it really matter if metrics are exposed on a separate port from the postmaster? I actually think doing
> > > that is a good thing as it allows use of alternative listen addresses and firewalling rules; you could then
> > > confine the monitoring traffic to a management VLAN, for example.
> >
> > +1. I think it would be much better to keep it on a separate port.
> >
> > Doesn't even have to be to the point of VLANs or whatever. You just
> > want your firewall rules to be able to know what data it's talking
> > about.
>
> I would definitely want that to be an option that could be configured.
> If you're deploying a server to be accessible as a public service and
> configuring firewall rules etc then sure you probably want to be very
> explicit about what is listening where.
>
> But when you're deploying databases automatically in a clustered type
> environment you really want a service to deploy on a given port and
> have the monitoring associated with that port as well. If you deploy
> five databases you don't want to have to deal with five other ports
> for monitoring and then have to maintain a database of which
> monitoring ports are associated with which service ports.... It's
> definitely doable -- that's what people do today -- but it's a pain
> and it's fragile and it's different at each site which makes it
> impossible for dashboards to work out of the box.

I really don't see the problem with having the monitoring on a different port.

I *do* see the problem with having a different monitoring port for
each database in a cluster, if that's what you're saying. Definitely.

But if it's 5432 for the database and 8432 for the monitoring for
example, I'd see that as an improvement. And if you're deploying a
larger cluster you're auto-configuring these things anyway, so having
your environment always set "monitoring port = database port + 3000",
for example, should be trivial.
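As an illustration of that convention, a fixed-offset derivation is trivial to script. (The offset of 3000 and the function name are just the example from this thread, not anything PostgreSQL defines.)

```python
# Sketch: derive a per-instance metrics port from the database port by
# a fixed offset, so dashboards need no separate port registry.
# The offset of 3000 is purely the example used in this thread.

def metrics_port(db_port: int, offset: int = 3000) -> int:
    port = db_port + offset
    if not (1 <= port <= 65535):
        raise ValueError(f"derived port {port} is out of range")
    return port

print(metrics_port(5432))  # -> 8432
```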


> > Another part missing in the proposal is how to deal with
> > authentication. That'll be an even harder problem if it sits on the
> > same port but speaks a different protocol. How would it work with
> > pg_hba etc?
>
> Wouldn't it make it easier to work with pg_hba? If incoming
> connections are coming through pg_hba then postmaster gets to accept
> or refuse the connection based on the host and TLS information. If
> it's listening on a separate port then unless that logic is duplicated
> it'll be stuck in a parallel world with different security rules.

I guess you could map it against yet another virtual database, like we
do with streaming replication.
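For comparison, streaming replication is matched in pg_hba.conf via the special "replication" keyword; a metrics equivalent might look like the second line below. (The "metrics" keyword is not an existing feature, purely a sketch of the idea.)

```
# Existing special keyword for walsender connections:
host    replication    all    10.0.0.0/24    scram-sha-256

# Hypothetical analogue for a metrics endpoint (does not exist today):
host    metrics        all    10.0.0.0/24    scram-sha-256
```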


> I'm not actually sure how to make this work. There's a feature in Unix
> where a file descriptor can be passed over from one process to another
> over a socket but that's gotta be a portability pain. And starting a
> new worker for each incoming connection would be a different pain.
>
> So right now I'm kind of guessing this might be just a hook in
> postmaster that we can experiment with in the module. The hook would
> just return a flag to postmaster saying the connection was handled.

If it was as easy as username/password you could just have a comm
channel between postmaster and a bgworker for example. But you also
have to implement things like GSSAPI authentication.

But I think you'll run into a different problem much earlier. Pretty
much everything out there is going to want to speak http(s). How are
you going to terminate that, especially https, on the same port as a
PostgreSQL connection? PostgreSQL will have to reply with its initial
negotiating byte before anything else is done, including the TLS
negotiation, and that will kill anything http.
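To make the clash concrete: a TLS ClientHello begins with record type byte 0x16, while a PostgreSQL client opens with an SSLRequest packet (length 8, magic code 80877103) and then waits for the server's single 'S'/'N' byte. A hypothetical same-port demultiplexer would have to sniff the client's first bytes before answering; this sketch only classifies those bytes and is not how postmaster works:

```python
# Sketch: classify the first bytes a client sends, to show how a
# hypothetical same-port demultiplexer would have to tell an HTTPS
# client apart from a PostgreSQL client. Illustrative only.

import struct

PG_SSLREQUEST_CODE = 80877103  # (1234 << 16) | 5679, per the protocol docs

def classify_first_bytes(data: bytes) -> str:
    # A TLS handshake record (e.g. from an HTTPS client) starts with 0x16.
    if data[:1] == b"\x16":
        return "tls"
    # A PostgreSQL SSLRequest is a fixed 8-byte packet: length then code.
    if len(data) >= 8:
        length, code = struct.unpack("!II", data[:8])
        if length == 8 and code == PG_SSLREQUEST_CODE:
            return "postgres-sslrequest"
    return "unknown"

print(classify_first_bytes(b"\x16\x03\x01"))                  # tls
print(classify_first_bytes(struct.pack("!II", 8, 80877103)))  # postgres-sslrequest
```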

And if your metrics endpoint isn't going to speak http, you've given
up the ability for the "plug and play setup".


> > There's good and bad with it. The big "good" with it is that it's an
> > open standard (openmetrics). I think supporting that would be a very
> > good idea. But it would also be good to have a different, "richer",
> > format available. Whether it'd be worth going the full "postgresql
> > way" and make it pluggable is questionable, but I would suggest at
> > least having both openmetrics and a native/richer one, and not just
> > the latter. Being able to just point your existing monitoring system
> > at a postgres instance (with auth configured) and have things just
> > show up is in itself a large value. (Then either pluggable or hooks
> > beyond that, but having both those as native)
>
> Ideally I would want to provide OpenMetrics data that doesn't break
> compatibility with OpenTelemetry -- which I'm still not 100% sure I
> understand but I gather that means following certain conventions about
> metadata. But those standards only have quantitative metrics, no rich
> structured data.

Yeah. That's why I think the reasonable thing is to provide both.
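For reference, the OpenMetrics text exposition is a simple line-oriented format; a PostgreSQL exposition might look roughly like this. (The metric names here are illustrative, not from any existing exporter.)

```
# TYPE pg_up gauge
pg_up 1
# TYPE pg_xact_commit counter
pg_xact_commit_total{datname="postgres"} 12345
# EOF
```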


> I assume the idea is that that kind of rich structured data belongs in
> some other system. But I definitely see people squeezing it into
> metrics. For things like replication topology for example.... I would
> love to have a

.... love to have a completed sentence there? :)


> Personally I feel similarly about the inefficiency but I think the
> feeling is that compression makes it irrelevant. I suspect there's a
> fair amount of burnout over predecessors like SNMP that went to a lot
> of trouble to be efficient and implementations were always buggy and
> impenetrable as a result. (The predecessor in Google had some features
> that made it slightly more efficient too but also made it more
> complex. It seems intentional that they didn't carry those over too)

Given the amount of metrics that people pull in through prometheus or
similar today, I think most have just accepted the overhead. You can
of course serve it over protobufs to make it more efficient, but yes,
the basic overhead is still there. Overall, though, the market has
spoken and accepted it.


> Fwiw one constant source of pain is the insistence on putting
> everything into floating point numbers. They have 53 bits of precision
> and that leaves us not quite being able to represent an LSN or 64-bit
> xid for example.

Yeah, it's clearly designed around certain types of metrics only...
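That limitation is easy to demonstrate: IEEE 754 doubles carry 53 significand bits, so 64-bit values such as an LSN or a full transaction id can silently lose their low bits in a float-based pipeline:

```python
# Illustrates the precision complaint: doubles have 53 significand
# bits, so 64-bit quantities like an LSN or a full transaction id
# cannot round-trip through a float-valued metric.

big_xid = 2**60 + 1               # stand-in for a 64-bit xid or LSN
assert float(big_xid) != big_xid  # the low bit is rounded away

# The first integer a double cannot represent exactly is 2**53 + 1:
print(int(float(2**53 + 1)))  # -> 9007199254740992 (the +1 is lost)
```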

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


