Re: POC: Extension for adding distributed tracing - pg_tracing - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: POC: Extension for adding distributed tracing - pg_tracing
Msg-id 787b5ad0-f727-473c-ba25-5b9f6813fb80@iki.fi
In response to Re: POC: Extension for adding distributed tracing - pg_tracing  (Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>)
Responses Re: POC: Extension for adding distributed tracing - pg_tracing
List pgsql-hackers

Hi Anthonin,

I'm only now catching up on this thread. Very exciting feature!

My first observation is that you were able to implement this purely as 
an extension, without any core changes. That's very cool, because:

- You don't need buy-in or blessing from anyone else. You can just put 
this on GitHub and people can immediately start using it.
- As an extension, it's not tied to the PostgreSQL release cycle. Users 
can immediately start using it with existing PostgreSQL versions.

There are benefits to being in contrib/ vs. an extension outside the 
PostgreSQL source tree, but there are significant downsides too. 
Out-of-tree, you get more flexibility, and you can develop much faster. 
I think this should live out-of-tree, at least until it's relatively 
stable. By stable, I mean "not changing much", not that it's bug-free.

# Core changes

That said, there are a lot of things that would make sense to integrate 
in PostgreSQL itself. For example:

- You're relying on various existing hooks, but it might be better to 
have the 'begin_span'/'end_span' calls directly in PostgreSQL source 
code in the right places. There are more places where spans might be 
nice, such as a large sort in tuplesort.c, to name one example. Or even 
around every I/O; I'm not sure how much overhead the spans incur or how 
much we'd be willing to accept. 'begin_span'/'end_span' could be new 
hooks that normally do nothing, but your extension would implement them.

- Other extensions could include begin_span/end_span calls too. (It's 
complicated for an extension to call functions in another extension.)

- Extensions like postgres_fdw could get the tracing context and 
propagate it to the remote system too.

- How to pass the tracing context from the client? You went with 
SQLCommenter and others proposed a GUC. A protocol extension would make 
sense too. I can see merits to all of those. It probably would be good 
to support multiple mechanisms, but some might need core changes. It 
might be good to implement the mechanism for accepting trace context in 
core. Without the extension, we could still include the trace ID in the 
logs, for example.
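
To make the "hooks that normally do nothing" idea from the first bullet 
concrete, here is a minimal standalone sketch. It does not use any 
PostgreSQL headers, and every name in it (begin_span_hook, 
end_span_hook, demo_perform_sort, and so on) is hypothetical; none of 
these exist in PostgreSQL today. It only illustrates the usual pattern 
of a function pointer that is NULL until an extension installs a 
callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical hook types; not an existing PostgreSQL API. */
typedef void (*begin_span_hook_type) (const char *span_name);
typedef void (*end_span_hook_type) (const char *span_name);

/* NULL by default: without an extension, a span costs one pointer check. */
static begin_span_hook_type begin_span_hook = NULL;
static end_span_hook_type end_span_hook = NULL;

/* What instrumented code would call; a no-op unless a hook is set. */
static inline void
begin_span(const char *span_name)
{
    assert(span_name != NULL);
    if (begin_span_hook)
        begin_span_hook(span_name);
}

static inline void
end_span(const char *span_name)
{
    assert(span_name != NULL);
    if (end_span_hook)
        end_span_hook(span_name);
}

/* Stand-in for an instrumented routine, e.g. a large sort. */
void
demo_perform_sort(void)
{
    begin_span("sort");
    /* ... the real work would happen here ... */
    end_span("sort");
}

/* Callbacks a tracing extension might provide (assumed, for illustration). */
static int spans_started = 0;
static int spans_ended = 0;

static void
demo_begin_span(const char *span_name)
{
    spans_started++;
    printf("begin span: %s\n", span_name);
}

static void
demo_end_span(const char *span_name)
{
    spans_ended++;
    printf("end span: %s\n", span_name);
}

/* An extension would do this at load time, e.g. from its _PG_init(). */
void
demo_install_tracing_hooks(void)
{
    begin_span_hook = demo_begin_span;
    end_span_hook = demo_end_span;
}
```

Calling demo_perform_sort() before demo_install_tracing_hooks() does 
nothing beyond the two pointer checks; after installation, each call 
reports one begin and one end. A real design would probably also chain 
to a previously installed hook, as existing PostgreSQL hooks 
conventionally do.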

# Further ideas

Some ideas on cool extra features on top of this:

- SQL functions to begin/end a span. Could be handy if you have 
complicated PL/pgSQL functions, for example.
- Have spans over subtransactions.
- Include elog() messages in the traces. You might want to have a lower 
elevel for what's included in traces; something like the 
client_min_messages and log_min_messages settings.
- Include EXPLAIN plans in the traces.

# Comments on the implementation

There was discussion already on push vs pull model. Currently, you 
collect the traces in memory / files, and require a client to poll them. 
A push model is more common in applications that support tracing. If 
Postgres could connect directly to the OTEL collector, you'd need one 
fewer running component. You could keep the traces in backend-private 
memory, no need to deal with shared memory and spilling to files.

Admittedly, supporting the OTEL wire protocol is complicated unless you 
use an existing library. Nevertheless, it might be a better tradeoff. 
OTEL/HTTP with JSON format seems just about feasible to implement from 
scratch. Or bite the bullet and use some external library. If this lives 
as an extension, you have more freedom to rely on 3rd party libraries.
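
As a rough feel for the "from scratch" option, the sketch below formats 
a single span as an OTLP/HTTP JSON request body using nothing but 
snprintf. The DemoSpan struct, the function name, the "postgresql" 
service name and the "demo" scope name are all assumptions made for 
illustration, not anything from pg_tracing. It relies on my 
understanding of the OTLP JSON encoding: trace and span IDs are hex 
strings, and 64-bit timestamps are written as decimal strings. It also 
assumes the span name needs no JSON escaping, which real code could not 
assume.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory span; field layout is an assumption. */
typedef struct DemoSpan
{
    const char *trace_id;   /* 32 lowercase hex characters */
    const char *span_id;    /* 16 lowercase hex characters */
    const char *parent_id;  /* 16 hex characters, or "" for a root span */
    const char *name;       /* assumed to contain no characters needing escaping */
    uint64_t    start_ns;   /* Unix epoch, nanoseconds */
    uint64_t    end_ns;     /* Unix epoch, nanoseconds */
} DemoSpan;

/*
 * Write one span as an OTLP/HTTP JSON trace export body into buf.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer is too small or formatting failed.
 */
int
demo_span_to_otlp_json(const DemoSpan *span, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen,
                     "{\"resourceSpans\":[{"
                     "\"resource\":{\"attributes\":[{\"key\":\"service.name\","
                     "\"value\":{\"stringValue\":\"postgresql\"}}]},"
                     "\"scopeSpans\":[{\"scope\":{\"name\":\"demo\"},"
                     "\"spans\":[{"
                     "\"traceId\":\"%s\",\"spanId\":\"%s\","
                     "\"parentSpanId\":\"%s\","
                     "\"name\":\"%s\",\"kind\":1,"  /* 1 = internal span */
                     "\"startTimeUnixNano\":\"%" PRIu64 "\","
                     "\"endTimeUnixNano\":\"%" PRIu64 "\""
                     "}]}]}]}",
                     span->trace_id, span->span_id, span->parent_id,
                     span->name, span->start_ns, span->end_ns);

    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

The resulting body would be sent with an HTTP POST to the collector's 
/v1/traces path with Content-Type: application/json. The HTTP client 
itself, batching, retries and proper string escaping are left out, and 
those are likely where most of the real effort would go.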

(If you publish this as an out-of-tree extension, then this is up to 
you, of course, and doesn't need consensus here on pgsql-hackers.)

# Suggested next steps with this

Here's how I'd suggest proceeding with this:

1. Publish this as an extension on GitHub, as it is. I think you'll get 
a lot of immediate adoption.

2. Start a new pgsql-hackers thread on in-core changes that would 
benefit the extension. Write patches for them. I'm thinking of the 
things I listed in the Core changes section above, but maybe there are 
others.


PS. Does any other DBMS implement this? Any lessons to be learned from them?

-- 
Heikki Linnakangas
Neon (https://neon.tech)



