Thread: libpq: Fix lots of discrepancies in PQtrace
After initially trying to add trace support for StartupMessage/SSLRequest/GSSENCRequest[1] I realized there were many more cases where PQtrace was not correctly tracing messages, or not even tracing them at all. These patches fix all of the issues that I was able to find. 0001 is some cleanup after f4b54e1ed9 0002 does some preparatory changes for 0004 & 0007 All the others improve the tracing, and apart from 0004 and 0007 depending on 0002, none of these patches depend on each other. Although you could argue that 0007 and 0008 depend on 0006, because without 0006 the code added by 0007 and 0008 won't ever really be executed. To test you can add a PQreset(conn) call to the start of the test_cancel function in: src/test/modules/libpq_pipeline/libpq_pipeline.c. And then run: ninja -C build all install-quiet && build/src/test/modules/libpq_pipeline/libpq_pipeline cancel 'port=5432' -t test.trace And then look at the top of test.trace [1]: https://www.postgresql.org/message-id/CAGECzQTTN5aGqtDaRifJXPyd_O5qHWQcOxsHJsDSVNqMugGNEA%40mail.gmail.com
Attachment
- v1-0001-libpq-Use-PqMsg-macros-in-fe-auth.c.patch
- v1-0002-libpq-Add-suppress-argument-to-pqTraceOutputNchar.patch
- v1-0003-libpq-Trace-StartupMessage-SSLRequest-GSSENCReque.patch
- v1-0004-libpq-Trace-frontend-authentication-challenges.patch
- v1-0005-libpq-Trace-responses-to-SSLRequest-and-GSSENCReq.patch
- v1-0006-libpq-Trace-all-messages-received-from-the-server.patch
- v1-0007-libpq-Trace-server-Authentication-messages-in-det.patch
- v1-0008-libpq-Trace-NegotiateProtocolVersion-correctly.patch
On Fri, Jun 21, 2024 at 11:22:05AM +0200, Jelte Fennema-Nio wrote: > 0001 is some cleanup after f4b54e1ed9 Oops. I'll plan on committing this after the 17beta2 release freeze is lifted. -- nathan
On Fri, Jun 21, 2024 at 04:01:55PM -0500, Nathan Bossart wrote: > On Fri, Jun 21, 2024 at 11:22:05AM +0200, Jelte Fennema-Nio wrote: >> 0001 is some cleanup after f4b54e1ed9 > > Oops. I'll plan on committing this after the 17beta2 release freeze is > lifted. Committed 0001. -- nathan
On 2024-Jun-26, Nathan Bossart wrote: > On Fri, Jun 21, 2024 at 04:01:55PM -0500, Nathan Bossart wrote: > > On Fri, Jun 21, 2024 at 11:22:05AM +0200, Jelte Fennema-Nio wrote: > >> 0001 is some cleanup after f4b54e1ed9 > > > > Oops. I'll plan on committing this after the 17beta2 release freeze is > > lifted. > > Committed 0001. Thanks, Nathan. I'm holding myself responsible for the rest ... will handle soon after the branch. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The problem with the future is that it keeps turning into the present" (Hobbes)
On Wed, 26 Jun 2024 at 18:36, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Thanks, Nathan. I'm holding myself responsible for the rest ... will > handle soon after the branch. Sounds great. Out of curiosity, what is the backpatching policy for something like this? Honestly most of these patches could be considered bugfixes in PQtrace, so backpatching might make sense. OTOH I don't think PQtrace is used very much in practice, so backpatching might carry more risk than it's worth.
On Wed, Jun 26, 2024 at 10:02:08PM +0200, Jelte Fennema-Nio wrote: > On Wed, 26 Jun 2024 at 18:36, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Thanks, Nathan. I'm holding myself responsible for the rest ... will > > handle soon after the branch. > > Sounds great. Out of curiosity, what is the backpatching policy for > something like this? Honestly most of these patches could be > considered bugfixes in PQtrace, so backpatching might make sense. OTOH > I don't think PQtrace is used very much in practice, so backpatching > might carry more risk than it's worth. 0001 getting on HEAD after the feature freeze as a cleanup piece cleanup is no big deal. That's cosmetic, still OK. Looking at the whole, the rest of the patch set qualifies as a new feature, even if they're aimed at closing existing gaps. Particularly, you have bits of new infrastructure introduced in libpq like the current_auth_response business in 0004, making it a new feature by structure. + conn->current_auth_response = AUTH_RESP_PASSWORD; ret = pqPacketSend(conn, PqMsg_PasswordMessage, pwd_to_send, strlen(pwd_to_send) + 1); + conn->current_auth_response = AUTH_RESP_NONE; It's a surprising approach. Future callers of pqPacketSend() and pqPutMsgEnd() would easily miss that this flag should be set, as much as reset. Isn't that something that should be added in input of these functions? + AuthResponseType current_auth_response; I'd recommend to document what this flag is here for, with a comment. -- Michael
Attachment
On Thu, 27 Jun 2024 at 07:39, Michael Paquier <michael@paquier.xyz> wrote: > Looking at the whole, the rest of the patch set qualifies as a new > feature, even if they're aimed at closing existing gaps. Alright, seems reasonable. To be clear, I don't care at all about this being backported personally. > Particularly, you have bits of new infrastructure introduced in libpq > like the current_auth_response business in 0004, making it a new > feature by structure. > > + conn->current_auth_response = AUTH_RESP_PASSWORD; > ret = pqPacketSend(conn, PqMsg_PasswordMessage, pwd_to_send, strlen(pwd_to_send) + 1); > + conn->current_auth_response = AUTH_RESP_NONE; > > It's a surprising approach. Future callers of pqPacketSend() and > pqPutMsgEnd() would easily miss that this flag should be set, as much > as reset. Isn't that something that should be added in input of these > functions? Yeah, I'm not entirely happy about it either. But adding an argument to pqPutMsgEnd and pqPutPacketSend would mean all the existing call sites would need to change, even though only 4 of them would care about the new argument. You could argue that it's the better solution, but it would at least greatly increase the size of the diff. Of course to reduce the diff size you could make the old functions a wrapper around a new one with the extra argument, but I couldn't think of a good name for those functions. Overall I went for the chosen approach here, because it only impacted code at the call sites for these auth packets (which are the only v3 packets in the protocol that you cannot interpret based on their contents alone). I think your worry about easily missing to set/clear the flag is not a huge problem in practice. We almost never add new authentication messages and it's only needed there. Also the clearing is not even strictly necessary for the tracing to behave correctly, but it seemed like the right thing to do. > + AuthResponseType current_auth_response; > I'd recommend to document what this flag is here for, with a comment. Oops, yeah I forgot about that. Done now.
Attachment
- v2-0002-libpq-Add-suppress-argument-to-pqTraceOutputNchar.patch
- v2-0003-libpq-Trace-StartupMessage-SSLRequest-GSSENCReque.patch
- v2-0004-libpq-Trace-frontend-authentication-challenges.patch
- v2-0005-libpq-Trace-responses-to-SSLRequest-and-GSSENCReq.patch
- v2-0006-libpq-Trace-all-messages-received-from-the-server.patch
- v2-0007-libpq-Trace-server-Authentication-messages-in-det.patch
- v2-0008-libpq-Trace-NegotiateProtocolVersion-correctly.patch
Pushed 0002 and 0003. On the latter: I decided against using int32 to print the request identifiers; by splitting into two int16's, we see that the numbers match the values in the PG_PROTOCOL() declarations: 2024-08-09 17:37:38.364622 F 8 SSLRequest 1234 5679 and 2024-08-09 17:37:38.422109 F 16 CancelRequest 1234 5678 NNNN NNNN (I didn't verify GSSEncRequest directly.) I also verified that in non-regress mode, the values printed by CancelRequest match those in the BackendKeyData message, 2024-08-09 17:34:27.544686 B 12 BackendKeyData NNNN NNNN I also added suppression in regress mode for the backend key in the CancelRequest message, since they would be different each time. There are no tests for this code. We could add a trace file for the connection packet in libpq_pipeline by changing PQconnectdb() to PQconnectStart() and then do PQtrace before polling until the connection is ready; we would have to have it match for the TAP test. Not sure this is worth the effort. But doing this in a very crude way allowed me to verify that, at least on my machine, this code is doing what's expected. Thank you, -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Regarding 0004: I don't want to add 4 bytes to struct pg_conn for tracing support. I'm tempted to make the new struct member a plain 'char' to reduce overhead for a feature that almost nobody is going to use. According to pahole we have a 3 bytes hole in that position of the struct, so if we make it a 1- or 2-byte member, there's no storage overhead whatsoever. Also, why not have pqTraceOutputMessage() responsible for resetting the byte after printing the message? It seems to cause less undesirable detritus. I propose something like the attached, but it's as yet untested. What do you think? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El sentido de las cosas no viene de las cosas, sino de las inteligencias que las aplican a sus problemas diarios en busca del progreso." (Ernesto Hernández-Novich)
Attachment
On Sat, 10 Aug 2024 at 01:08, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I don't want to add 4 bytes to struct pg_conn for tracing support. I'm > tempted to make the new struct member a plain 'char' to reduce overhead > for a feature that almost nobody is going to use. According to pahole > we have a 3 bytes hole in that position of the struct, so if we make it > a 1- or 2-byte member, there's no storage overhead whatsoever. Sounds fine to me. > Also, why not have pqTraceOutputMessage() responsible for resetting the > byte after printing the message? It seems to cause less undesirable > detritus. Yeah, that's indeed much nicer. > I propose something like the attached, but it's as yet untested. What > do you think? Looks good, but I haven't tested it yet either.
On 2024-Aug-10, Jelte Fennema-Nio wrote: > On Sat, 10 Aug 2024 at 01:08, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I propose something like the attached, but it's as yet untested. What > > do you think? > > Looks good, but I haven't tested it yet either. I tested the SASL exchange and it looks OK. Didn't test the other ones. Thanks! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Wed, 14 Aug 2024 at 19:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > - to 0005 I change your pqTraceOutputEncryptionRequestResponse() > function name to pqTraceOutputCharResponse and instead of attaching > the "Response" literal in the outpuer to the name given in the > function call, just pass the whole string as argument to the function. Fine by me > - to 0006 I change function name pqFinishParsingMessage() to > pqParseDone() and reworded the commentary; also moved it to fe-misc.c. > Looks good otherwise. The following removed comments seems useful to keep (I realize I already removed them in a previous version of the patch, but I don't think I did that on purpose) - /* Drop the processed message and loop around for another */ - /* consume the message and exit */ - /* Completed this message, keep going */ - /* trust the specified message length as what to skip */ > - 0008 to fix NegotiateProtocolVersion looks correct per [1], but I > don't know how to test it. Suggestions? Two options: 1. Manually change code to make sure SendNegotiateProtocolVersion is called in src/backend/tcop/backend_startup.c 2. Apply my patches from this thread[2] and use max_protocol_version=latest in the connection string while connecting to an older postgres server. [2]: https://www.postgresql.org/message-id/flat/CAGECzQTyXDNtMXdq2L-Wp%3DOvOCPa07r6%2BU_MGb%3D%3Dh90MrfT%2BfQ%40mail.gmail.com#1b8cda3523555aafae89cc04293b8613
Hello, On 2024-Aug-14, Jelte Fennema-Nio wrote: > The following removed comments seems useful to keep (I realize I > already removed them in a previous version of the patch, but I don't > think I did that on purpose) > [...] Ah, yeah, I agree. I put them back, and pushed 0005, 6 and 7 as a single commit. It didn't seem worth pushing each separately, really. I added two lines for the CopyData message as well, since otherwise the output shows the "mismatched length" error when getting COPY data. I'm leaving 0008 to whoever is doing the NegotiateProtocolVersion stuff; maybe post that one in that thread you mentioned. I'll mark this CF entry committed. Many thanks! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/