Redacting information from logs - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Redacting information from logs |
Date | |
Msg-id | f859d7d263c263d12b638c67ede65d937640740f.camel@j-davis.com |
List | pgsql-hackers |
Logs are important to diagnose problems or monitor operations, but logs can contain sensitive information which is often unnecessary for these purposes. Redacting the sensitive information would enable easier access and simpler integration with analysis tools without compromising the sensitive information.

The challenge is that nobody wants to classify all of the log messages; and even if someone did that today, there would be never-ending work in the future to try to maintain that classification.

My proposal is:

 * redact every '%s' in an ereport by having a special mode for snprintf.c (this is possible because we now own snprintf)
 * generate both redacted and unredacted messages (if redaction is enabled)
 * choose which destinations (stderr, eventlog, syslog, csvlog) get redacted or plain messages
 * emit_log_hook always has both redacted and plain messages available
 * allow specifying a custom redaction function, e.g. a function that hashes the string rather than completely redacting it

I think '%s' in a log message is a pretty close match to the kind of information that might be sensitive. All data goes through type output functions (e.g. the conflicting datum for a unique constraint violation message), and most other things that a user might type would go through %s. A lot of other information useful in logs, like LSNs, %m's, PIDs, etc. would be preserved.

All object names would be redacted, but that's not as bad as it sounds:

 (a) You can specify a custom redaction function that hashes rather than completely redacts. That allows you to see if different messages refer to the same object, and also map back to suspected objects if you really need to.
 (b) The unredacted object names are still a part of ErrorData, so you can do something interesting with emit_log_hook.
 (c) You still might have the unredacted logs in a more protected place, and can access them when you really need to.
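
To make the idea concrete, here is a minimal standalone sketch of redacting only the '%s' arguments of a format string while leaving other conversions intact. It is not the attached patch and does not use PostgreSQL's snprintf.c; the names redact_fn, redact_full, redact_hash and format_message are hypothetical, and the toy formatter understands only %s, %d and %%.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: writes the replacement for one %s argument. */
typedef void (*redact_fn) (const char *arg, char *out, size_t outlen);

/* Replace the argument entirely. */
static void
redact_full(const char *arg, char *out, size_t outlen)
{
    (void) arg;
    snprintf(out, outlen, "<redacted>");
}

/* Replace the argument with a 32-bit FNV-1a hash (not cryptographic). */
static void
redact_hash(const char *arg, char *out, size_t outlen)
{
    uint32_t    h = 2166136261u;

    for (const unsigned char *p = (const unsigned char *) arg; *p; p++)
    {
        h ^= *p;
        h *= 16777619u;
    }
    snprintf(out, outlen, "<%08x>", (unsigned int) h);
}

/*
 * Toy formatter (an assumption for illustration): handles %s, %d and %%.
 * Every %s argument goes through "redact" when it is non-NULL; %d is
 * always printed as-is. Each expanded piece is limited to 127 bytes.
 */
static void
format_message(char *buf, size_t buflen, redact_fn redact, const char *fmt, ...)
{
    va_list     ap;
    size_t      used = 0;

    assert(buflen > 0);
    buf[0] = '\0';
    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; p++)
    {
        char        piece[128];
        size_t      n;

        if (*p != '%')
            snprintf(piece, sizeof(piece), "%c", *p);
        else if (*++p == 's')
        {
            const char *arg = va_arg(ap, const char *);

            if (redact != NULL)
                redact(arg, piece, sizeof(piece));
            else
                snprintf(piece, sizeof(piece), "%s", arg);
        }
        else if (*p == 'd')
            snprintf(piece, sizeof(piece), "%d", va_arg(ap, int));
        else if (*p == '%')
            snprintf(piece, sizeof(piece), "%%");
        else
            break;              /* unsupported conversion or trailing '%' */

        n = strlen(piece);
        if (n >= buflen - used)
            break;              /* no room left; keep what fits so far */
        memcpy(buf + used, piece, n + 1);
        used += n;
    }
    va_end(ap);
}
```

Calling format_message twice with the same format and arguments, once with a NULL callback and once with redact_full or redact_hash, yields the plain and redacted variants of one message. With redact_hash, the same object name always maps to the same token, which is what makes it possible to correlate messages without exposing the name.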

A weakness of this proposal is that it could be confusing to use ereport() in combination with snprintf(). If using snprintf to build the format string, nothing would be redacted, so you'd have to be careful not to expand any %s that might be sensitive. If using snprintf to build up an argument, the entire argument would be redacted. The first case should not be common, because good coding generally avoids non-constant format strings. The second case is just over-redaction, which is not necessarily bad.

One annoying case would be if some of the arguments to ereport() are used for things like the right number of commas or tabs -- redacting those would just make the message look horrible. I didn't find such cases but I'm pretty sure they exist. Another annoying case is time, which is useful for debugging, but formatted with %s so it gets redacted (I did find plenty of these cases).

But I don't see a better solution. Right now, it's a pain to treat log files as sensitive things when there are so many ways they can help with smooth operations and so many tools available to analyze them. This proposal seems like a practical solution to enable better use of log files while protecting potentially-sensitive information.

Attached is a WIP patch.

Regards,
	Jeff Davis