Redacting information from logs - Mailing list pgsql-hackers

From Jeff Davis
Subject Redacting information from logs
Date
Msg-id f859d7d263c263d12b638c67ede65d937640740f.camel@j-davis.com
List pgsql-hackers
Logs are important for diagnosing problems and monitoring operations,
but they can contain sensitive information that is often unnecessary
for those purposes. Redacting the sensitive information would allow
easier access to logs and simpler integration with analysis tools
without exposing that information.

The challenge is that nobody wants to classify all of the log messages;
and even if someone did that today, there would be never-ending work in
the future to try to maintain that classification.

My proposal is:

 * redact every '%s' in an ereport by having a special mode for
   snprintf.c (this is possible because we now own snprintf)
 * generate both redacted and unredacted messages (if redaction is
   enabled)
 * choose which destinations (stderr, eventlog, syslog, csvlog) get
   redacted or plain messages
 * emit_log_hook always has both redacted and plain messages available
 * allow specifying a custom redaction function, e.g. a function that
   hashes the string rather than completely redacting it
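To illustrate the "both messages, chosen per destination" idea, here is
a minimal standalone sketch. It is not taken from the patch: the
ToyLogEntry struct, the DEST_* flags, and the redact_mask setting are
all hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical destination flags, one bit per log destination. */
enum
{
    DEST_STDERR   = 1 << 0,
    DEST_SYSLOG   = 1 << 1,
    DEST_EVENTLOG = 1 << 2,
    DEST_CSVLOG   = 1 << 3
};

/* Hypothetical log entry carrying both renderings of one message. */
typedef struct ToyLogEntry
{
    const char *plain;      /* always present */
    const char *redacted;   /* NULL when redaction is disabled */
} ToyLogEntry;

/*
 * Pick the text a destination should receive.  redact_mask is an assumed
 * setting: the set of destinations that must only see redacted text.
 */
static const char *
message_for(const ToyLogEntry *entry, int dest, int redact_mask)
{
    assert(entry != NULL && entry->plain != NULL);

    if ((redact_mask & dest) != 0 && entry->redacted != NULL)
        return entry->redacted;
    return entry->plain;
}
```

With redact_mask set to DEST_SYSLOG | DEST_CSVLOG, stderr would keep the
plain text while syslog and csvlog receive the redacted one. A hook
would simply be handed the whole entry.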

I think '%s' in a log message is a pretty close match to the kind of
information that might be sensitive. All data goes through type output
functions (e.g. the conflicting datum for a unique constraint violation
message), and most other things that a user might type would go through
%s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
etc. would be preserved.
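A rough standalone sketch of the mechanism, for illustration only: a toy
formatter that understands just %s, %d and %%, and passes every %s
argument through a redaction callback while leaving %d alone. This is
not the real snprintf.c and not the WIP patch; toy_vformat,
toy_format_both, redact_fn and redact_full are made-up names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: write a replacement for `value` into `out`. */
typedef void (*redact_fn) (const char *value, char *out, size_t outlen);

/* Simplest policy: replace the whole string with a fixed marker. */
static void
redact_full(const char *value, char *out, size_t outlen)
{
    (void) value;
    snprintf(out, outlen, "<redacted>");
}

/* Toy formatter: %s goes through `redact` when it is non-NULL. */
static void
toy_vformat(char *buf, size_t cap, redact_fn redact,
            const char *fmt, va_list ap)
{
    size_t      len = 0;

    assert(cap > 0);
    buf[0] = '\0';
    for (const char *p = fmt; *p != '\0'; p++)
    {
        char        piece[128];
        size_t      n;

        if (*p != '%')
            snprintf(piece, sizeof piece, "%c", *p);
        else if (*++p == 's')
        {
            const char *arg = va_arg(ap, const char *);

            if (redact != NULL)
                redact(arg, piece, sizeof piece);
            else
                snprintf(piece, sizeof piece, "%s", arg);
        }
        else if (*p == 'd')
            snprintf(piece, sizeof piece, "%d", va_arg(ap, int));
        else if (*p == '\0')
            break;              /* stray '%' at end of format */
        else
            snprintf(piece, sizeof piece, "%c", *p);    /* "%%" -> '%' */

        /* Append the piece, truncating if the buffer is full. */
        n = strlen(piece);
        if (n > cap - 1 - len)
            n = cap - 1 - len;
        memcpy(buf + len, piece, n);
        len += n;
        buf[len] = '\0';
    }
}

/* Render the same call twice: once plain, once redacted. */
static void
toy_format_both(char *plain, char *redacted, size_t cap,
                redact_fn redact, const char *fmt,...)
{
    va_list     ap;

    va_start(ap, fmt);
    toy_vformat(plain, cap, NULL, fmt, ap);
    va_end(ap);

    va_start(ap, fmt);
    toy_vformat(redacted, cap, redact, fmt, ap);
    va_end(ap);
}
```

Calling toy_format_both() with redact_full, the format
"relation \"%s\" has %d pages", and the arguments "accounts" and 42
leaves the page count intact in both outputs and replaces only the
relation name in the redacted one.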

All object names would be redacted, but that's not as bad as it sounds:
  (a) You can specify a custom redaction function that hashes rather
      than completely redacts. That allows you to see if different
      messages refer to the same object, and also map back to suspected
      objects if you really need to.
  (b) The unredacted object names are still a part of ErrorData, so you
      can do something interesting with emit_log_hook.
  (c) You still might have the unredacted logs in a more protected
      place, and can access them when you really need to.
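As a sketch of point (a), a hashing redaction function might look
something like the following. The choice of 32-bit FNV-1a is only for
brevity and is my assumption, not something the patch prescribes; a
real deployment would probably want a keyed hash. redact_hash and
matches_candidate are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 32-bit FNV-1a, used here only because it fits in a few lines. */
static uint32_t
fnv1a32(const char *s)
{
    uint32_t    h = 2166136261u;    /* FNV offset basis */

    for (; *s != '\0'; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Redaction callback: emit a stable token instead of the value. */
static void
redact_hash(const char *value, char *out, size_t outlen)
{
    assert(outlen > 0);
    snprintf(out, outlen, "<%08x>", (unsigned int) fnv1a32(value));
}

/*
 * "Map back to suspected objects": hash a candidate name and compare it
 * with a token seen in the redacted log.
 */
static int
matches_candidate(const char *token, const char *candidate)
{
    char        buf[16];

    redact_hash(candidate, buf, sizeof buf);
    return strcmp(token, buf) == 0;
}
```

Two messages about the same object carry the same token, so they can
still be correlated, and someone who already suspects a particular name
can confirm it with matches_candidate(). The flip side is that an
unkeyed hash lets anyone with the logs test guesses the same way.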

A weakness of this proposal is that it could be confusing to use
ereport() in combination with snprintf(). If using snprintf to build
the format string, nothing would be redacted, so you'd have to be
careful not to expand any %s that might be sensitive. If using snprintf
to build up an argument, the entire argument would be redacted. The
first case should not be common, because good coding generally avoids
non-constant format strings. The second case is just over-redaction,
which is not necessarily bad.
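The two cases can be shown with a toy stand-in for a redacting
ereport(). Everything here is hypothetical: toy_log only understands %s
and always replaces it with a fixed marker, and the two case_* helpers
exist only to demonstrate the pitfalls.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Toy redacting logger: every %s argument becomes "<redacted>". */
static void
toy_log(char *out, size_t cap, const char *fmt,...)
{
    va_list     ap;
    size_t      len = 0;

    assert(cap > 0);
    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0' && len + 1 < cap; p++)
    {
        if (p[0] == '%' && p[1] == 's')
        {
            const char *mark = "<redacted>";
            size_t      n = strlen(mark);

            (void) va_arg(ap, const char *);    /* consume, never print */
            if (n > cap - 1 - len)
                n = cap - 1 - len;
            memcpy(out + len, mark, n);
            len += n;
            p++;                /* skip the 's' */
        }
        else
            out[len++] = *p;
    }
    out[len] = '\0';
    va_end(ap);
}

/* Case 1: value expanded into the format string, so it is not redacted. */
static void
case_format_built(char *out, size_t cap, const char *path)
{
    char        fmt[128];

    snprintf(fmt, sizeof fmt, "could not open file \"%s\"", path);
    toy_log(out, cap, fmt);     /* no %s left for the logger to see */
}

/* Case 2: argument pre-built with snprintf, so all of it is redacted. */
static void
case_argument_built(char *out, size_t cap, const char *name, int count)
{
    char        arg[128];

    snprintf(arg, sizeof arg, "%d rows in \"%s\"", count, name);
    toy_log(out, cap, "skipped %s", arg);
}
```

In the first helper the path reaches the output untouched, because by
the time the logger runs it is just literal text in the format. In the
second, the row count disappears along with the name, which is the
over-redaction described above.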

One annoying case would be arguments to ereport() that only shape the
message, such as the right number of commas or tabs; redacting those
would just make the message look horrible. I didn't find such cases,
but I'm pretty sure they exist. Another annoying case is time, which is
useful for debugging but is formatted with %s, so it gets redacted (I
did find plenty of these cases).

But I don't see a better solution. Right now, it's a pain to treat log
files as sensitive things when there are so many ways they can help
with smooth operations and so many tools available to analyze them.
This proposal seems like a practical solution to enable better use of
log files while protecting potentially-sensitive information.

Attached is a WIP patch.

Regards,
    Jeff Davis
