Re: Redacting information from logs - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Redacting information from logs
Date
Msg-id 20190803213446.4afdfbaewz2cijl6@development
In response to Redacting information from logs  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Tue, Jul 30, 2019 at 11:54:55AM -0700, Jeff Davis wrote:
>Logs are important to diagnose problems or monitor operations, but logs
>can contain sensitive information which is often unnecessary for these
>purposes. Redacting the sensitive information would enable easier
>access and simpler integration with analysis tools without compromising
>the sensitive information.
>

OK, that's a worthwhile goal. I assume by "sensitive data" you mean user
data, right?

>The challenge is that nobody wants to classify all of the log messages;
>and even if someone did that today, there would be never-ending work in
>the future to try to maintain that classification.
>
>My proposal is:
>
> * redact every '%s' in an ereport by having a special mode for
>snprintf.c (this is possible because we now own snprintf)
> * generate both redacted and unredacted messages (if redaction is
>enabled)
> * choose which destinations (stderr, eventlog, syslog, csvlog) get
>redacted or plain messages
> * emit_log_hook always has both redacted and plain messages available
> * allow specifying a custom redaction function, e.g. a function that
>hashes the string rather than completely redacting it
>
>I think '%s' in a log message is a pretty close match to the kind of
>information that might be sensitive. All data goes through type output
>functions (e.g. the conflicting datum for a unique constraint violation
>message), and most other things that a user might type would go through
>%s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
>etc. would be preserved.
>

IMHO the crucial part here is 'might be sensitive'. How often is that
actually true? My guess is 99% of places using %s are not sensitive at
all, and are used for things like filenames, table/attribute names,
and so on. And redacting those parts will make the logs essentially
useless, because we'll get things like this:

    ERROR:  column "******" does not exist at character 10

    ERROR:  division by zero
    CONTEXT:  SQL function "******" during inlining

I'm not sure those are the logs I'd like to see on a production system
while investigating an issue.
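
To make the effect concrete, here is a minimal sketch of "redact every %s".
It is not the snprintf.c change from the proposal, just a toy formatter that
understands only %%, %s, %d and %u; the names format_message and mask are
made up for illustration.

```python
import re

# Simplified printf conversions only: %%, %s, %d, %u (no flags or widths).
# This is an assumption for the sketch; real snprintf handles far more.
_CONVERSION = re.compile(r"%([%sdu])")


def format_message(fmt, args, redact=None):
    """Expand fmt with args; if redact is given, apply it to every %s argument."""
    remaining = iter(args)

    def expand(match):
        kind = match.group(1)
        if kind == "%":
            return "%"
        value = next(remaining)
        if kind == "s":
            text = str(value)
            return redact(text) if redact else text
        # Numeric conversions are never redacted.
        return str(int(value))

    return _CONVERSION.sub(expand, fmt)


def mask(_text):
    """Hypothetical redaction function: hide the string completely."""
    return "******"


fmt = 'column "%s" does not exist at character %d'
plain = format_message(fmt, ["emial", 10])
redacted = format_message(fmt, ["emial", 10], redact=mask)
```

The numeric position survives, but the column name, which is the part you
need in order to spot the typo, does not.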

>All object names would be redacted, but that's not as bad as it sounds:
>  (a) You can specify a custom redaction function that hashes rather
>than completely redacts. That allows you to see if different messages
>refer to the same object, and also map back to suspected objects if you
>really need to.
>  (b) The unredacted object names are still a part of ErrorData, so you
>can do something interesting with emit_log_hook.

Isn't hashing essentially an information leak, i.e. somewhat undesirable
for sensitive data?
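
A rough sketch of what I mean, assuming the custom redaction function is a
plain unsalted hash (the proposal does not say which function would be used;
hash_redact and recover are hypothetical names). Anyone who can guess
candidate values can confirm them against the "redacted" log:

```python
import hashlib


def hash_redact(text):
    """Hypothetical redaction function: unsalted, truncated SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def recover(hashed, candidates):
    """Return the candidate whose hash matches the logged value, if any."""
    for candidate in candidates:
        if hash_redact(candidate) == hashed:
            return candidate
    return None


# Value as it would appear in a redacted log line.
logged = hash_redact("salaries")

# Someone with a list of plausible object names can map it back.
guess = recover(logged, ["users", "orders", "salaries"])
```

That is convenient for object names, but the same lookup works for any
low-entropy user data that ends up in a %s.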

>  (c) You still might have the unredacted logs in a more protected
>place, and can access them when you really need to.
>

The question is whether that's actually an acceptable solution for
deployments that do handle sensitive data ...

>A weakness of this proposal is that it could be confusing to use
>ereport() in combination with snprintf(). If using snprintf to build
>the format string, nothing would be redacted, so you'd have to be
>careful not to expand any %s that might be sensitive. If using snprintf
>to build up an argument, the entire argument would be redacted. The
>first case should not be common, because good coding generally avoids
>non-constant format strings. The second case is just over-redaction,
>which is not necessarily bad.
>
>One annoying case would be if some of the arguments to ereport() are
>used for things like the right number of commas or tabs -- redacting
>those would just make the message look horrible. I didn't find such
>cases but I'm pretty sure they exist. Another annoying case is time,
>which is useful for debugging, but formatted with %s so it gets
>redacted (I did find plenty of these cases).
>
>But I don't see a better solution. Right now, it's a pain to treat log
>files as sensitive things when there are so many ways they can help
>with smooth operations and so many tools available to analyze them.
>This proposal seems like a practical solution to enable better use of
>log files while protecting potentially-sensitive information.
>

Hmm. I wonder how difficult it would be to actually go through the
ereport calls and classify those that can leak sensitive data, and then
do redaction only for those. That's about the only alternative approach
I can think of.
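
Very roughly, and only as an illustration of the idea (nothing like this
exists in ereport today; Sensitive and render are made-up names), the call
site would mark the arguments that carry user data and only those would be
redacted:

```python
class Sensitive:
    """Hypothetical marker for an argument that may carry user data."""

    def __init__(self, value):
        self.value = value


def render(fmt, *args, redact=False):
    """Format a message, masking only the arguments wrapped in Sensitive."""
    shown = []
    for arg in args:
        if isinstance(arg, Sensitive):
            shown.append("******" if redact else str(arg.value))
        else:
            shown.append(arg)
    return fmt % tuple(shown)


# Object names are left alone, so this message stays useful when redacted.
missing = render('column "%s" does not exist at character %d',
                 "emial", 10, redact=True)

# The conflicting datum is user data, so the call site marks it.
conflict = render("Key (%s)=(%s) already exists.",
                  "email", Sensitive("alice@example.com"), redact=True)
```

The cost is the classification work itself, which is exactly the part
nobody wants to do.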


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


