Thread: Normalize queries starting with SET for pg_stat_statements
I saw a database recently where some app was inserting the source port into the application_name field, which meant that pg_stat_statements.max was quickly reached and queries were simply pouring in and out of pg_stat_statements, dominated by some "SET application_name = 'myapp 10.0.0.1:1234'" calls. Which got me thinking, is there really any value to having non-normalized 'SET application_name' queries inside of pg_stat_statements? Or any SET stuff, for that matter?
Attached please find a small proof-of-concept for normalizing/de-jumbling certain SET queries. Because we only want to cover the VAR_SET_VALUE parts of VariableSetStmt, a custom jumble func was needed. There are a lot of funky SET things inside of gram.y as well that don't do the standard SET X = Y formula (e.g. SET TIME ZONE, SET SCHEMA). I tried to handle those as best I could, and carved a couple of exceptions for time zones and xml.
I'm not sure where else to possibly draw lines. Obviously calls to time zone have a small and finite pool of possible values, so easy enough to exclude them, while things like application_name and work_mem are fairly infinite, so great candidates for normalizing. One could argue for simply normalizing everything, as SET is trivially fast for purposes of performance tracking via pg_stat_statements, so who cares if we don't have the exact string? That's what regular logging is for, after all. Most importantly, less unique queryids means less chance that errant SETs will crowd out the more important stuff.
In summary, we want to change this:
SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
1 | set application_name = 'alice'
1 | set application_name = 'bob'
1 | set application_name = 'eve'
1 | set application_name = 'mallory'
to this:
SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1;
4 | set application_name = $1
I haven't updated the regression tests yet, until we reach a consensus on how thorough the normalizing should be. But there is a new test to exercise the changes in gram.y.
Cheers,
Greg
Attachment
On Mon, Jul 22, 2024 at 03:23:50PM -0400, Greg Sabino Mullane wrote: > I saw a database recently where some app was inserting the source port into > the application_name field, which meant that pg_stat_statements.max was > quickly reached and queries were simply pouring in and out of > pg_stat_statements, dominated by some "SET application_name = 'myapp > 10.0.0.1:1234'" calls. Which got me thinking, is there really any value to > having non-normalized 'SET application_name' queries inside of > pg_stat_statements? Or any SET stuff, for that matter? Thanks for beginning this discussion. This has been mentioned in the past, like here, but I never came back to it: https://www.postgresql.org/message-id/B44FA29D-EBD0-4DD9-ABC2-16F1CB087074@amazon.com I've seen complaints about that in the field myself, and like any other specific workloads, the bloat this stuff creates can be really annoying when utilities are tracked. So yes, I really want more stuff to happen here. > Attached please find a small proof-of-concept for normalizing/de-jumbling > certain SET queries. Because we only want to cover the VAR_SET_VALUE parts > of VariableSetStmt, a custom jumble func was needed. There are a lot of > funky SET things inside of gram.y as well that don't do the standard SET X > = Y formula (e.g. SET TIME ZONE, SET SCHEMA). I tried to handle those as > best I could, and carved a couple of exceptions for time zones and xml. Agreed about the use of a custom jumble function. A huge issue that I have with this parsing node is how much we want to control back in the monitoring reports while not making the changes too invasive in the structure of VariableSetStmt. The balance between invasiveness and level of normalization was the tricky part for me. > I'm not sure where else to possibly draw lines. Obviously calls to time > zone have a small and finite pool of possible values, so easy enough to > exclude them, while things like application_name and work_mem are fairly > infinite, so great candidates for normalizing. One could argue for simply > normalizing everything, as SET is trivially fast for purposes of > performance tracking via pg_stat_statements, so who cares if we don't have > the exact string? That's what regular logging is for, after all. Most > importantly, less unique queryids means less chance that errant SETs will > crowd out the more important stuff. > In summary, we want to change this: > > SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1; > 1 | set application_name = 'alice' > 1 | set application_name = 'bob' > 1 | set application_name = 'eve' > 1 | set application_name = 'mallory' > > to this: > > SELECT calls, query from pg_stat_statements where query ~ 'set' order by 1; > 4 | set application_name = $1 Yep. That makes sense to me. We should keep the parameter name, hide the value. CallStmt does that. > I haven't updated the regression tests yet, until we reach a consensus on > how thorough the normalizing should be. But there is a new test to exercise > the changes in gram.y. It would be nice to maximize the contents of the tests at SQL level. The number of patterns to track makes the debuggability much harder to track correctly in TAP as we may rely on more complex quals in pg_stat_statements or even other catalogs. + if (expr->kind != VAR_SET_VALUE || + strcmp(expr->name, "timezone") == 0 + || strcmp(expr->name, "xmloption") == 0) + JUMBLE_NODE(args); We should do this kind of tracking with a new flag in the structure itself, not in the custom function. The point would be to have folks introducing new hardcoded names in VariableSetStmt in the parser think about what they want to do, not second-guess it by tweaking the custom jumbling function. Relying on the location would not be enough as we need to cover document_or_content for xmloption or the default keyword for TIME ZONE. Let's just use the same trick as DeallocateStmt.isall, with a new field to differentiate all these cases :) There are some funky cases you are not mentioning, though, like SET in a CREATE FUNCTION: CREATE OR REPLACE FUNCTION foo_function(data1 text) RETURNS text AS $$ DECLARE res text; BEGIN SELECT data1::text INTO res; RETURN res; END; $$ LANGUAGE plpgsql IMMUTABLE SET search_path = 'pg_class,public'; Your patch silences the SET value, but perhaps we should not do that for this case. I am not against normalizing that, actually, I am in favor of it, because it makes the implementation much easier and the FunctionParameter List of parameters is jumbled with CreateFunctionStmt. All that requires test coverage. It's nice to see that you are able to keep SET TRANSACTION at their current level with the location trick. You have issues with RESET SESSION AUTHORIZATION. This one is easy: update the location field to -1 in reset_rest for all the subcommands. The coverage of the regression tests in pg_stat_statements is mostly right. I may be missing something, but all the SQL queries you have in your 020_jumbles.pl would work fine with SQL tests, and some like `SET param = value` are already covered. -- Michael
Attachment
Now that I've spent some time away from this, I'm reconsidering why we are going through all the trouble of semi-jumbling SET statements. Maybe we just keep it simple and everything becomes "SET myvar = $1" or even "SET myvar" full stop? I'm having a hard time finding a real-world situation in which we need to distinguish different SET/RESET items within pg_stat_statements.
Cheers,
Greg