Re: Logging parallel worker draught - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Logging parallel worker draught
Msg-id: 0d08b978-10cb-4512-bcef-ca1f192049ab@enterprisedb.com
In response to: Re: Logging parallel worker draught ("Imseih (AWS), Sami" <simseih@amazon.com>)
List: pgsql-hackers
Hi,

I see the thread went a bit quiet, but I think it'd be very useful (and desirable) to have this information in the log. So let me share my thoughts about the patch and how it should work.

The patch is pretty straightforward; I don't have any comments about the code as is. Obviously, some of the following comments might require changes to the patch, etc.

I see there was discussion about logging vs. adding this to the pgstats system. I agree with Alvaro it should not be one or the other - these are complementary approaches, used by different tools. The logging is needed for tools like pgbadger, for example. I do see a lot of value in adding this to the statistics collector, and to things like pg_stat_statements, but that's being discussed in a separate thread, so I'll comment on that there.

As for whether to have a GUC for this, I certainly think we should have one. We have GUCs for logging stuff that we generally expect to happen, like lock waits, temp files, etc. True, there are similar cases that we just log every time, like when we can't fork a process ("could not fork autovacuum launcher process"), but I'd say those are unexpected to happen in general and seem more like an error in the environment, while we may exhaust parallel workers pretty easily and often, especially on busy systems.

There are a couple of things I'm not quite sure about:

1) name of the GUC

I find "log_parallel_worker_draught" rather unclear :-( Maybe it's just me and everyone else immediately understands what this does / what will happen after it's set to "on", but I find it rather non-intuitive.

2) logging just the failures provides an incomplete view

As a DBA trying to evaluate whether I need to bump up the number of workers, I'd be interested in what fraction of parallel workers fails to start. If it's 1%, that's probably a short spike and I don't need to do anything. If it's 50%, well, that might have an unpredictable impact on user queries, and is thus something I may need to look into.
But if we only log the failures, that's not possible. I may be able to approximate this somehow by correlating it with the query/transaction rate, or something, but ideally I'd like to know how many parallel workers we planned to use, and how many actually started.

Obviously, logging this for every Gather [Merge] node, even when all the workers start, may be a lot of data. Perhaps the GUC should not be on/off, but something like

  log_parallel_workers = {none | failures | all}

where "failures" only logs when at least one worker fails to start, and "all" logs everything. AFAIK Sami made the same observation/argument in his last message [1].

3) node-level or query-level?

There's a brief discussion about whether this should be a node-level or query-level thing in [2]:

> I wonder if it will be better to accumulate the total # of workers
> planned and # of workers launched and logging this information at the
> end of execution?

And there's a reference to log_temp_files, but it's not clear to me if that's meant to be an argument for doing it the same way (node-level). I think we should not do node-level logging just because that's what log_temp_files=on does. I personally think log_temp_files was implemented like this mostly because it was simple. There's no value in having information about every individual temp file, because we don't even know which node produced it. It'd be perfectly sufficient to log some summary statistics (number of files, total space, ...), either for the query as a whole, or perhaps per node (because that's what matters for work_mem). So I don't think we should mimic this just because log_temp_files does it.

I don't know how difficult it would be to track/collect this information for the whole query.
regards

[1] https://www.postgresql.org/message-id/D04977E3-9F54-452C-A4C4-CDA67F392BD1%40amazon.com
[2] https://www.postgresql.org/message-id/11e34b80-b0a6-e2e4-1606-1f5077379a34%40dalibo.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company