Thread: log_collector doesn't respond to reloads
Summary: despite pg_reload(), log directory, filename and destination don't change Version: 9.1.3 Platform: RHEL5, custom compile Severity: Persistant annoyance Description: 1) change log_directory in postgresql.conf 2) pg_reload_conf() 3) show log_directory displays the correct new directory 4) logging continues to the old directory Given the custom compile, I'd chalk this up to wonky platform issues, except that I've seen this problem on other systems, intermittently. This particular system has the issue in 100% reproduceable fashion, so I'd like suggestions on how to get to the bottom of it. I also tried a sighup to the logger process, with no effect. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > Summary: despite pg_reload(), log directory, filename and destination > don't change Looking at the code, it's really hard to see how this could possibly happen, unless maybe the process is blocking receipt of SIGHUP. Which it shouldn't be. Not sure about RHEL5, but on recent Linuxen you can check the process's signal masks like this: grep ^Sig /proc/NNNN/status where NNNN is the logging collector's PID. Could we see that? regards, tom lane
On 4/26/12 5:50 PM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> Summary: despite pg_reload(), log directory, filename and destination >> don't change > > Looking at the code, it's really hard to see how this could possibly > happen, unless maybe the process is blocking receipt of SIGHUP. Which > it shouldn't be. Not sure about RHEL5, but on recent Linuxen you can > check the process's signal masks like this: > > grep ^Sig /proc/NNNN/status > > where NNNN is the logging collector's PID. Could we see that? SigQ: 0/399360 SigPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000001007806 SigCgt: 0000000000000201 -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > On 4/26/12 5:50 PM, Tom Lane wrote: >> check the process's signal masks like this: >> grep ^Sig /proc/NNNN/status >> where NNNN is the logging collector's PID. Could we see that? > SigQ: 0/399360 > SigPnd: 0000000000000000 > SigBlk: 0000000000000000 > SigIgn: 0000000001007806 > SigCgt: 0000000000000201 Hm, I'm not sure what SigQ is, but the other values match what I see for a logging collector process on my Fedora 16 box ... and that process responds to SIGHUP just fine. (I believe the rightmost bit of these masks corresponds to SIGHUP, so this is saying the process is catching SIGHUP and not currently blocking it, which is what it should be.) Do you want to try attaching to the collector with a debugger and seeing if it ever gets into the "if (got_SIGHUP)" block in SysLoggerMain? regards, tom lane
On 27/04/12 13:11, Josh Berkus wrote: > On 4/26/12 5:50 PM, Tom Lane wrote: >> Josh Berkus<josh@agliodbs.com> writes: >>> Summary: despite pg_reload(), log directory, filename and destination >>> don't change >> Looking at the code, it's really hard to see how this could possibly >> happen, unless maybe the process is blocking receipt of SIGHUP. Which >> it shouldn't be. Not sure about RHEL5, but on recent Linuxen you can >> check the process's signal masks like this: >> >> grep ^Sig /proc/NNNN/status >> >> where NNNN is the logging collector's PID. Could we see that? > SigQ: 0/399360 > SigPnd: 0000000000000000 > SigBlk: 0000000000000000 > SigIgn: 0000000001007806 > SigCgt: 0000000000000201 > Might be a red herring, but I was able to reproduce this if (and only if) I forgot to create the new dest directory before doing the reload. Subsequently creating the directory and reloading did not result in a file in the new location. Regards Mark
> Do you want to try attaching to the collector with a debugger and seeing > if it ever gets into the "if (got_SIGHUP)" block in SysLoggerMain? Hmmm. No debugger on this system, not unexpectedly. I'll see if I can get authorization to add one. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> Do you want to try attaching to the collector with a debugger and seeing >> if it ever gets into the "if (got_SIGHUP)" block in SysLoggerMain? > Hmmm. No debugger on this system, not unexpectedly. I'll see if I can > get authorization to add one. How about strace --- if that's available, just watching the collector process while you send it SIGHUP might be informative. regards, tom lane
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes: > Might be a red herring, but I was able to reproduce this if (and only > if) I forgot to create the new dest directory before doing the reload. > Subsequently creating the directory and reloading did not result in a > file in the new location. Hmm, interesting point. There's this code in there when fopen'ing the log file fails: /* * ENFILE/EMFILE are not too surprising on a busy system; just * keep using the old file till we manage to get a new one. * Otherwise, assume something's wrong with Log_directory and stop * trying to create files. */ if (errno != ENFILE && errno != EMFILE) { ereport(LOG, (errmsg("disabling automatic rotation (use SIGHUP to re-enable)"))); Log_RotationAge = 0; Log_RotationSize = 0; } Whoever wrote that thought that Log_RotationAge/Log_RotationSize would get reset to normal values during SIGHUP, but it's far from clear to me that any such thing would actually happen. However, this would only apply to Josh's problem if he was trying to set a bogus new value of log_directory, eg not there or not writable by postgres. In any case, if this is the locus of the problem, there ought to be instances of that log message in the active log file. (Josh?) Another point here is that the syslogger will attempt to create the log directory if it's not there ... but only once, during startup. Should we have it repeat that after a SIGHUP? regards, tom lane
> Whoever wrote that thought that Log_RotationAge/Log_RotationSize would > get reset to normal values during SIGHUP, but it's far from clear to me > that any such thing would actually happen. However, this would only > apply to Josh's problem if he was trying to set a bogus new value of > log_directory, eg not there or not writable by postgres. In any case, > if this is the locus of the problem, there ought to be instances of that > log message in the active log file. (Josh?) Aha. Yeah, the problem is, directory permissions to the contrary, something is preventing the logger from writing to that directory even though I can do so as the logged-in postgres user. I'll bet it's their SAN dynamic mounter thing, given the trouble it's been before. *sigh*. So this is an issue peculiar to their system, and not a general case. Sorry for the noise. I've seen transitory behavior like this before and had hoped that I'd found a reproduceable test case. But no. We do have one piece of unituitive behavior here though, which forms a bit of a catch-22: 1. DBA changes the log directory 2. Log directory is unwriteable 3. Postgres continues writing to the old log_directory 4. SHOW log_directory displays the *new* log_directory I think (4) here needs to change. We shouldn't be showing a different log directory in pg_settings than we're actually writing to, ever. Now, here's the Catch-22: Consider the case that time elapses before anyone discovers that logs are not being written to the new location, and you change personnel; how would the new DBA have any idea where the old log was so that he could read the log message about the unwriteable directory? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
I wrote: > Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes: >> Might be a red herring, but I was able to reproduce this if (and only >> if) I forgot to create the new dest directory before doing the reload. >> Subsequently creating the directory and reloading did not result in a >> file in the new location. > Whoever wrote that thought that Log_RotationAge/Log_RotationSize would > get reset to normal values during SIGHUP, but it's far from clear to me > that any such thing would actually happen. And in fact a bit of experimentation proves that it doesn't, unless there are uncommented log_rotation_age and log_rotation_size entries in postgresql.conf. So this is a bug, not to mention bad practice. I think the code needs to use an explicit "rotation_disabled" flag that it can reset in the SIGHUP block, instead of stomping on GUC variables. However, it remains far from clear whether this has anything to do with Josh's situation. If it does, there should be telltale messages in his log ... regards, tom lane
Josh Berkus <josh@agliodbs.com> writes: > We do have one piece of unituitive behavior here though, which forms a > bit of a catch-22: > 1. DBA changes the log directory > 2. Log directory is unwriteable > 3. Postgres continues writing to the old log_directory > 4. SHOW log_directory displays the *new* log_directory > I think (4) here needs to change. Well, sorry, but you can't have that. SHOW will tell you what your own backend thinks the value of the GUC variable is, but there is no way to know what's happening inside the logging collector process. And I'd be against trying to add a signaling mechanism that would support telling you that, because it would add fragility to the logging collector, which we don't want. > Consider the case that time elapses before anyone discovers that logs > are not being written to the new location, I don't buy that argument. If things are working as intended, the collector ought to create a new file in the commanded location immediately. I would think any normal DBA would look for that, just to check that the SIGHUP had worked. We clearly need to fix the "disable rotation" logic, and I think a good case can be made for trying a mkdir() when Log_directory changes, but I don't see that there's more that can be done about this kind of situation. BTW, what log messages were you getting exactly? I'd have expected something about "could not open log file" as well as the "disabling automatic rotation" one. regards, tom lane
> Well, sorry, but you can't have that. SHOW will tell you what your own > backend thinks the value of the GUC variable is, but there is no way to > know what's happening inside the logging collector process. And I'd be > against trying to add a signaling mechanism that would support telling > you that, because it would add fragility to the logging collector, which > we don't want. Hmmm. There's really no feasible way to determine where the actual logging collector is logging to? > I don't buy that argument. If things are working as intended, the > collector ought to create a new file in the commanded location > immediately. I would think any normal DBA would look for that, just > to check that the SIGHUP had worked. You're assuming that the change is happening attended. Consider automated changes being deployed via puppet or chef, possibly across many servers. You can end up in a situation where the logs aren't going where they're supposed to due to some external problem, and that the DBA has no way to find out what went wrong because he doesn't know where the logs are *now*. Mind you, I don't have a solution to suggest, but I think it's a real problem. And fixing the rotation bug will help that a lot; it would narrow down the problem to the current day/hour, which would make it much more likely that someone would know that the log location had changed. > BTW, what log messages were you getting exactly? I'd have expected > something about "could not open log file" as well as the "disabling > automatic rotation" one. 2012-04-26 16:38:21 PDT [10180]: [2-1] user=,db= LOG: received SIGHUP, reloading configuration files 2012-04-26 16:38:21 PDT [10181]: [1-1] user=,db= LOG: could not open log file "/pglogs/check/logs/datacollection-2012-04-26-16-38": No such file or directory 2012-04-26 16:38:21 PDT [10181]: [2-1] user=,db= LOG: disabling automatic rotation (use SIGHUP to re-enable) So, yes, exactly. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > You can end up in a situation where the logs aren't going where they're > supposed to due to some external problem, and that the DBA has no way to > find out what went wrong because he doesn't know where the logs are *now*. Well, if nothing else, lsof would help. Another possibility is that we might change the logging collector process to show its current target filename in ps status (although might there be security/privacy issues with that?). Neither of those things will help Windows users of course, but the sorts of cases you're presenting aren't going to be happening on Windows boxes. [ thinks some more... ] A lower-tech solution would be to always write the name of the current log target file into some small text file in $PGDATA. On the whole though, I think this is an invented problem. We've never heard a complaint from the field about it. regards, tom lane
On Sat, Apr 28, 2012 at 03:35, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> You can end up in a situation where the logs aren't going where they're >> supposed to due to some external problem, and that the DBA has no way to >> find out what went wrong because he doesn't know where the logs are *now= *. > > Well, if nothing else, lsof would help. =A0Another possibility is that we > might change the logging collector process to show its current target > filename in ps status (although might there be security/privacy issues > with that?). =A0Neither of those things will help Windows users > of course, but the sorts of cases you're presenting aren't going to be > happening on Windows boxes. We do have the "fake process status kernel object" that can be used as a last resort. If this is something that would only be used in debugging scenarios, that would be perfectly reasonable I think. > [ thinks some more... ] =A0A lower-tech solution would be to always write > the name of the current log target file into some small text file in > $PGDATA. That seems ugly. > On the whole though, I think this is an invented problem. =A0We've never > heard a complaint from the field about it. I think process title seems reasonable. We do that for archiver for example, to tell you where it's writing, don't we? --=20 =A0Magnus Hagander =A0Me: http://www.hagander.net/ =A0Work: http://www.redpill-linpro.com/
>> On the whole though, I think this is an invented problem. We've never >> heard a complaint from the field about it. > > I think process title seems reasonable. We do that for archiver for > example, to tell you where it's writing, don't we? Yes, we do. This isn't an invented problem; a brief canvass on IRC shows that people run into issues with log_collector reloads fairly commonly. However, it's pretty low priority, certainly -- never rises above the level of "annoyance". The sort of thing where it would be good to have a bugtracker to put it in so the next time someone is working on the log collector for other reasons they can tweak this too, or when some new hacker wants an easy first patch. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > This isn't an invented problem; a brief canvass on IRC shows that people > run into issues with log_collector reloads fairly commonly. However, > it's pretty low priority, certainly -- never rises above the level of > "annoyance". The sort of thing where it would be good to have a > bugtracker to put it in so the next time someone is working on the log > collector for other reasons they can tweak this too, or when some new > hacker wants an easy first patch. We have that; it's called TODO. http://wiki.postgresql.org/wiki/Todo regards, tom lane