Thread: more autovacuum fixes
I attach a patch to fix some issues in the autovac process handling code. Mainly, instead of rechecking periodically in the launcher whether a worker was able to start or not, what we do is signal it from the postmaster when it fails. In order to do this I had to make the postmaster set a flag in shared memory, much like the pmsignal.c code does. According to previous discussions this is an acceptable thing for postmaster to do. I must admit I did not try to fill shared memory with garbage and see if it caused a postmaster crash. What I did test was creating a database, making sure that the launcher was picking it up, and then delete the database's directory. This correctly causes a worker to log errors when trying to connect, and it is handled well -- i.e. other databases are not starved. (One remaining problem I see is that a database in this state, or similar, that fails to be vacuumed, *will* cause starvation if in danger of Xid wraparound). Another thing I noticed is that I can change the "autovacuum" parameter in postgresql.conf and send SIGHUP to postmaster, which correctly starts the launcher if it was previously disabled. To make the shutdown case act accordingly, I made the launcher check the "start daemon" flag after rereading the config file, and quit if it's off. I tried multiple combinations of having it set and unset, changing stats_collect_tuple_level and it works fine. One problem with the patch is this (new code): bn = (Backend *) malloc(sizeof(Backend)); ! if (bn) { ! bn->pid = StartAutoVacWorker(); ! bn->is_autovacuum = true; ! /* we don't need a cancel key */ ! if (bn->pid > 0) ! { ! /* FIXME -- unchecked memory allocation here */ ! DLAddHead(BackendList, DLNewElem(bn)); If the palloc() inside DLNewElem fails, we will fail to report a "fork failure" to the launcher. I am not sure how serious this is. One idea that came to mind was using a PG_TRY block, sending the signal in the CATCH block, and then rethrowing the exception. Is this acceptable? All in all, this patch increases the reliability of autovacuum, so I intend to apply it shortly unless there are objections. Further improvements might come later as I become aware of possible fixes. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)
Attachment
Alvaro Herrera wrote: > One problem with the patch is this (new code): > > bn = (Backend *) malloc(sizeof(Backend)); > ! if (bn) > { > ! bn->pid = StartAutoVacWorker(); > ! bn->is_autovacuum = true; > ! /* we don't need a cancel key */ > > ! if (bn->pid > 0) > ! { > ! /* FIXME -- unchecked memory allocation here */ > ! DLAddHead(BackendList, DLNewElem(bn)); > > > If the palloc() inside DLNewElem fails, we will fail to report a "fork > failure" to the launcher. I am not sure how serious this is. One idea > that came to mind was using a PG_TRY block, sending the signal in the > CATCH block, and then rethrowing the exception. Is this acceptable? I noticed another problem: the worker may fail during BaseInit() or InitProcess(). This is not where most problems will be (that would be later, in InitPostgres(), which is when the worker connects to a DB) but still could cause a starvation problem, I think. Maybe the PG_TRY block is called for in there, as well as the postmaster code. -- Alvaro Herrera http://www.PlanetPostgreSQL.org/ "The ability to monopolize a planet is insignificant next to the power of the source"
Alvaro Herrera wrote: > One problem with the patch is this (new code): > > bn = (Backend *) malloc(sizeof(Backend)); > ! if (bn) > { > ! bn->pid = StartAutoVacWorker(); > ! bn->is_autovacuum = true; > ! /* we don't need a cancel key */ > > ! if (bn->pid > 0) > ! { > ! /* FIXME -- unchecked memory allocation here */ > ! DLAddHead(BackendList, DLNewElem(bn)); > > > If the palloc() inside DLNewElem fails, we will fail to report a "fork > failure" to the launcher. I am not sure how serious this is. Turns out that this problem is not serious at all, because if that palloc() fails, the whole postmaster will exit with a FATAL out of memory message. The problems in the worker code after fork are still an issue though. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "MySQL is a toy compared to PostgreSQL." (Randal L. Schwartz) (http://archives.postgresql.org/pgsql-general/2005-07/msg00517.php)
Alvaro Herrera wrote: > Turns out that this problem is not serious at all, because if that > palloc() fails, the whole postmaster will exit with a FATAL out of > memory message. > > The problems in the worker code after fork are still an issue though. It _is_ still an issue -- and a very serious issue at that. If a worker fails before getting its entry from the startingWorker pointer, then (patched) autovacuum will no longer run any task. Since it doesn't seem like it is possible to signal the launcher until the worker has set up shared memory, I think we should leave the current shenanigans in place. They are really ugly code and I'd love to get rid of them, but currently I don't see any better mechanism to deal with this failure scenario. Thanks for your attention ;-) -- Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4 Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'. After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree. (Don Knuth)
Alvaro Herrera wrote: > Alvaro Herrera wrote: > > > Turns out that this problem is not serious at all, because if that > > palloc() fails, the whole postmaster will exit with a FATAL out of > > memory message. > > > > The problems in the worker code after fork are still an issue though. > > It _is_ still an issue -- and a very serious issue at that. If a worker > fails before getting its entry from the startingWorker pointer, then > (patched) autovacuum will no longer run any task. I figured that I could keep the old check there for when the worker failed, but still add the new signalling mechanism so that a fork() failure (which I would think is the most likely of all) is taken care of in a more timely manner. I've also improved the rest of the code and comments a bit, and the new code does seem better now. So I'll go ahead and commit it later today. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"