Re: [HACKERS] Proposal : For Auto-Prewarm. - Mailing list pgsql-hackers
From: Rafia Sabih
Subject: Re: [HACKERS] Proposal : For Auto-Prewarm.
Msg-id: CAOGQiiNwtepFRa=R2-5YBrgxyT+w8GcWyZY0pwA_SiH-OZQpKQ@mail.gmail.com
In response to: Re: [HACKERS] Proposal : For Auto-Prewarm. (Mithun Cy <mithun.cy@enterprisedb.com>)
Responses: Re: [HACKERS] Proposal : For Auto-Prewarm.
List: pgsql-hackers
On Sun, Jun 4, 2017 at 12:45 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> + * contrib/autoprewarm.c
>>
>> Wrong.
>
> -- Oops, sorry, fixed.
>
>> + Oid database; /* database */
>> + Oid spcnode; /* tablespace */
>> + Oid filenode; /* relation's filenode. */
>> + ForkNumber forknum; /* fork number */
>> + BlockNumber blocknum; /* block number */
>>
>> Why spcnode rather than tablespace? Also, the third line has a
>> period, but not the others. I think you could just drop these
>> comments; they basically just duplicate the structure names, except
>> for spcnode which doesn't but probably should.
>
> -- Dropped the comments and changed spcnode to tablespace.
>
>> + bool can_do_prewarm; /* if set can't do prewarm task. */
>>
>> The comment and the name of the variable imply opposite meanings.
>
> -- Sorry, a typo. This variable has now been removed, as you
> suggested, in favor of new variables in AutoPrewarmSharedState.
>
>> +/*
>> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
>> + */
>>
>> I assume you don't really need this. Presumably process exit is going
>> to detach the DSM anyway.
>
> -- Yes, since process exit will detach the DSM, I have removed that
> function.
>
>> + if (seg == NULL)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> + errmsg("unable to map dynamic shared memory segment")));
>>
>> Let's use the wording from the message in parallel.c rather than this
>> wording. Actually, we should probably (as a separate patch) change
>> test_shm_mq's worker.c to use the parallel.c wording also.
>
> -- I have corrected the message to "could not map dynamic shared
> memory segment", as in parallel.c.
>
>> + SetCurrentStatementStartTimestamp();
>>
>> Why do we need this?
>
> -- Removed. Sorry, I forgot to remove it when I removed the SPI
> connection code.
>
>> + StartTransactionCommand();
>>
>> Do we need a transaction? If so, how about starting a new transaction
>> for each relation instead of using a single one for the entire
>> prewarm?
>
> -- We do relation_open, hence we need a transaction. As suggested, we
> now start a new transaction for every new relation.
>
>> + if (status == BGWH_STOPPED)
>> + return;
>> +
>> + if (status == BGWH_POSTMASTER_DIED)
>> + {
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
>> + errmsg("cannot start bgworker autoprewarm without postmaster"),
>> + errhint("Kill all remaining database processes and restart"
>> + " the database.")));
>> + }
>> +
>> + Assert(0);
>>
>> Instead of writing it this way, remove the first "if" block and change
>> the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the
>> curly braces in this case.
>
> -- Fixed as suggested.
>
>> + file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>>
>> Use AllocateFile() instead. See the comments for that function and at
>> the top of fd.c. Then you don't need to worry about leaking the file
>> handle on an error, so you can remove the fclose() before ereport()
>> stuff -- which is incomplete anyway, because you could fail e.g.
>> inside dsm_create().
>
> -- Using AllocateFile now.
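As an aside for anyone following the thread, here is a rough sketch of
the AllocateFile() pattern under discussion. The surrounding load code
is assumed for illustration and is not taken verbatim from the patch:

    /*
     * AllocateFile() registers the handle with the current resource
     * owner, so an ereport(ERROR) raised while reading cannot leak it;
     * no explicit fclose() is needed on the error paths.
     */
    FILE       *file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R);

    if (file == NULL)
    {
        /* A missing dump file just means there is nothing to load yet. */
        if (errno != ENOENT)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m",
                            AUTOPREWARM_FILE)));
        return;
    }

    /* ... read the block infos; an ERROR here releases the handle ... */

    FreeFile(file);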
>> + errmsg("Error reading num of elements in \"%s\" for"
>> + " autoprewarm : %m", AUTOPREWARM_FILE)));
>>
>> As I said in a previous review, please do NOT split error messages
>> across lines like this. Also, this error message is nothing close to
>> PostgreSQL style. Please read
>> https://www.postgresql.org/docs/devel/static/error-style-guide.html
>> and learn to follow all those guidelines written therein. I see at
>> least 3 separate problems with this message.
>
> -- Thanks, I have tried to fix it now.
>
>> + num_elements = i;
>>
>> I'd do something like: if (i != num_elements) elog(ERROR, "autoprewarm
>> block dump has %d entries but expected %d", i, num_elements); It
>> seems OK for this to be elog() rather than ereport() because the file
>> should never be corrupt unless the user has cheated by hand-editing
>> it.
>
> -- Fixed as suggested. It is now reported as an ERROR via elog().
>
>> I think you can get rid of next_db_pos altogether, and this
>> prewarm_elem thing too. Design sketch:
>>
>> 1. Move all the stuff that's in prewarm_elem into
>> AutoPrewarmSharedState. Rename start_pos to prewarm_start_idx and
>> end_of_blockinfos to prewarm_stop_idx.
>>
>> 2. Instead of computing all of the database ranges first and then
>> launching workers, do it all in one loop. So figure out where the
>> records for the current database end and set prewarm_start_idx and
>> prewarm_end_idx appropriately. Launch a worker. When the worker
>> terminates, set prewarm_start_idx = prewarm_end_idx and advance
>> prewarm_end_idx to the end of the next database's records.
>>
>> This saves having to allocate memory for the next_db_pos array, and it
>> also avoids this crock:
>>
>> + memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));
>
> -- Fixed as suggested.
>
>> The reason that's bad is because it only works so long as bgw_extra is
>> large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra
>> shrinks, this turns into a buffer overrun.
>
> -- Passing the prewarm info through bgw_extra helped us restrict the
> scope and lifetime of prewarm_elem to the prewarm task alone. Moving it
> to shared memory makes it global even though it is not needed once the
> prewarm task is finished. But given the other disadvantages of using
> bgw_extra, I have now implemented it as you suggested.
>
>> I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
>> the PID for the temporary file. Otherwise, you might leave many
>> temporary files behind under different names. If you use the same
>> name every time, you'll never have more than one, and the next
>> successful dump will end up getting rid of it along the way.
>
> -- Fixed as suggested. Previously the PID was used so that a concurrent
> dump could happen between the dump worker and an immediate dump, as
> they would write to two different files. With the new scheme of
> registering a PID in shared memory before accessing the file, I think
> that problem is addressed.
>
>> + pfree(block_info_array);
>> + CloseTransientFile(fd);
>> + unlink(transient_dump_file_path);
>> + ereport(ERROR,
>> + (errcode_for_file_access(),
>> + errmsg("error writing to \"%s\" : %m",
>> + AUTOPREWARM_FILE)));
>>
>> Again, this is NOT a standard error message text. It occurs in zero
>> other places in the source tree. You are not the first person to need
>> an error message for a failed write to a file; please look at what the
>> previous authors did. Also, the pfree() before the ereport() is not
>> needed; isn't the whole process going to terminate? Also, you can't
>> really use errcode_for_file_access() here, because you've meanwhile
>> done CloseTransientFile() and unlink(), which will have clobbered
>> errno.
>
> -- Removed the pfree(), and saved errno before CloseTransientFile()
> and unlink().
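For reference, a minimal sketch of the errno-preservation pattern being
asked for here; the variable names are illustrative, not from the patch:

    int         save_errno = errno; /* capture before it can be clobbered */

    CloseTransientFile(fd);
    unlink(transient_dump_file_path);

    errno = save_errno;         /* so %m reports the original write error */
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not write to file \"%s\": %m",
                    transient_dump_file_path)));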
>> + ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));
>>
>> Not project style for ereport(). Line break after the first comma.
>> Similarly elsewhere.
>
> -- Tried to fix the same.
>
>> + * dump_now - the main routine which goes through each buffer header of
>> + * buffer pool and dumps their meta data. We Sort these data and then dump
>> + * them. Sorting is necessary as it facilitates sequential read during load.
>>
>> This is no longer true, because you moved the sort to the load side.
>> It's also not capitalized properly.
>
> -- Sorry, removed now.
>
>> Discussions of the format of the autoprewarm dump file involve
>> inexplicably varying numbers of < and > symbols:
>>
>> + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
>> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
>> + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",
>
> -- Sorry, fixed now; none of the block info formats use such < / >
> delimiters anymore.
>
>> +#ifndef __AUTOPREWARM_H__
>> +#define __AUTOPREWARM_H__
>>
>> We don't use double-underscored names for header files. Please
>> consult existing header files for the appropriate style. Also, why
>> does this file exist at all, instead of just including them in the .c
>> file? The point of a header is for things that need to be included
>> by multiple .c files, but there's no such need here.
>
> -- This was done to fix one of the previous review comments. I have
> moved the contents back to the .c file.
>
>> + * load. If there are no other block infos than the global objects
>> + * we silently ignore them. Should I throw error?
>>
>> Silently ignoring them seems fine. Throwing an error doesn't seem
>> like it would improve things.
>
> -- Okay, thanks.
>
>> + /*
>> + * Register a sub-worker to load new database's block. Wait until the
>> + * sub-worker finish its job before launching next sub-worker.
>> + */
>> + launch_prewarm_subworker(&pelem);
>>
>> The function name implies that it launches the worker, but the comment
>> implies that it also waits for it to terminate. Functions should be
>> named in a way that matches what they do.
>
> -- I have renamed it to launch_and_wait_for_per_database_worker.
>
>> I feel like the get_autoprewarm_task() function is introducing fairly
>> baroque logic for something that really ought to be more simple. All
>> autoprewarm_main() really needs to do is:
>>
>> if (!state->autoprewarm_done)
>>     autoprewarm();
>> dump_block_info_periodically();
>
> -- I have simplified things as suggested; get_autoprewarm_task() has
> been removed.
>
>> The locking in autoprewarm_dump_now() is a bit tricky. There are two
>> trouble cases. One is that we try to rename() our new dump file on
>> top of the existing one while a background worker is still using it to
>> perform an autoprewarm. The other is that we try to write a new
>> temporary dump file while some other process is trying to do the same
>> thing. I think we should fix this by storing a PID in
>> AutoPrewarmSharedState; a process which wants to perform a dump or an
>> autoprewarm must change the PID from 0 to their own PID, and change it
>> back to 0 on successful completion or error exit. A process that goes
>> to perform an immediate dump and finds a non-zero value already there
>> just does ereport(ERROR, ...), including the PID of the other process
>> in the message (e.g. "unable to perform block dump because dump file is
>> being used by PID %d"). In a background worker, if we go to dump and
>> find the file in use, log a message (e.g. "skipping block dump because
>> it is already being performed by PID %d", "skipping prewarm because
>> block dump file is being rewritten by PID %d").
>
> -- Fixed as suggested.
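A possible shape for the PID handshake Robert describes, assuming an
AutoPrewarmSharedState *state with the members sketched just below;
this is illustrative, not the patch's actual code:

    /* Try to claim the dump file by publishing our PID. */
    LWLockAcquire(&state->lock, LW_EXCLUSIVE);
    if (state->pid_using_dumpfile != 0)
    {
        pid_t       holder = state->pid_using_dumpfile;

        LWLockRelease(&state->lock);
        ereport(ERROR,
                (errmsg("could not perform block dump because dump file is being used by PID %d",
                        (int) holder)));
    }
    state->pid_using_dumpfile = MyProcPid;
    LWLockRelease(&state->lock);

    /* ... write the temporary file and rename() it into place ... */

    /* Release the claim on success (an on-exit hook would cover error exit). */
    LWLockAcquire(&state->lock, LW_EXCLUSIVE);
    state->pid_using_dumpfile = 0;
    LWLockRelease(&state->lock);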
>> I also think we should change is_bgworker_running to a PID, so that if
>> we try to launch a new worker we can report something like
>> "autoprewarm worker is already running under PID %d".
>
> -- Fixed. I could only LOG that another autoprewarm worker is already
> running and then exit, because on ERROR we try to restart the worker,
> and we do not want to restart such workers.
>
>> So putting that all together, I suppose AutoPrewarmSharedState should
>> end up looking like this:
>>
>> LWLock lock; /* mutual exclusion */
>> pid_t bgworker_pid; /* for main bgworker */
>> pid_t pid_using_dumpfile; /* for autoprewarm or block dump */
>
> -- I think one more member is required, stating whether prewarm can be
> done when the worker restarts.
>
>> /* following items are for communication with per-database worker */
>> dsm_handle block_info_handle;
>> Oid database;
>> int prewarm_start_idx;
>> int prewarm_stop_idx;
>
> -- Fixed as suggested.
>
>> I suggest going through and changing "subworker" to "per-database
>> worker" throughout.
>
> -- Fixed as suggested.
>
>> BTW, have you tested how long this takes to run with, say,
>> shared_buffers = 8GB?
>
> I have tried the same on my local machine with an SSD as storage.
>
> Settings: shared_buffers = 8GB; data loaded with pgbench scale_factor = 1000.
>
> Total blocks dumped:
>
>  autoprewarm_dump_now
>  ----------------------
>  1048576
>
> Five load-time runs from the logs:
>
> 1.
> 2017-06-04 11:30:26.460 IST [116253] LOG: autoprewarm has started
> 2017-06-04 11:30:43.443 IST [116253] LOG: autoprewarm load task ended
> -- 17 secs
>
> 2.
> 2017-06-04 11:31:13.565 IST [116291] LOG: autoprewarm has started
> 2017-06-04 11:31:30.317 IST [116291] LOG: autoprewarm load task ended
> -- 17 secs
>
> 3.
> 2017-06-04 11:32:12.995 IST [116329] LOG: autoprewarm has started
> 2017-06-04 11:32:29.982 IST [116329] LOG: autoprewarm load task ended
> -- 17 secs
>
> 4.
> 2017-06-04 11:32:58.974 IST [116361] LOG: autoprewarm has started
> 2017-06-04 11:33:15.017 IST [116361] LOG: autoprewarm load task ended
> -- 17 secs
>
> 5.
> 2017-06-04 12:15:49.772 IST [117936] LOG: autoprewarm has started
> 2017-06-04 12:16:11.012 IST [117936] LOG: autoprewarm load task ended
> -- 22 secs
>
> So mostly 17 to 22 seconds. But I think I need to run tests on a larger
> set of configurations and on different storage types; I shall do that
> and upload the results later. I have also uploaded the latest
> performance test results (from my local machine's SSD drive).
>
> Configuration: shared_buffers = 8GB.
> Test setting: scale_factor = 300 (data fits in shared_buffers), pgbench clients = 1.
>
> TEST:
> PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1 -c 1 --select-only postgres"
> START_TIME=$SECONDS; echo TIME, TPS
> while true; do
>     TPS=$($PGBENCH_RUN | grep excluding | cut -d ' ' -f 3)
>     TIME=$((SECONDS-START_TIME))
>     echo $TIME, $TPS
> done

I had a look at the patch from a stylistic/formatting point of view;
please find attached a patch with the suggested modifications.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/