Re: [HACKERS] Proposal : For Auto-Prewarm. - Mailing list pgsql-hackers

From Rafia Sabih
Subject Re: [HACKERS] Proposal : For Auto-Prewarm.
Date
Msg-id CAOGQiiNwtepFRa=R2-5YBrgxyT+w8GcWyZY0pwA_SiH-OZQpKQ@mail.gmail.com
In response to Re: [HACKERS] Proposal : For Auto-Prewarm.  (Mithun Cy <mithun.cy@enterprisedb.com>)
Responses Re: [HACKERS] Proposal : For Auto-Prewarm.
List pgsql-hackers
On Sun, Jun 4, 2017 at 12:45 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> + *        contrib/autoprewarm.c
>>
>> Wrong.
>
> -- Oops, sorry, fixed.
>
>> +    Oid            database;        /* database */
>> +    Oid            spcnode;        /* tablespace */
>> +    Oid            filenode;        /* relation's filenode. */
>> +    ForkNumber    forknum;        /* fork number */
>> +    BlockNumber blocknum;        /* block number */
>>
>> Why spcnode rather than tablespace?  Also, the third line has a
>> period, but not the others.  I think you could just drop these
>> comments; they basically just duplicate the structure names, except
>> for spcnode which doesn't but probably should.
>
> -- Dropped comments and changed spcnode to tablespace.
>
>> +    bool        can_do_prewarm; /* if set can't do prewarm task. */
>>
>> The comment and the name of the variable imply opposite meanings.
>
> -- Sorry, that was a typo. The variable has now been removed, replaced
> as you suggested by new variables in AutoPrewarmSharedState.
>
>> +/*
>> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
>> + */
>> I assume you don't really need this.  Presumably process exit is going
>> to detach the DSM anyway.
>
> -- Yes, since process exit will detach the DSM, I have removed that
> function.
>
>> +    if (seg == NULL)
>> +        ereport(ERROR,
>> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                 errmsg("unable to map dynamic shared memory segment")));
>>
>> Let's use the wording from the message in parallel.c rather than this
>> wording.  Actually, we should probably (as a separate patch) change
>> test_shm_mq's worker.c to use the parallel.c wording also.
>
> -- I have corrected the message to "could not map dynamic shared
> memory segment", as in parallel.c.
>
>> +    SetCurrentStatementStartTimestamp();
>>
>> Why do we need this?
>
> -- Removed. Sorry, I forgot to remove it along with the SPI connection code.
>
>> +    StartTransactionCommand();
>>
>> Do we need a transaction?  If so, how about starting a new transaction
>> for each relation instead of using a single one for the entire
>> prewarm?
>
> -- We do relation_open, hence we need a transaction. As suggested, we
> now start a new transaction for every relation.
>
>> +    if (status == BGWH_STOPPED)
>> +        return;
>> +
>> +    if (status == BGWH_POSTMASTER_DIED)
>> +    {
>> +        ereport(ERROR,
>> +                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
>> +              errmsg("cannot start bgworker autoprewarm without postmaster"),
>> +                 errhint("Kill all remaining database processes and restart"
>> +                         " the database.")));
>> +    }
>> +
>> +    Assert(0);
>>
>> Instead of writing it this way, remove the first "if" block and change
>> the second one to Assert(status == BGWH_STOPPED).  Also, I'd ditch the
>> curly braces in this case.
>
> -- Fixed as suggested.
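For the archives, the restructured flow described above looks roughly like
this (a standalone sketch: the enum and function name are mocked-up
stand-ins, not the real bgworker API, and the ereport() is reduced to a
plain fprintf()):

```c
#include <assert.h>
#include <stdio.h>

/* Mocked-up stand-in for the real BgwHandleStatus enum. */
typedef enum
{
    BGWH_STARTED,
    BGWH_STOPPED,
    BGWH_POSTMASTER_DIED
} BgwHandleStatus;

static void
check_stopped_status(BgwHandleStatus status)
{
    if (status == BGWH_POSTMASTER_DIED)
    {
        /* the real code raises ereport(ERROR, ...) here */
        fprintf(stderr, "cannot start bgworker autoprewarm without postmaster\n");
        return;
    }

    /* with the first "if" block gone, this is the only state left */
    assert(status == BGWH_STOPPED);
}
```

With the explicit "return on BGWH_STOPPED" gone, the Assert documents the
only remaining legal state instead of a bare Assert(0) fallthrough.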
>
>>
>> +    file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>>
>> Use AllocateFile() instead.  See the comments for that function and at
>> the top of fd.c.  Then you don't need to worry about leaking the file
>> handle on an error, so you can remove the fclose() before ereport()
>> stuff -- which is incomplete anyway, because you could fail e.g.
>> inside dsm_create().
>
> -- Using AllocateFile now.
>
>>
>> +                 errmsg("Error reading num of elements in \"%s\" for"
>> +                        " autoprewarm : %m", AUTOPREWARM_FILE)));
>>
>> As I said in a previous review, please do NOT split error messages
>> across lines like this.  Also, this error message is nothing close to
>> PostgreSQL style. Please read
>> https://www.postgresql.org/docs/devel/static/error-style-guide.html
>> and learn to follow all those guidelines written therein.  I see at
>> least 3 separate problems with this message.
>
> -- Thanks, I have tried to fix it now.
>
>> +    num_elements = i;
>>
>> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
>> block dump has %d entries but expected %d", i, num_elements);  It
>> seems OK for this to be elog() rather than ereport() because the file
>> should never be corrupt unless the user has cheated by hand-editing
>> it.
>
> -- Fixed as suggested. Now reported with elog(ERROR).
>
>> I think you can get rid of next_db_pos altogether, and this
>> prewarm_elem thing too.  Design sketch:
>>
>> 1. Move all the stuff that's in prewarm_elem into
>> AutoPrewarmSharedState.  Rename start_pos to prewarm_start_idx and
>> end_of_blockinfos to prewarm_stop_idx.
>>
>> 2. Instead of computing all of the database ranges first and then
>> launching workers, do it all in one loop.  So figure out where the
>> records for the current database end and set prewarm_start_idx and
>> prewarm_end_idx appropriately.  Launch a worker.  When the worker
>> terminates, set prewarm_start_idx = prewarm_end_idx and advance
>> prewarm_end_idx to the end of the next database's records.
>>
>> This saves having to allocate memory for the next_db_pos array, and it
>> also avoids this crock:
>>
>> +    memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));
>>
>
> -- Fixed as suggested.
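A minimal standalone sketch of the one-loop design: since the dump is
sorted by database, each worker's range [prewarm_start_idx,
prewarm_stop_idx) is found by scanning forward while the database OID stays
the same. BlockInfoRecord is reduced here to the one field the split needs,
and the helper name is illustrative, not from the patch:

```c
#include <assert.h>

typedef unsigned int Oid;

/* Reduced stand-in for the dumped block info record. */
typedef struct
{
    Oid database;
} BlockInfoRecord;

/* Advance past all records sharing block_info[start_idx]'s database and
 * return the stop index (one past the last matching record). */
static int
find_prewarm_stop_idx(const BlockInfoRecord *block_info,
                      int start_idx, int num_elements)
{
    int stop_idx = start_idx;

    while (stop_idx < num_elements &&
           block_info[stop_idx].database == block_info[start_idx].database)
        stop_idx++;

    return stop_idx;
}
```

The driver loop then becomes: start at 0, compute the stop index, launch a
per-database worker for that range, wait for it, and set start = stop; no
next_db_pos array is needed.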
>
>> The reason that's bad is because it only works so long as bgw_extra is
>> large enough to hold prewarm_elem.  If prewarm_elem grows or bgw_extra
>> shrinks, this turns into a buffer overrun.
>
> -- Passing the prewarm info through bgw_extra helped restrict the scope
> and lifetime of prewarm_elem to the prewarm task alone; moving it into
> shared memory makes it global even though it is not needed once the
> prewarm task finishes. But given the other disadvantages of using
> bgw_extra, I have now implemented it as you suggested.
>
>> I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
>> the PID for the temporary file.  Otherwise, you might leave many
>> temporary files behind under different names.  If you use the same
>> name every time, you'll never have more than one, and the next
>> successful dump will end up getting rid of it along the way.
>
> -- Fixed as suggested. Previously the PID was used so that a concurrent
> dump could happen between the dump worker and an immediate dump, since
> they would write to two different files. With the new scheme of
> registering a PID in shared memory before file access, I think that
> problem is addressed.
>
>>
>> +            pfree(block_info_array);
>> +            CloseTransientFile(fd);
>> +            unlink(transient_dump_file_path);
>> +            ereport(ERROR,
>> +                    (errcode_for_file_access(),
>> +                     errmsg("error writing to \"%s\" : %m",
>> +                            AUTOPREWARM_FILE)));
>>
>> Again, this is NOT a standard error message text.  It occurs in zero
>> other places in the source tree.  You are not the first person to need
>> an error message for a failed write to a file; please look at what the
>> previous authors did.  Also, the pfree() before report is not needed;
>> isn't the whole process going to terminate? Also, you can't really use
>> errcode_for_file_access() here, because you've meanwhile done
>> CloseTransientFile() and unlink(), which will have clobbered errno.
>
> -- Removed the pfree, and saved errno before CloseTransientFile() and unlink().
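The errno-saving pattern can be shown standalone (file name and function are
invented for the demo): close() and unlink() may themselves fail and
overwrite errno, so the write's errno must be captured before cleanup or the
later error report would name the wrong failure.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int
failing_write_demo(void)
{
    int fd = open("errno_demo.tmp", O_CREAT | O_WRONLY, 0600);
    int saved_errno = 0;

    close(fd);                      /* force the next write to fail (EBADF) */
    if (write(fd, "x", 1) < 0)
    {
        saved_errno = errno;        /* capture before any cleanup */
        close(fd);                  /* cleanup calls may clobber errno... */
        unlink("errno_demo.tmp");
        errno = saved_errno;        /* ...so restore it before reporting */
    }
    return saved_errno;
}
```

The returned value is the write's original errno (EBADF here), regardless of
what the cleanup calls did to errno in between.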
>
>> +    ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));
>>
>> Not project style for ereport().  Line break after the first comma.
>> Similarly elsewhere.
>
> -- Tried to fix the same.
>
>> + *    dump_now - the main routine which goes through each buffer
>> header of buffer
>> + *    pool and dumps their meta data. We Sort these data and then dump them.
>> + *    Sorting is necessary as it facilitates sequential read during load.
>>
>> This is no longer true, because you moved the sort to the load side.
>> It's also not capitalized properly.
>
> -- Sorry, removed now.
>
>> Discussions of the format of the autoprewarm dump file involve
>> inexplicably varying number of < and > symbols:
>>
>> + *        <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
>> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
>> +        buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",
>
> -- Sorry, fixed now; the block info format no longer uses such (</>)
> delimiters anywhere.
>
>> +#ifndef __AUTOPREWARM_H__
>> +#define __AUTOPREWARM_H__
>>
>> We don't use double-underscored names for header files.  Please
>> consult existing header files for the appropriate style.  Also, why
>> does this file exist at all, instead of just including them in the .c
>> file?  The point of a header is for things that need to be included
>> by multiple .c files, but there's no such need here.
>
> -- This was done to address one of the previous review comments. I have
> moved them back to the .c file.
>
>> +             * load. If there are no other block infos than the global objects
>> +             * we silently ignore them. Should I throw error?
>>
>> Silently ignoring them seems fine.  Throwing an error doesn't seem
>> like it would improve things.
>
> -- Okay thanks.
>
>>
>> +        /*
>> +         * Register a sub-worker to load new database's block. Wait until the
>> +         * sub-worker finish its job before launching next sub-worker.
>> +         */
>> +        launch_prewarm_subworker(&pelem);
>>
>> The function name implies that it launches the worker, but the comment
>> implies that it also waits for it to terminate.  Functions should be
>> named in a way that matches what they do.
>
> -- Renamed it to launch_and_wait_for_per_database_worker.
>
>> I feel like the get_autoprewarm_task() function is introducing fairly
>> baroque logic for something that really ought to be more simple.  All
>> autoprewarm_main() really needs to do is:
>>
>> if (!state->autoprewarm_done)
>>     autoprewarm();
>> dump_block_info_periodically();
>
> -- Simplified as suggested; the function get_autoprewarm_task has been
> removed.
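A standalone mock of the simplified main loop (the names mirror the sketch
in the review, but the bodies are stubs that count calls instead of doing
real work): a restarted worker with autoprewarm_done already set skips
straight to dumping.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool autoprewarm_done;
    int  prewarm_calls;
    int  dump_calls;
} MockState;

static void
autoprewarm(MockState *state)
{
    state->prewarm_calls++;
    state->autoprewarm_done = true;
}

static void
dump_block_info_periodically(MockState *state)
{
    state->dump_calls++;    /* the real version loops until shutdown */
}

static void
autoprewarm_main(MockState *state)
{
    if (!state->autoprewarm_done)
        autoprewarm(state);
    dump_block_info_periodically(state);
}
```

Running autoprewarm_main twice against the same state (simulating a worker
restart) prewarms once but dumps both times.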
>
>> The locking in autoprewarm_dump_now() is a bit tricky.  There are two
>> trouble cases.  One is that we try to rename() our new dump file on
>> top of the existing one while a background worker is still using it to
>> perform an autoprewarm.  The other is that we try to write a new
>> temporary dump file while some other process is trying to do the same
>> thing.  I think we should fix this by storing a PID in
>> AutoPrewarmSharedState; a process which wants to perform a dump or an
>> autoprewarm must change the PID from 0 to their own PID, and change it
>> back to 0 on successful completion or error exit.  If a process goes
>> to perform an immediate dump and finds a non-zero value already there,
>> it just does ereport(ERROR, ...), including the PID of the other process in the
>> message (e.g. "unable to perform block dump because dump file is being
>> used by PID %d").  In a background worker, if we go to dump and find
>> the file in use, log a message (e.g. "skipping block dump because it
>> is already being performed by PID %d", "skipping prewarm because block
>> dump file is being rewritten by PID %d").
>
> -- Fixed as suggested.
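A standalone sketch of the PID-claim protocol: in the patch the field lives
in AutoPrewarmSharedState and is read and written while holding the LWLock;
here a plain struct and comments stand in for both.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

typedef struct
{
    pid_t pid_using_dumpfile;   /* 0 means the dump file is free */
} DemoSharedState;

static bool
claim_dumpfile(DemoSharedState *state, pid_t me)
{
    /* real code: LWLockAcquire(&state->lock, LW_EXCLUSIVE); */
    if (state->pid_using_dumpfile != 0)
        return false;           /* caller reports "... in use by PID %d" */
    state->pid_using_dumpfile = me;
    /* real code: LWLockRelease(&state->lock); */
    return true;
}

static void
release_dumpfile(DemoSharedState *state)
{
    state->pid_using_dumpfile = 0;
}
```

An immediate dump that fails to claim raises ERROR with the holder's PID;
the background worker merely logs and skips.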
>
>> I also think we should change is_bgworker_running to a PID, so that if
>> we try to launch a new worker we can report something like
>> "autoprewarm worker is already running under PID %d".
>
> -- Fixed. I can only LOG that another autoprewarm worker is already
> running and then exit, because on ERROR we try to restart the worker,
> and we do not want such workers restarted.
>
>> So putting that all together, I suppose AutoPrewarmSharedState should
>> end up looking like this:
>>
>> LWLock lock; /* mutual exclusion */
>> pid_t bgworker_pid;  /* for main bgworker */
>> pid_t pid_using_dumpfile; /* for autoprewarm or block dump */
>
> -- I think one more member is required, which states whether prewarm
> can be done when the worker restarts.
>
>> /* following items are for communication with per-database worker */
>> dsm_handle block_info_handle;
>> Oid database;
>> int prewarm_start_idx;
>> int prewarm_stop_idx;
>
> -- Fixed as suggested
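Putting the pieces of the review together, the shared state could look
roughly like this. LWLock, Oid and dsm_handle are stubbed out so the sketch
compiles standalone, and the restart flag is the additional member Mithun
mentions, under an invented name:

```c
#include <stdbool.h>
#include <sys/types.h>

/* Stand-ins for the real PostgreSQL types. */
typedef struct { int dummy; } LWLock;
typedef unsigned int Oid;
typedef unsigned int dsm_handle;

typedef struct AutoPrewarmSharedState
{
    LWLock      lock;                    /* mutual exclusion */
    pid_t       bgworker_pid;            /* main bgworker, 0 if not running */
    pid_t       pid_using_dumpfile;      /* autoprewarm or block dump, 0 if free */
    bool        skip_prewarm_on_restart; /* hypothetical name for the extra member */

    /* communication with the per-database worker */
    dsm_handle  block_info_handle;
    Oid         database;
    int         prewarm_start_idx;
    int         prewarm_stop_idx;
} AutoPrewarmSharedState;
```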
>
>> I suggest going through and changing "subworker" to "per-database
>> worker" throughout.
>
> -- Fixed as suggested.
>
>>
>> BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?
>
> I have tried the same on my local machine with an SSD as storage.
>
> settings: shared_buffers = 8GB, data loaded with pgbench scale_factor=1000.
>
> Total blocks got dumped
> autoprewarm_dump_now
> ----------------------
> 1048576
>
> 5 different load time based logs
>
> 1.
> 2017-06-04 11:30:26.460 IST [116253] LOG:  autoprewarm has started
> 2017-06-04 11:30:43.443 IST [116253] LOG:  autoprewarm load task ended
> -- 17 secs
>
> 2
> 2017-06-04 11:31:13.565 IST [116291] LOG:  autoprewarm has started
> 2017-06-04 11:31:30.317 IST [116291] LOG:  autoprewarm load task ended
> -- 17 secs
>
> 3.
> 2017-06-04 11:32:12.995 IST [116329] LOG:  autoprewarm has started
> 2017-06-04 11:32:29.982 IST [116329] LOG:  autoprewarm load task ended
> -- 17 secs
>
> 4.
> 2017-06-04 11:32:58.974 IST [116361] LOG:  autoprewarm has started
> 2017-06-04 11:33:15.017 IST [116361] LOG:  autoprewarm load task ended
> -- 17 secs
>
> 5.
> 2017-06-04 12:15:49.772 IST [117936] LOG:  autoprewarm has started
> 2017-06-04 12:16:11.012 IST [117936] LOG:  autoprewarm load task ended
> -- 22 secs.
>
> So mostly from 17 to 22 secs.
>
> But I think I need to run tests on a larger set of configurations and
> different storage types. I shall do that and upload the results later.
> I have also uploaded the latest performance test results (from my local
> machine's SSD drive).
> configuration: shared_buffers = 8GB
> test setting: scale_factor=300 (data fits in shared_buffers), pgbench clients = 1
>
> TEST
> PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1
> -c 1 --select-only postgres"
> START_TIME=$SECONDS; echo TIME, TPS; while true; do TPS=$($PGBENCH_RUN
> | grep excluding | cut -d ' ' -f 3); TIME=$((SECONDS-START_TIME));
> echo $TIME, $TPS; done
>
>
I had a look at the patch from a stylistic/formatting point of view;
please find attached a patch with the suggested modifications.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


