Re: [HACKERS] brin autosummarization -- autovacuum "work items" - Mailing list pgsql-hackers

| From | Thomas Munro |
|---|---|
| Subject | Re: [HACKERS] brin autosummarization -- autovacuum "work items" |
| Date | |
| Msg-id | CAEepm=11ma_Z1HoPxPcSCANnh5ykHORa=HcA1U1V1+5S_jwPuA@mail.gmail.com |
| In response to | Re: [HACKERS] brin autosummarization -- autovacuum "work items" (Thomas Munro <thomas.munro@enterprisedb.com>) |
| Responses | Re: [HACKERS] brin autosummarization -- autovacuum "work items" |
| List | pgsql-hackers |
On Wed, Mar 1, 2017 at 6:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Mar 1, 2017 at 5:58 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> I think one of the most serious issues with BRIN indexes is how they
>> don't get updated automatically as the table is filled.  This patch
>> attempts to improve on that.  During brininsert() time, we check whether
>> we're inserting the first item on the first page in a range.  If we are,
>> request autovacuum to do a summarization run on that table.  This is
>> dependent on a new reloption for BRIN called "autosummarize", default
>> off.
>
> Nice.
>
>> The way the request works is that autovacuum maintains a DSA which can
>> be filled by backends with "work items".  Currently, work items can
>> specify a BRIN summarization of some specific index; in the future we
>> could use this framework to request other kinds of things that do not
>> fit in the "dead tuples / recently inserted tuples" logic that autovac
>> currently uses to decide to vacuum/analyze tables.
>>
>> However, it seems I have not quite gotten the hang of DSA just yet,
>> because after a couple of iterations, crashes occur.  I think the reason
>> has to do with either a resource owner clearing the DSA at an unwelcome
>> time, or perhaps there's a mistake in my handling of DSA "relative
>> pointers" stuff.
>
> Ok, I'll take a look.  It's set up for ease of use in short lifespan
> situations like parallel query, and there are a few extra hoops to
> jump through for longer lived DSA areas.

I haven't tested this, but here is some initial feedback after reading
it through once:

 /*
  * Attach to an area given a handle generated (possibly in another process) by
- * dsa_get_area_handle. The area must have been created with dsa_create (not
+ * dsa_get_handle. The area must have been created with dsa_create (not
  * dsa_create_in_place).
  */

This is an independent slam-dunk typo fix.

+	/*
+	 * Set up our DSA so that backends can install work-item requests.  It may
+	 * already exist as created by a previous launcher.
+	 */
+	if (!AutoVacuumShmem->av_dsa_handle)
+	{
+		LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
+		AutoVacuumDSA = dsa_create(LWTRANCHE_AUTOVACUUM);
+		AutoVacuumShmem->av_dsa_handle = dsa_get_handle(AutoVacuumDSA);
+		/* delay array allocation until first request */
+		AutoVacuumShmem->av_workitems = InvalidDsaPointer;
+		LWLockRelease(AutovacuumLock);
+	}
+	else
+		AutoVacuumDSA = dsa_attach(AutoVacuumShmem->av_dsa_handle);

I haven't looked into the autovacuum launcher lifecycle, but if it can
be restarted, as implied by the above, then I see no reason to believe
that the DSA area still exists at the point where you call dsa_attach()
here.  DSA areas are reference counted, so if there is ever a scenario
where no backend is currently attached, the area will be destroyed and
this call will fail.  If you want to create a DSA area that lasts until
cluster shutdown even while all backends are detached, you need to call
dsa_pin() after creating it.

In AutoVacuumRequestWork:

+	AutoVacuumDSA = dsa_attach(AutoVacuumShmem->av_dsa_handle);
+
+	if (!AutoVacuumDSA)
+	{
+		/* cannot attach?  disregard request */
+		LWLockRelease(AutovacuumLock);
+		return;
+	}

dsa_attach() either succeeds or throws, so that conditional code is
unreachable.
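If it helps, here is the rough shape I have in mind for a long-lived
area, completely untested and reusing the names from your patch
(AutoVacuumShmem->av_dsa_handle, AutoVacuumDSA, LWTRANCHE_AUTOVACUUM):

	/* Launcher (or whoever gets there first): create once, pin for good. */
	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
	if (!AutoVacuumShmem->av_dsa_handle)
	{
		AutoVacuumDSA = dsa_create(LWTRANCHE_AUTOVACUUM);
		dsa_pin(AutoVacuumDSA);			/* survive even with no backends attached */
		AutoVacuumShmem->av_dsa_handle = dsa_get_handle(AutoVacuumDSA);
	}
	else
		AutoVacuumDSA = dsa_attach(AutoVacuumShmem->av_dsa_handle);
	dsa_pin_mapping(AutoVacuumDSA);		/* keep our mapping beyond the resource owner */
	LWLockRelease(AutovacuumLock);

	/* Any other backend: attach once, keep the mapping, stay attached. */
	AutoVacuumDSA = dsa_attach(AutoVacuumShmem->av_dsa_handle);
	dsa_pin_mapping(AutoVacuumDSA);

The dsa_pin() call is what stops the reference count from ever reaching
zero, and dsa_pin_mapping() is what stops the local mapping from being
cleaned up at end of transaction (more on that below).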
+		/* XXX surely there is a simpler way to do this */
+		wi_ptr = AutoVacuumShmem->av_workitems + sizeof(AutovacWorkItems) +
+			sizeof(AutoVacuumWorkItem) * i;
+		workitem = (AutoVacuumWorkItem *) dsa_get_address(AutoVacuumDSA, wi_ptr);

It'd probably be simpler to keep hold of the backend-local address of
the base of the workitems array and then use regular C language
facilities like array notation to work with it: workitem = &workitems[i]
(rough sketch further down).  And then:

+	/* ... and put it on autovacuum's to-do list */
+	workitems->avs_usedItems = wi_ptr;

Considering that i is really an index into the contiguous workitems
array, maybe you should just be storing the index i here, instead of
dealing in dsa_pointers.  The idea with dsa_pointers is that they're
useful for complex data structures that might point into different DSM
segments, like a hash table or binary tree whose internal pointers can
reference arbitrary other objects because the structure is allocated in
incremental pieces.  Here, you are dealing with objects in a contiguous
memory space of fixed size.  This leads to a bigger question about this
design, which I'll ask at the end.

Then at the bottom of AutoVacuumRequestWork:

+	LWLockRelease(AutovacuumLock);
+	dsa_detach(AutoVacuumDSA);
+	AutoVacuumDSA = NULL;
+}

I'm guessing that you intended to remain attached, rather than
detaching at the end like this?  Otherwise every backend that is
inserting lots of new data attaches and detaches repeatedly, which
seems unnecessary.  If you do stay attached, you'll need to run
dsa_pin_mapping() after attaching, or else the DSA area will be
unmapped at end of transaction and future attempts to access it will
segfault.

In dsm.c:

-	ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeDSMs(CurrentResourceOwner);

... and then:

-	seg->resowner = CurrentResourceOwner;
-	ResourceOwnerRememberDSM(CurrentResourceOwner, seg);
+	if (CurrentResourceOwner)
+	{
+		seg->resowner = CurrentResourceOwner;
+		ResourceOwnerRememberDSM(CurrentResourceOwner, seg);
+	}

This makes sense.  It allows DSMs (and therefore also DSA areas) to be
created when you don't have a resource owner.  In fact dsm.c contradicts
itself slightly in this area: dsm_create() clearly believes that
seg->resowner can be NULL after dsm_create_descriptor() returns (see the
code near the "too many dynamic shared memory segments" error), but
dsm_create_descriptor() doesn't believe that to be the case without your
patch, so perhaps this should be a separate commit to fix that rough
edge.  However, I think dsm_create_descriptor() still needs to assign
seg->resowner even when it's NULL, otherwise it's uninitialised.

My solution to this problem when I wrote a couple of different things
that used long-lived DSA areas (experimental things not posted on this
list) was to set up a CurrentResourceOwner with a name like "Foo Top
Level", and then after creating/attaching the area I'd call
dsa_pin_mapping(), which in turn calls dsm_pin_mapping() on all of its
segments.  Your solution starts out in the pinned-mapping state instead
(= disconnected from the resource owner machinery), which is better.

In AutoVacWorkerMain:

+			if (workitem->avw_database == MyDatabaseId && !workitem->avw_active)

Stepping over already-active items in the list seems OK because the
number of such items is bounded by the number of workers.  Stepping over
all items for other databases sounds quite expensive if it happens very
often, because these are not so bounded.
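To show what I mean about using the array base directly: untested, and
assuming (as your offset arithmetic above implies) that the
AutoVacuumWorkItem array sits immediately after the AutovacWorkItems
header in the same allocation:

	AutovacWorkItems   *workitems;
	AutoVacuumWorkItem *items;

	/* Translate the dsa_pointer to the whole block exactly once... */
	workitems = (AutovacWorkItems *)
		dsa_get_address(AutoVacuumDSA, AutoVacuumShmem->av_workitems);

	/* ...then use plain C array notation for individual items. */
	items = (AutoVacuumWorkItem *) (workitems + 1);
	workitem = &items[i];

Once you're doing that, storing plain array indexes in the list links
would work just as well as dsa_pointers, because everything lives inside
that one contiguous allocation.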
Ah, there can't be more than NUM_WORKITEMS, which is small.

+			/*
+			 * Remove the job we just completed from the used list and put
+			 * the array item back on the free list.
+			 */
+			workitems->avs_usedItems = workitem->avw_next;

Isn't this corrupting the avs_usedItems list if avw_next points to an
item that some other worker has removed from the list while we were
working on our item?

Stepping back from the code a bit:

What is your motivation for using DSA?  It seems you are creating an
area and then using it to make exactly one allocation of a constant size
known up front, to hold your fixed-size workitems array.  You don't do
any dynamic allocation at runtime, apart from the detail that the array
happens to be allocated on demand.  Perhaps it would make sense if you
had a separate space per database or something like that, so that the
shared memory need really would be dynamic?  (A rough sketch of the
fixed-size alternative follows below.)

It looks like outstanding autosummarisation work will be forgotten if
you restart before it is processed.

Over in another thread[1] we exchanged emails on another way to
recognise that summarisation work needs to be done, if we are only
interested in unsummarised ranges at the end of the heap.  I haven't
studied BRIN enough to know whether that is insufficient: can you finish
up with unsummarised ranges that are not in a contiguous range at the
end of the heap?  If so, perhaps the BRIN index itself should also have
a way to record that certain non-final ranges are unsummarised but
should be summarised asynchronously?  Then the system could be made to
behave exactly the same way no matter when reboots occur, which seems
like a good property.
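To make that bigger question concrete, here is the sort of alternative
I'm wondering about (purely an illustrative sketch; struct members other
than the ones quoted from your patch are invented by me).  Since the
array has a fixed size known at startup, it could live directly in the
regular autovacuum shared memory, with no DSA area, no attach/detach
dance and no dsa_pointers at all:

	typedef struct AutoVacuumWorkItem
	{
		Oid			avw_database;
		Oid			avw_relation;		/* invented for illustration */
		BlockNumber	avw_blockNumber;	/* invented for illustration */
		bool		avw_used;			/* invented: slot is occupied */
		bool		avw_active;
	} AutoVacuumWorkItem;

	typedef struct AutoVacuumShmemStruct
	{
		/* ... existing members ... */
		AutoVacuumWorkItem av_workItems[NUM_WORKITEMS];
	} AutoVacuumShmemStruct;

Requesting work would then just mean finding an unused slot under
AutovacuumLock.  But perhaps you have future plans that genuinely need
dynamic shared memory, hence the question.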
[1] https://www.postgresql.org/message-id/20170130191640.2johoyume5v2dbbq%40alvherre.pgsql

--
Thomas Munro
http://www.enterprisedb.com