Re: [PROPOSAL] Shared Ispell dictionaries - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Msg-id: d12d9395-922c-64c9-c87d-dd0e1d31440e@2ndquadrant.com
In response to: Re: [PROPOSAL] Shared Ispell dictionaries (Arthur Zakirov <a.zakirov@postgrespro.ru>)
List: pgsql-hackers
Hi Arthur,

I've done some initial review of the patch today, and here are some thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with MemoryContextStrdup, but I see spell.c already has two macros to allocate memory in the buildCxt. What about adding tmpstrdup to copy a string into the context? I admit this is mostly nitpicking, though.

0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear it's a maximum number of something, just like "max_parallel_workers" and other such GUCs. When I first saw "shared_dictionaries" in the patch, I thought it was a list of dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not necessarily required for v1, but perhaps useful for shaping it).

1) Why do we actually need the limit? Is it really necessary/useful?

When I wrote shared_ispell back in 2012, all we had were fixed segments allocated at startup, and so similar limits were a built-in restriction. But after the DSM stuff was introduced, I imagined it would no longer be necessary.

I realize the current implementation requires it, because the hash table is still created in an old-style memory context (and only the dictionaries are in DSM segments). But that seems fairly straightforward to fix by maintaining the hash table in a separate DSM segment too. A lookup of the dictionary DSM would then have to first check what the current hash table segment is, and then continue as now.

I'm not sure dynahash can live in a DSM segment, but we already have a hash table that supports that in dshash.c (which is also concurrent, although I'm not sure that's a major advantage for this use case).

2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current limit may not be the droid we're looking for, for a couple of reasons.
Firstly, it currently only matters during startup, when the dynahash is created. So to change the limit (e.g. to increase it) you actually have to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the values to get proper behavior. For example, it does not use the values directly but some higher value of the form 2^N. Which means the limit may not be enforced immediately when hitting the GUC value, but unexpectedly somewhat later.

And finally, I believe this is log-worthy - right now the dictionary load silently switches to backend memory (thus incurring all the parsing overhead). This certainly deserves at least a log message.

Actually, I'm not sure "number of dictionaries" is a particularly useful limit in the first place - that's not a number I really care about. But I do care about the amount of memory consumed by the loaded dictionaries. So I suggest adding such a "max memory for shared dictionaries" limit instead. I'm not sure we can enforce it strictly, because when deciding where to load the dictionary we haven't parsed it yet, and so don't know how much memory will be required. But I believe a lazy check should be fine (load it, and if we've exceeded the total memory, disable loading additional ones).

3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter whether it's the number of dictionaries or the memory used by them). How do I resolve that without restarting the database? How do I unload a dictionary (which may be unused) from shared memory?

    ALTER TEXT SEARCH DICTIONARY x UNLOAD

4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the files, or something like that). How do I reload the dictionary? Do I have to restart the server, DROP/CREATE everything again, or what? What about instead having something like this:

    ALTER TEXT SEARCH DICTIONARY x RELOAD

5) Actually, how do I list the currently loaded dictionaries (and how much memory they use in the shared memory)?
6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may be loaded into shared memory, and which should always be loaded into local backend memory. That is, something like:

    CREATE TEXT SEARCH DICTIONARY x (
        TEMPLATE = ispell,
        DictFile = czech,
        AffFile = czech,
        StopWords = czech,
        SharedMemory = true/false (default: false)
    );

because otherwise the dictionaries will compete for shared memory, and it's unclear which of them will get loaded. For a server with a single application that may not be a huge issue, but think about servers shared by multiple applications, etc.

In the extension this was achieved kind of explicitly by defining a separate 'shared_ispell' template, but if you modify the current one, that won't work, of course.

7) You mentioned you had to get rid of compact_palloc0 - can you elaborate a bit on why that was necessary? Also, when benchmarking the impact of this, make sure to measure not only the time but also the memory consumption. In fact, that was the main reason why Pavel implemented it in 2010: the czech dictionary takes quite a bit of memory, and without the shared memory a copy was kept in every backend. Of course, maybe that's mostly irrelevant thanks to this patch (due to changes to the representation and keeping just a single copy).

8) One more thing - I've noticed that the hash table uses this key:

    typedef struct
    {
        char        dictfile[MAXPGPATH];
        char        afffile[MAXPGPATH];
    } TsearchDictKey;

That is, full paths to the two files, and I'm not sure that's a very good idea. Firstly, it's a bit wasteful (1kB per path). But more importantly, it means all dictionaries referencing the same files will share the same chunk of shared memory - not only within a single database, but across the whole cluster. That may lead to surprising behavior, because e.g. unloading a dictionary in one database will affect dictionaries in all other databases referencing the same files.
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services