Re: [PROPOSAL] Shared Ispell dictionaries - Mailing list pgsql-hackers

From: Arthur Zakirov
Subject: Re: [PROPOSAL] Shared Ispell dictionaries
Date:
Msg-id: 20180124172039.GA11210@zakirov.localdomain
In response to: Re: [PROPOSAL] Shared Ispell dictionaries (Arthur Zakirov <a.zakirov@postgrespro.ru>)
Responses: Re: [PROPOSAL] Shared Ispell dictionaries
           Re: [PROPOSAL] Shared Ispell dictionaries
List: pgsql-hackers
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they can
> be applied independently but consistently. I suppose I will prepare a new
> version of the patch with fixes and with an initial design of new
> functions and commands soon.

I attached a new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. A hash table is created during
the first call of a text search function in an instance. The hash table is
keyed by the OID of a dictionary instead of file names, so there is no
cross-database sharing at all.

Added a max_shared_dictionaries_size GUC instead of shared_dictionaries. In
the current version it can be set only at server start. If a dictionary is
allocated in a backend's memory instead of shared memory, a LOG message is
raised which includes the OID of the dictionary.

Fixed a memory leak. When a dictionary is removed and the dictionary cache
is invalidated, ts_dict_shmem_release() is called. It unpins the mapping of
a dictionary; if the reference count reaches zero, the DSM segment is
unpinned as well, so the allocated shared memory is released by Postgres.

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) has a second
argument, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries, including contrib extensions,
because of the second argument for tmplinit. tmplinit has the following
signature now:

dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added the pg_ts_shared_dictionaries() function and the
pg_ts_shared_dictionaries system view.
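With the patch applied, the view could be queried like any other catalog
view. A sketch (the pg_size_pretty() call and the column type it assumes
are my illustration, not output from the patch):

```sql
-- List the text search dictionaries currently loaded into shared memory,
-- with their size made human-readable.
SELECT dictoid,
       schemaname,
       dictname,
       pg_size_pretty(size) AS size
FROM pg_ts_shared_dictionaries;
```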
They return a list of dictionaries currently in shared memory, with the
columns:

- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added a SharedMemory option for the Ispell dictionary template. It is true
by default, because I think it would be good that people don't have to do
anything to allocate dictionaries in shared memory.

Setting SharedMemory=false during ALTER TEXT SEARCH DICTIONARY doesn't have
an immediate effect. This is because ALTER doesn't force invalidation of
the dictionary cache, if I'm not mistaken.

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions
ts_unload() and ts_reload() instead of new syntax. We already have text
search functions like ts_lexize() and ts_debug(), and it is better to keep
consistency.

I think there are two approaches for ts_unload():

- use DSM's pin and unpin methods and the invalidation callback, as was
  done when fixing the memory leak. The drawback is that it won't have an
  immediate effect, because the DSM segment will be released only when all
  backends have unpinned their mapping of it.
- use DSA and the dsa_free() method. As far as I understand, dsa_free()
  frees the allocated memory immediately. But it requires more work,
  because we will need some more locks. For instance, consider what
  happens when someone calls ts_lexize() and someone else calls dsa_free()
  at the same time.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need for compact_palloc0() anymore. Tests
show that the Czech dictionary doesn't consume more memory after the
patch.

Tests
-----

I've measured creation time of dictionaries on my 64-bit machine. You can
get them from [1].
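As a rough sketch of how such a timing can be taken from psql (dictionary
names follow [1]; the exact measurement setup here is illustrative):

```sql
-- The first ts_lexize() call in a session forces the dictionary to be
-- built (or, with the patch, attached from shared memory), so timing it
-- approximates dictionary creation time.
\timing on
SELECT ts_lexize('czech_hunspell', 'slon');
\timing off
```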
Here the master is 434e6e1484418c55561914600de9e180fc408378. I've measured
the French dictionary too because it has an even bigger affix file than the
Czech dictionary.

With patch:

czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:

czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):

czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 619280 used

Here the French dictionary uses more backend memory because it has a big
affix file. Regular expression structures are still stored in backend
memory.

Master (backend's memory):

czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 chunks); 11420352 used

You can see that dictionaries now take almost two times less memory.

pgbench with a select-only script:

SELECT ts_lexize('czech_hunspell', 'slon');
patch: 30431 TPS
master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant');
patch: 35029 TPS
master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
patch: 22264 TPS
master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company