Re: [PROPOSAL] Shared Ispell dictionaries - Mailing list pgsql-hackers

From Arthur Zakirov
Subject Re: [PROPOSAL] Shared Ispell dictionaries
Date
Msg-id 20180124172039.GA11210@zakirov.localdomain
Whole thread Raw
In response to Re: [PROPOSAL] Shared Ispell dictionaries  (Arthur Zakirov <a.zakirov@postgrespro.ru>)
Responses Re: [PROPOSAL] Shared Ispell dictionaries
Re: [PROPOSAL] Shared Ispell dictionaries
List pgsql-hackers
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they can
> be applyed independently but consistently. I suppose I will prepare new
> version of the patch with fixes and with initial design of new functions
> and commands soon.

I attached new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. A hash table is created during first call of a text search function in an
instance.A hash table uses OID of a dictionary instead of file names, so there is no cross-db sharing at all.
 

Added max_shared_dictionaries_size GUC instead of shared_dictionaries. In current version it can be set only at server
start.If a dictionary is allocated in a backend's memory instead of shared memory then LOG message will be raised which
includesOID of the dictionary.
 

Fixed memory leak. During removing a dictionary and invalidating dictionaries cash ts_dict_shmem_release() is called.
Itunpins mapping of a dictionary, if reference count reaches zero then DSM segment will be unpinned. So allocated
sharedmemory will be released by Postgres.
 

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) has second argument, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries including contrib extensions because of second argument for tmplinit.

tmplinit has the following signature now:

dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added pg_ts_shared_dictionaries() function and pg_ts_shared_dictionaries system view. They return a list of
dictionariescurrently in shared memory, with the columns:
 
- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added SharedMemory option for Ispell dictionary template. It is true by default, because I think it would be good that
peoplewill haven't to do anything to allocate dictionaries in shared memory.
 

Setting SharedMemory=false during ALTER TEXT SEARCH DICTIONARY hasn't immediate effect. It is because ALTER doesn't
forceto invalidate dictionaries cache, if I'm not mistaken.
 

> 3) How do I unload a dictionary from the shared memory?
> ...
>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
>     ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions ts_unload() and ts_reload() instead of new syntax. We
alreadyhave text search functions like ts_lexize() and ts_debug(), and it is better to keep consistency. I think there
areto approach for ts_unload():
 
- use DSM's pin and unpin methods and the invalidation callback, as it done during fixing memory leak. It has the
drawbackthat it won't have an immediate effect, because DSM will be released only when all backends unpin DSM mapping.
 
- use DSA and dsa_free() method. As far as I understand dsa_free() frees allocated memory immediatly. But it requires
morework to do, because we will need some more locks. For instance, what happens when someone calls ts_lexize() and
someoneelse calls dsa_free() at the same time.
 

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need compact_palloc0() anymore. Tests show that czech dictionary doesn't consume more
memoryafter the patch.
 

Tests
-----

I've measured creation time of dictionaries on my 64-bit machine. You can get them from [1]. Here the master is
434e6e1484418c55561914600de9e180fc408378.I've measured french dictionary too because it has even bigger affix file than
czechdictionary.
 

With patch:
czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:
czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):
czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 619280 used

Here french dictionary uses more backend's memory because it has big affix file. Regular expression structures are
storedin backend's memory still.
 

Master (backend's memory):
czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 chunks); 11420352 used

You can see that dictionaries now takes almost two times less memory.

pgbench with select only script:

SELECT ts_lexize('czech_hunspell', 'slon');
patch: 30431 TPS
master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant'):
patch: 35029 TPS
master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
patch: 22264 TPS
master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump
Next
From: chenhj
Date:
Subject: Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles