Thread: [PROPOSAL] Shared Ispell dictionaries
Hello, hackers!

Introduction
------------

I'm going to implement a patch which will store Ispell dictionaries in
shared memory. There is an extension shared_ispell [1], developed by
Tomas Vondra, but it is a bad candidate for inclusion into contrib,
because it has to know a lot about the internals of the IspellDict
struct to copy it into shared memory.

Why
---

A shared Ispell dictionary gives the following improvements:
- less memory is consumed - currently an Ispell dictionary is loaded
  into the memory of every backend, and some dictionaries require more
  than 100MB
- there is no overhead during the first call of a full text search
  function (such as to_tsvector(), to_tsquery())

Implementation
--------------

It is necessary to change all structures related to IspellDict: SPNode,
AffixNode, AFFIX, CMPDAffix and IspellDict itself. For this reason they
all shouldn't use pointers. The other structures are used only during
dictionary building. It would be good to store the StopList struct in
shared memory too.

All fields of the IspellDict struct which are used only during
dictionary building will be moved into a new IspellDictBuild struct to
decrease the needed shared memory size. They are going to be released
with buildCxt.

Each dictionary will be stored in its own DSM segment. Structures for
regular expressions won't be stored in shared memory; they are compiled
for every backend.

The patch will be ready and added into the 2018-03 commitfest.

Thank you for your attention. Any thoughts?

1 - github.com/tvondra/shared_ispell or github.com/postgrespro/shared_ispell

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Arthur Zakirov wrote:

> Implementation
> --------------
>
> It is necessary to change all structures related to IspellDict:
> SPNode, AffixNode, AFFIX, CMPDAffix and IspellDict itself. For this
> reason they all shouldn't use pointers. The other structures are used
> only during dictionary building.

So what are you going to use instead?

> It would be good to store the StopList struct in shared memory too.

Sure (probably a separate patch though).

> All fields of the IspellDict struct which are used only during
> dictionary building will be moved into a new IspellDictBuild struct
> to decrease the needed shared memory size. They are going to be
> released with buildCxt.
>
> Each dictionary will be stored in its own DSM segment.

All that sounds reasonable.

> The patch will be ready and added into the 2018-03 commitfest.

So this will be a large patch not submitted to 2018-01?  Depending on
size/complexity I'm not sure it's OK to submit 2018-03 only -- it may
be too late.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2017-12-26 17:55 GMT+01:00 Alvaro Herrera <alvherre@alvh.no-ip.org>:
> Arthur Zakirov wrote:
> > Implementation
> > --------------
> >
> > It is necessary to change all structures related to IspellDict:
> > SPNode, AffixNode, AFFIX, CMPDAffix and IspellDict itself. For this
> > reason they all shouldn't use pointers. The other structures are
> > used only during dictionary building.
>
> So what are you going to use instead?
>
> > It would be good to store the StopList struct in shared memory too.
>
> Sure (probably a separate patch though).
>
> > All fields of the IspellDict struct which are used only during
> > dictionary building will be moved into a new IspellDictBuild struct
> > to decrease the needed shared memory size. They are going to be
> > released with buildCxt.
> >
> > Each dictionary will be stored in its own DSM segment.
>
> All that sounds reasonable.
>
> > The patch will be ready and added into the 2018-03 commitfest.
>
> So this will be a large patch not submitted to 2018-01?  Depending on
> size/complexity I'm not sure it's OK to submit 2018-03 only -- it may
> be too late.

Tomas had some workable patches related to this topic

Regards

Pavel

> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for your feedback.

On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> So what are you going to use instead?

For example, AffixNode and AffixNodeData represent the prefix tree of
an affix list. They are accessed via the Suffix and Prefix pointers of
the IspellDict struct now. Instead, all affix nodes should be placed
into an array and accessed by an offset: the Suffix array goes first,
the Prefix array goes after it. AffixNodeData will access a child node
by an offset too.

The AffixNodeData struct has an array of pointers to the AFFIX struct.
This array with all the AFFIX data can be stored within AffixNodeData.
Or AffixNodeData can have an array of indexes into a single AFFIX
array, which is stored within IspellDict before or after Suffix and
Prefix.

The same applies to the prefix tree of the word list, represented by
the SPNode struct. It might be stored as an array after the Prefix
array. The AffixData and CompoundAffix arrays go after them.

To allocate IspellDict in this case it is necessary to calculate the
needed memory size. I think the arrays mentioned above will be built
first and then memcpy'ed into IspellDict, if it won't take much time.

Hope it makes sense and is reasonable.

> So this will be a large patch not submitted to 2018-01?  Depending on
> size/complexity I'm not sure it's OK to submit 2018-03 only -- it may
> be too late.

Oh, I see. I will try to prepare the patch while 2018-01 is open.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
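To make the offset scheme concrete, here is a minimal sketch of what a
pointer-free node might look like (field and macro names are
illustrative assumptions, not actual patch code):

    typedef struct AffixNodeData
    {
        uint32      val;        /* character stored in this node */
        uint32      naff;       /* number of affixes attached here */
        uint32      affstart;   /* index into the single AFFIX array */
        uint32      node;       /* offset of the child node, 0 = none */
    } AffixNodeData;

    /* All nodes live in one flat array inside the dictionary blob, so
     * a child is reached by offset arithmetic instead of by pointer: */
    #define NodeByOffset(base, offset) \
        ((AffixNodeData *) ((char *) (base) + (offset)))

Since every reference is relative to the start of the blob, the
structure stays valid after being memcpy'ed into shared memory, which
is exactly what raw pointers would break.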
Arthur Zakirov wrote:

> On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> > So what are you going to use instead?
>
> [ ... ]
>
> To allocate IspellDict in this case it is necessary to calculate the
> needed memory size. I think the arrays mentioned above will be built
> first and then memcpy'ed into IspellDict, if it won't take much time.

OK, that sounds sensible on first blush. If there are many processes
concurrently doing text searches, then the amount of memory saved may
be large enough to justify the additional processing (more so if it's
just one more memcpy cycle). I hope that there is some way to cope with
the ispell data changing underneath -- maybe you'll need some sort of
RCU?

> > So this will be a large patch not submitted to 2018-01?  Depending
> > on size/complexity I'm not sure it's OK to submit 2018-03 only --
> > it may be too late.
>
> Oh, I see. I will try to prepare the patch while 2018-01 is open.

It isn't necessary that the patch you present to 2018-01 is final and
complete (so don't kill yourself to achieve that) -- a preliminary
patch that reviewers can comment on is enough, as long as the final
patch you present to 2018-03 is not *too* different. But any
medium-large patch whose first post is to the last commitfest of a
cycle is likely to be thrown out to the next cycle's first commitfest
very quickly.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>
> Tomas had some workable patches related to this topic
>

Tomas, are you planning to propose them?

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello, hackers,

On Tue, Dec 26, 2017 at 07:48:27PM +0300, Arthur Zakirov wrote:
> The patch will be ready and added into the 2018-03 commitfest.

I attached the patch itself.

0001-Fix-ispell-memory-handling.patch:

Some strings are allocated via compact_palloc0(). But they are not
persistent, so they should be allocated using a temporary memory
context. Also a couple of strings are not released if the .aff file had
the new format.

0002-Retreive-shmem-location-for-ispell.patch:

Adds the ispell_shmem_location() function, which looks up the location
for a dictionary using its .dict and .aff file names. If the location
hasn't been allocated in DSM earlier, it allocates it. A shared hash
table is used to search for the location. The maximum number of
elements in the hash table is NUM_DICTIONARIES=20 now; it would be
better to use a GUC variable. Also, if the number of elements reaches
the limit, it would be good to use the backend's local memory instead
of shared memory.

0003-Store-ispell-structures-in-shmem.patch:

Introduces the IspellDictBuild and IspellDictData structures and
removes the IspellDict structure. IspellDictBuild is used while
building the dictionary within dispell_build(), if the dictionary
hasn't been allocated in DSM earlier. IspellDictBuild has a pointer to
the IspellDictData structure, which will be filled with persistent
data. After building the dictionary, IspellDictData is copied into the
DSM location and the temporary data of IspellDictBuild is released.

All prefix trees are stored as flat arrays now. Those arrays are
allocated and stored using the NodeArray struct. A required node can be
retrieved by its node offset. The AffixData and Affix arrays have an
additional offset array to retrieve an element by index.

The Affix field (array of AFFIX) of IspellDictBuild is persistent data
too, but it is constructed as a temporary array first, because the
Affix array needs to be sorted via qsort() within NISortAffixes().

So IspellDictData stores:
- AffixData - array of strings, accessed via AffixDataOffset
- Affix - array of AFFIX, accessed via AffixOffset
- DictNodes, PrefixNodes, SuffixNodes - prefix trees as plain arrays
- CompoundAffix - array of CMPDAffix, accessed sequentially

I had to remove compact_palloc0(), added by Pavel in
3e5f9412d0a818be77c974e5af710928097b91f3. The Ispell dictionary doesn't
need such allocation anymore; it was used for small allocations. I will
definitely check the performance of the Czech dictionary.

There are issues left to do:
- add the GUC variable for the hash table limit
- fix bugs
- improve comments
- performance testing

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
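Putting the pieces together, the persistent part described above might
be laid out roughly like this (a sketch assuming a single
variable-length chunk addressed purely by offsets; field names are
illustrative):

    typedef struct IspellDictData
    {
        Size    AffixDataOffset;     /* string array plus its offset index */
        Size    AffixOffset;         /* sorted array of AFFIX */
        Size    DictNodesOffset;     /* flat SPNode prefix tree */
        Size    PrefixNodesOffset;   /* flat AffixNode tree for prefixes */
        Size    SuffixNodesOffset;   /* flat AffixNode tree for suffixes */
        Size    CompoundAffixOffset; /* array of CMPDAffix */
        char    data[FLEXIBLE_ARRAY_MEMBER];
    } IspellDictData;

Because the struct contains no pointers, a single memcpy() of the whole
chunk into a DSM segment publishes it, and any backend that maps the
segment can use it directly.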
On Sun, Dec 31, 2017 at 06:28:13PM +0300, Arthur Zakirov wrote:
>
> There are issues left to do:
> - add the GUC variable for the hash table limit
> - fix bugs
> - improve comments
> - performance testing

Here is the second version of the patch.

0002-Retreive-shmem-location-for-ispell-v2.patch:

Fixed some bugs and added the GUC variable "shared_dictionaries". Added
documentation for it. I'm not sure about the order of configuration
parameters in section "19.4.1. Memory". Now "shared_dictionaries" goes
after "shared_buffers". Maybe it would be good to make a patch which
sorts the parameters in alphabetical order?

0003-Store-ispell-structures-in-shmem-v2.patch:

Fixed some bugs; regression tests pass now. I added more comments and
fixed old ones.

I also tested with the Hunspell dictionaries [1]. They are good too.
Results of performance testing of Ispell and Hunspell dictionaries will
be ready soon.

1 - github.com/postgrespro/hunspell_dicts

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi Arthur,

Sorry for the delay, I somehow missed this thread ...

On 12/27/2017 10:20 AM, Arthur Zakirov wrote:
> On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>>
>> Tomas had some workable patches related to this topic
>>
>
> Tomas, are you planning to propose them?

I believe Pavel was referring to this extension:

https://github.com/tvondra/shared_ispell

I wasn't going to submit that as an in-core solution, but I'm happy
you're making improvements in that direction. I'll take a look at your
patch shortly.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for your answer.

On Mon, Jan 08, 2018 at 06:12:37PM +0100, Tomas Vondra wrote:
>
> I believe Pavel was referring to this extension:
>
> https://github.com/tvondra/shared_ispell

Oh, understood.

> I wasn't going to submit that as an in-core solution, but I'm happy
> you're making improvements in that direction. I'll take a look at
> your patch shortly.

There is the second version of the patch. But I've noticed a
performance regression in ts_lexize() and I will try to find where the
overhead hides.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi Arthur,

I've done some initial review of the patch today, and here are some
thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with
MemoryContextStrdup, but I see spell.c already has two macros to
allocate memory in the buildCxt. What about adding tmpstrdup to copy a
string into the context? I admit this is mostly nitpicking though.

0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear it's a maximum number of
something, just like "max_parallel_workers" and other such GUCs. When I
first saw "shared_dictionaries" in the patch I thought it's a list of
dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not
necessarily required for v1, but perhaps useful for shaping it).

1) Why do we actually need the limit? Is it really necessary / useful?

When I wrote shared_ispell back in 2012, all we had were fixed segments
allocated at start, and so similar limits were a built-in restriction.
But after the DSM stuff was introduced I imagined it would not be
necessary.

I realize the current implementation requires that, because the hash
table is still created in an old-style memory context (and only the
dictionaries are in DSM segments). But that seems fairly
straightforward to fix by maintaining the hash table in a separate DSM
segment too. So a lookup of the dictionary DSM would have to first
check what the current hash table segment is, and then continue as now.

I'm not sure if dynahash can live in a DSM segment, but we already have
a hash table that supports that in dshash.c (which is also concurrent,
although I'm not sure if that's a major advantage for this use case).

2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current
limit may not be the droid we're looking for, for a couple of reasons.

Firstly, currently it only matters during startup, when the dynahash is
created. So to change the limit (e.g. to increase it) you actually have
to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the values to get proper behavior. For
example, it's not using the values directly but rounds them up to the
nearest value of 2^N form. Which means the limit may not be enforced
immediately when hitting the GUC value, but unexpectedly somewhat
later.

And finally, I believe this is log-worthy - right now the dictionary
load silently switches to backend memory (thus incurring all the
parsing overhead). This certainly deserves at least a log message.

Actually, I'm not sure "number of dictionaries" is a particularly
useful limit in the first place - that's not a number I really care
about. But I do care about the amount of memory consumed by the loaded
dictionaries. So I do suggest adding such a "max memory for shared
dictionaries" limit.

I'm not sure we can enforce it strictly, because when deciding where to
load the dict we haven't parsed it yet and so don't know how much
memory will be required. But I believe a lazy check should be fine
(load it, and if we exceeded the total memory, disable loading
additional ones).

3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter if it's the number
of dictionaries or memory used by them). How do I resolve that without
restarting the database? How do I unload a dictionary (which may be
unused) from shared memory?

    ALTER TEXT SEARCH DICTIONARY x UNLOAD

4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the
files, or something like that). How do I reload the dictionary? Do I
have to restart the server, DROP/CREATE everything again, or what? What
about instead having something like this:

    ALTER TEXT SEARCH DICTIONARY x RELOAD

5) Actually, how do I list currently loaded dictionaries (and how much
memory they use in the shared memory)?

6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may
be loaded into shared memory, and which should always be loaded into
local backend memory. That is, something like

    CREATE TEXT SEARCH DICTIONARY x (
        TEMPLATE = ispell,
        DictFile = czech,
        AffFile = czech,
        StopWords = czech,
        SharedMemory = true/false (default: false)
    );

because otherwise the dictionaries will compete for shared memory, and
it's unclear which of them will get loaded. For a server with a single
application that may not be a huge issue, but think about servers
shared by multiple applications, etc.

In the extension this was achieved kinda explicitly by definition of a
separate 'shared_ispell' template, but if you modify the current one
that won't work, of course.

7) You mentioned you had to get rid of the compact_palloc0 - can you
elaborate a bit why that was necessary? Also, when benchmarking the
impact of this make sure to measure not only the time but also memory
consumption. In fact, that was the main reason why Pavel implemented it
in 2010, because the czech dictionary takes quite a bit of memory, and
without the shared memory a copy was kept in every backend. Of course,
maybe that would be mostly irrelevant thanks to this patch (due to
changes to the representation and keeping just a single copy).

8) One more thing - I've noticed that the hash table uses this key:

    typedef struct
    {
        char    dictfile[MAXPGPATH];
        char    afffile[MAXPGPATH];
    } TsearchDictKey;

That is, full paths to the two files, and I'm not sure that's a very
good idea. Firstly, it's a bit wasteful (1kB per path). But more
importantly it means all dictionaries referencing the same files will
share the same chunk of shared memory - not only within a single
database, but across the whole cluster. That may lead to surprising
behavior, because e.g. unloading a dictionary in one database will
affect dictionaries in all other databases referencing the same files.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
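For illustration, a narrower key that avoids both the MAXPGPATH
overhead and the surprising cross-cluster sharing could look like this
(a hypothetical sketch, not code from the patch):

    typedef struct TsearchDictKey
    {
        Oid     db_id;      /* database OID; drop this field to allow
                             * cross-database sharing */
        Oid     dict_id;    /* OID of the pg_ts_dict entry */
    } TsearchDictKey;

Eight bytes per key instead of two kilobytes, and unloading a
dictionary in one database can no longer affect another database unless
sharing is explicitly wanted.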
Hello,

Thank you Tomas for your review.

On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
> allocate memory in the buildCxt. What about adding tmpstrdup to copy
> a string into the context? I admit this is mostly nitpicking though.

I agree about tmpstrdup(). It will be self-consistent with tmpalloc().

> 1) Why do we actually need the limit? Is it really necessary /
> useful?
> ...
> I realize the current implementation requires that, because the hash
> table is still created in an old-style memory context (and only the
> dictionaries are in DSM segments).

Yes indeed. I tried to implement dynahash via DSM, but I failed. It
seems to me that dynahash can work only in an old-style memory context.

> I'm not sure if dynahash can live in a DSM segment, but we already
> have a hash table that supports that in dshash.c (which is also
> concurrent, although I'm not sure if that's a major advantage for
> this use case).

Thank you a lot for pointing at dshash.c. I think this is just what we
need. I will try to use it in the new version of the patch.

> 2) Do we actually want/need some limits? Which ones?
> ...
> And finally, I believe this is log-worthy - right now the dictionary
> load silently switches to backend memory (thus incurring all the
> parsing overhead). This certainly deserves at least a log message.

I think such a log message may be useful, so I will add it too.

> So I do suggest adding such a "max memory for shared dictionaries"
> limit. I'm not sure we can enforce it strictly, because when deciding
> where to load the dict we haven't parsed it yet and so don't know how
> much memory will be required. But I believe a lazy check should be
> fine (load it, and if we exceeded the total memory, disable loading
> additional ones).

With dshash in DSM it seems that the shared_dictionaries GUC variable
is not needed anymore. I agree that another GUC variable (for example,
max_shared_dictionaries_size) may be useful. But maybe it's worth
checking the size of a dictionary only after actually compiling it? We
can do the following:
- within ispell_shmem_location() build a dictionary using the callback
  function
- the callback function returns its size; if the dictionary doesn't fit
  into the remaining shared space, ispell_shmem_location() will just
  return a pointer to the palloc'ed and compiled dictionary without
  creating a DSM segment (see the sketch after this message)

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I think this syntax would be very useful not only for Ispell but for
other dictionaries too. So the init_function of a text search template
may return pointers to C functions which unload and reload
dictionaries. This approach doesn't require changing the catalog by
adding additional functions to the template [1]. If the init_function
of a template didn't return the pointers, then the template doesn't
support unloading or reloading, and the UNLOAD and RELOAD commands
should throw an error if a user calls them for such a template.

> 5) Actually, how do I list currently loaded dictionaries (and how
> much memory they use in the shared memory)?

This may be very useful too. The function could be called
pg_get_shared_dictionaries().

> 6) What other restrictions would be useful?
> ...
> CREATE TEXT SEARCH DICTIONARY x (
>     TEMPLATE = ispell,
>     DictFile = czech,
>     AffFile = czech,
>     StopWords = czech,
>     SharedMemory = true/false (default: false)
> );

Hm, I didn't think about such an option. It would be a very simple way
of shared dictionary control for a user.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

As I understood from commit 3e5f9412d0a818be77c974e5af710928097b91f3,
compact_palloc0() reduces the overhead of a lot of pallocs for small
chunks of data. The persistent data of the patch should not suffer from
this overhead, because persistent data is allocated in big chunks. But
now I realize that we can keep compact_palloc0() for small chunks of
temporary data, so it may be worth saving compact_palloc0().

> 8) One more thing - I've noticed that the hash table uses this key:
> ...
> That is, full paths to the two files, and I'm not sure that's a very
> good idea. Firstly, it's a bit wasteful (1kB per path). But more
> importantly it means all dictionaries referencing the same files will
> share the same chunk of shared memory - not only within a single
> database, but across the whole cluster. That may lead to surprising
> behavior, because e.g. unloading a dictionary in one database will
> affect dictionaries in all other databases referencing the same
> files.

Hm, indeed. It's worth using only file names instead of full paths. And
it is a good idea to use more information besides the file names. It
can be the OID of a database and the OID of a namespace maybe, because
a dictionary can be created in different schemas.

I think your proposals may be implemented in several patches, so they
can be applied independently but consistently. I suppose I will prepare
a new version of the patch with fixes and with an initial design of the
new functions and commands soon.

1 - https://www.postgresql.org/docs/current/static/sql-createtstemplate.html

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
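A minimal sketch of that lazy size check, under the assumption of a
simplified callback signature and backend-local accounting of the
already-loaded size (the real patch would have to track that in shared
memory); names here are illustrative:

    #include "postgres.h"
    #include "storage/dsm.h"

    typedef void *(*ispell_build_callback) (void *arg, Size *size);

    extern int  max_shared_dictionaries_size;   /* GUC, in kilobytes */

    static Size loaded_size = 0;    /* simplification, see note above */

    void *
    ispell_shmem_location(void *arg, ispell_build_callback build)
    {
        Size        size;
        void       *dict = build(arg, &size);  /* compile in local memory */
        dsm_segment *seg;

        /* over the limit: keep the backend-local copy */
        if (loaded_size + size > (Size) max_shared_dictionaries_size * 1024)
            return dict;

        /* fits: publish the compiled blob in a new DSM segment */
        seg = dsm_create(size, 0);
        memcpy(dsm_segment_address(seg), dict, size);
        dsm_pin_segment(seg);       /* keep it alive for other backends */
        dsm_pin_mapping(seg);       /* keep it mapped in this backend */
        loaded_size += size;
        pfree(dict);

        return dsm_segment_address(seg);
    }

The expensive compilation happens either way; only the destination of
the result depends on the limit.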
On 01/13/2018 04:22 PM, Arthur Zakirov wrote:
> Hello,
>
> Thank you Tomas for your review.
>
> On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
>> allocate memory in the buildCxt. What about adding tmpstrdup to copy
>> a string into the context? I admit this is mostly nitpicking though.
>
> ... snip ...
>
>> 8) One more thing - I've noticed that the hash table uses this key:
>> ...
>> That is, full paths to the two files, and I'm not sure that's a very
>> good idea. Firstly, it's a bit wasteful (1kB per path). But more
>> importantly it means all dictionaries referencing the same files
>> will share the same chunk of shared memory - not only within a
>> single database, but across the whole cluster. That may lead to
>> surprising behavior, because e.g. unloading a dictionary in one
>> database will affect dictionaries in all other databases referencing
>> the same files.
>
> Hm, indeed. It's worth using only file names instead of full paths.
> And it is a good idea to use more information besides the file names.
> It can be the OID of a database and the OID of a namespace maybe,
> because a dictionary can be created in different schemas.

I doubt using filenames (without the directory paths) solves anything,
really. The keys still have to be MAXPGPATH because someone could
create a very long filename. But I don't think memory consumption is
such a big deal, really. With 1000 dictionaries it's still just ~2MB of
data, which is negligible compared to the amount of memory saved by
sharing the dictionaries.

Not sure if we really need to add the database/schema OIDs. I mentioned
the unexpected consequences (cross-db sharing) but maybe that's a
feature we should keep (it reduces memory usage). So perhaps this
should be another CREATE TEXT SEARCH DICTIONARY parameter, allowing
sharing the dictionary with other databases?

Aren't we overengineering this?

> I think your proposals may be implemented in several patches, so they
> can be applied independently but consistently. I suppose I will
> prepare a new version of the patch with fixes and with an initial
> design of the new functions and commands soon.

Yes, splitting patches into smaller, more focused bits is a good idea.

BTW the current patch fails to document the dictionary sharing. It only
mentions it when describing the shared_dictionaries GUC. IMHO the right
place for additional details is
https://www.postgresql.org/docs/10/static/textsearch-dictionaries.html

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
> Not sure if we really need to add the database/schema OIDs. I
> mentioned the unexpected consequences (cross-db sharing) but maybe
> that's a feature we should keep (it reduces memory usage). So perhaps
> this should be another CREATE TEXT SEARCH DICTIONARY parameter,
> allowing sharing the dictionary with other databases?
>
> Aren't we overengineering this?

Another related problem I've noticed is a memory leak. When a
dictionary is loaded and then dropped, it won't be unloaded.

I see several approaches:
1 - Use the OID of the dictionary itself as the key instead of dictfile
    and afffile. When the dictionary is dropped it will be easily
    unloaded if it was loaded. Implementing this should be easy, but
    the drawback is more memory consumption.
2 - Use a reference counter with cross-db sharing. When the dictionary
    is loaded the counter increases. If all records of a loaded
    dictionary are dropped, it will be unloaded.
3 - Or reference counters without cross-db sharing, to avoid possible
    confusion. Here dictfile, afffile and the database OID will be used
    as the key.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
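A sketch of what the reference-counted variants (2 and 3) could store
per hash table entry, with illustrative names:

    typedef struct TsearchDictEntry
    {
        TsearchDictKey key;     /* dictfile/afffile, plus db OID for (3) */
        dsm_handle     handle;  /* DSM segment holding the compiled dict */
        int            refcnt;  /* dictionary definitions using it */
    } TsearchDictEntry;

DROP TEXT SEARCH DICTIONARY would decrement refcnt; on reaching zero
the entry is deleted and the segment unpinned, which fixes the leak.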
On 01/15/2018 08:02 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
>> Not sure if we really need to add the database/schema OIDs. I
>> mentioned the unexpected consequences (cross-db sharing) but maybe
>> that's a feature we should keep (it reduces memory usage). So
>> perhaps this should be another CREATE TEXT SEARCH DICTIONARY
>> parameter, allowing sharing the dictionary with other databases?
>>
>> Aren't we overengineering this?
>
> Another related problem I've noticed is a memory leak. When a
> dictionary is loaded and then dropped, it won't be unloaded.

Good point.

> I see several approaches:
> 1 - Use the OID of the dictionary itself as the key instead of
>     dictfile and afffile. When the dictionary is dropped it will be
>     easily unloaded if it was loaded. Implementing this should be
>     easy, but the drawback is more memory consumption.
> 2 - Use a reference counter with cross-db sharing. When the
>     dictionary is loaded the counter increases. If all records of a
>     loaded dictionary are dropped, it will be unloaded.
> 3 - Or reference counters without cross-db sharing, to avoid possible
>     confusion. Here dictfile, afffile and the database OID will be
>     used as the key.

I think you're approaching the problem from the wrong direction, hence
asking the wrong question. I think the primary question is "Do we want
to share dictionaries across databases?" and the answer will determine
which of the three options is the right one.

Another important consideration is the complexity of the patch. In
fact, I suggest making it your goal to make the initial patch as simple
as possible. If something is "nice to have", it may wait for v2.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they
> can be applied independently but consistently. I suppose I will
> prepare a new version of the patch with fixes and with an initial
> design of the new functions and commands soon.

I attached a new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy
> a string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. A hash table is created
during the first call of a text search function in an instance. The
hash table uses the OID of a dictionary instead of file names, so there
is no cross-db sharing at all.

Added the max_shared_dictionaries_size GUC instead of
shared_dictionaries. In the current version it can be set only at
server start. If a dictionary is allocated in a backend's memory
instead of shared memory, a LOG message is raised which includes the
OID of the dictionary.

Fixed the memory leak. During removal of a dictionary and invalidation
of the dictionary cache, ts_dict_shmem_release() is called. It unpins
the mapping of a dictionary; if the reference count reaches zero, the
DSM segment will be unpinned. So the allocated shared memory will be
released by Postgres.

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) has a second
argument now, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries, including contrib extensions,
because of the second argument for tmplinit. tmplinit has the following
signature now:

    dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added the pg_ts_shared_dictionaries() function and the
pg_ts_shared_dictionaries system view. They return a list of
dictionaries currently in shared memory, with the columns:
- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added the SharedMemory option for the Ispell dictionary template. It is
true by default, because I think it would be good that people won't
have to do anything to allocate dictionaries in shared memory.

Setting SharedMemory=false during ALTER TEXT SEARCH DICTIONARY doesn't
have an immediate effect. This is because ALTER doesn't force
invalidation of the dictionary cache, if I'm not mistaken.

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions
ts_unload() and ts_reload() instead of new syntax. We already have
text search functions like ts_lexize() and ts_debug(), and it is
better to keep consistency. I think there are two approaches for
ts_unload():
- use DSM's pin and unpin methods and the invalidation callback, as was
  done while fixing the memory leak. It has the drawback that it won't
  have an immediate effect, because the DSM will be released only when
  all backends unpin the DSM mapping.
- use DSA and the dsa_free() method. As far as I understand, dsa_free()
  frees allocated memory immediately. But it requires more work,
  because we will need some more locks. For instance, what happens when
  someone calls ts_lexize() and someone else calls dsa_free() at the
  same time?

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need for compact_palloc0() anymore.
Tests show that the czech dictionary doesn't consume more memory after
the patch.

Tests
-----

I've measured creation time of dictionaries on my 64-bit machine. You
can get them from [1]. Here the master is
434e6e1484418c55561914600de9e180fc408378. I've measured the french
dictionary too because it has an even bigger affix file than the czech
dictionary.

With patch:
  czech_hunspell   - 247 ms
  english_hunspell - 59 ms
  french_hunspell  - 103 ms

Master:
  czech_hunspell   - 224 ms
  english_hunspell - 52 ms
  french_hunspell  - 101 ms

Memory:

With patch (shared memory size + backend's memory):
  czech_hunspell   - 9573049 + 192584 total in 5 blocks; 1896 free
                     (11 chunks); 190688 used
  english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free
                     (13 chunks); 13328 used
  french_hunspell  - 4763456 + 626960 total in 7 blocks; 7680 free
                     (14 chunks); 619280 used

Here the french dictionary uses more backend memory because it has a
big affix file. Regular expression structures are still stored in
backend memory.

Master (backend's memory):
  czech_hunspell   - 17181544 total in 2034 blocks; 3584 free
                     (10 chunks); 17177960 used
  english_hunspell - 4160120 total in 506 blocks; 2792 free
                     (10 chunks); 4157328 used
  french_hunspell  - 11439184 total in 1187 blocks; 18832 free
                     (171 chunks); 11420352 used

You can see that dictionaries now take almost two times less memory.

pgbench with a select-only script:

SELECT ts_lexize('czech_hunspell', 'slon');
  patch:  30431 TPS
  master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant');
  patch:  35029 TPS
  master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
  patch:  22264 TPS
  master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi,

On 01/24/2018 06:20 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
>> I think your proposals may be implemented in several patches, so
>> they can be applied independently but consistently. I suppose I will
>> prepare a new version of the patch with fixes and with an initial
>> design of the new functions and commands soon.
>
> I attached a new version of the patch.

Thanks. I don't have time to review/test this before FOSDEM, but a
couple of comments regarding some of the points you mentioned.

>> 3) How do I unload a dictionary from the shared memory?
>> ...
>> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>>
>> 4) How do I reload a dictionary?
>> ...
>> ALTER TEXT SEARCH DICTIONARY x RELOAD
>
> I thought about it. And it seems to me that we can use functions
> ts_unload() and ts_reload() instead of new syntax. We already have
> text search functions like ts_lexize() and ts_debug(), and it is
> better to keep consistency.

This argument seems a bit strange. Both ts_lexize() and ts_debug() are
operating on text values, and are meant to be executed as functions
from SQL - particularly ts_lexize(). It's hard to imagine this
implemented as DDL commands.

The unload/reload is something that operates on a database object
(dictionary), which already has create/drop/alter DDL. So it seems
somewhat natural to treat unload/reload as another DDL action.

Taken to an extreme, this argument would essentially mean we should not
have any DDL commands because we have SQL functions.

That being said, I'm not particularly attached to having this DDL now.
Implementing it seems straight-forward (particularly when we already
have the stuff implemented as functions), and some of the other open
questions seem more important to tackle now.

> I think there are two approaches for ts_unload():
> - use DSM's pin and unpin methods and the invalidation callback, as
>   was done while fixing the memory leak. It has the drawback that it
>   won't have an immediate effect, because the DSM will be released
>   only when all backends unpin the DSM mapping.
> - use DSA and the dsa_free() method. As far as I understand,
>   dsa_free() frees allocated memory immediately. But it requires more
>   work, because we will need some more locks. For instance, what
>   happens when someone calls ts_lexize() and someone else calls
>   dsa_free() at the same time?

No opinion on this yet, I have to think about it for a bit and look at
the code first.

>> 7) You mentioned you had to get rid of the compact_palloc0 - can you
>> elaborate a bit why that was necessary? Also, when benchmarking the
>> impact of this make sure to measure not only the time but also
>> memory consumption.
>
> It seems to me that there is no need for compact_palloc0() anymore.
> Tests show that the czech dictionary doesn't consume more memory
> after the patch.

That's interesting. I'll do some additional tests to verify the
finding.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-01-24 20:57 GMT+03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
> Thanks. I don't have time to review/test this before FOSDEM, but a
> couple of comments regarding some of the points you mentioned.
Thank you for your thoughts.
> > I thought about it. And it seems to me that we can use functions
> > ts_unload() and ts_reload() instead of new syntax. We already have
> > text search functions like ts_lexize() and ts_debug(), and it is
> > better to keep consistency.
>
> This argument seems a bit strange. Both ts_lexize() and ts_debug()
> are operating on text values, and are meant to be executed as
> functions from SQL - particularly ts_lexize(). It's hard to imagine
> this implemented as DDL commands.
>
> The unload/reload is something that operates on a database object
> (dictionary), which already has create/drop/alter DDL. So it seems
> somewhat natural to treat unload/reload as another DDL action.
>
> Taken to an extreme, this argument would essentially mean we should
> not have any DDL commands because we have SQL functions.
>
> That being said, I'm not particularly attached to having this DDL
> now. Implementing it seems straight-forward (particularly when we
> already have the stuff implemented as functions), and some of the
> other open questions seem more important to tackle now.
And I agree that they can be implemented in future improvements for shared
dictionaries.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Wed, 24 Jan 2018 20:20:41 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

Hi, I did some review of the patch.

In 0001 there are a few lines where only the indentation has changed.

0002:
- TsearchShmemSize - calculating the size using hash_estimate_size
  seems redundant since you use a DSA hash now.
- ts_dict_shmem_release - LWLockAcquire at the beginning makes no
  sense, since dict_table couldn't change anyway.

0003:
- ts_dict_shmem_location could return IspellDictData, it makes more
  sense.

0006:
It's very subjective, but I think it would be nicer to call the option
Shared (as a property of the dictionary) or UseSharedMemory; a boolean
option called SharedMemory sounds weird.

Overall the patches look good, all tests passed. I tried to break it in
a few places where I thought it could be unsafe, but didn't succeed.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello,

Thank you for your review! Good catches.

On Thu, Jan 25, 2018 at 03:26:46PM +0300, Ildus Kurbangaliev wrote:
> In 0001 there are a few lines where only the indentation has changed.

Fixed.

> 0002:
> - TsearchShmemSize - calculating the size using hash_estimate_size
>   seems redundant since you use a DSA hash now.

Fixed. True, there is no need for hash_estimate_size anymore.

> - ts_dict_shmem_release - LWLockAcquire at the beginning makes no
>   sense, since dict_table couldn't change anyway.

Fixed. In an earlier version tsearch_ctl was used here, but I forgot to
remove the LWLockAcquire.

> 0003:
> - ts_dict_shmem_location could return IspellDictData, it makes more
>   sense.

I assume that ts_dict_shmem_location can be used by various types of
dictionaries, not only by Ispell. So void * is more suitable here.

> 0006:
> It's very subjective, but I think it would be nicer to call the
> option Shared (as a property of the dictionary) or UseSharedMemory; a
> boolean option called SharedMemory sounds weird.

Agree. In our offline conversation we came to Shareable, i.e. a
dictionary that can be shared. It may be more appropriate because
setting Shareable=true doesn't guarantee that a dictionary will be
allocated in shared memory, due to the max_shared_dictionaries_size
GUC.

Attached new version of the patch.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Thu, Jan 25, 2018 at 07:51:58PM +0300, Arthur Zakirov wrote:
> Attached new version of the patch.

Here is a rebased version of the patch, due to changes in
dict_ispell.c. The patch itself wasn't changed.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi,

On 2018-02-07 19:28:29 +0300, Arthur Zakirov wrote:
> +	{
> +		{"max_shared_dictionaries_size", PGC_POSTMASTER, RESOURCES_MEM,
> +			gettext_noop("Sets the maximum size of all text search dictionaries loaded into shared memory."),
> +			gettext_noop("Currently controls only loading of Ispell dictionaries. "
> +						 "If total size of simultaneously loaded dictionaries "
> +						 "reaches the maximum allowed size then a new dictionary "
> +						 "will be loaded into local memory of a backend."),
> +			GUC_UNIT_KB,
> +		},
> +		&max_shared_dictionaries_size,
> +		100 * 1024, 0, MAX_KILOBYTES,
> +		NULL, NULL, NULL
> +	},

So this uses shared memory, allocated at server start? That doesn't
seem right. Wouldn't it make more sense to have a
'num_shared_dictionaries' GUC, and then allocate them with dsm? Or even
better, not have any such limit and use a dshash table to point to
individual loaded tables?

Is there any chance we can instead convert dictionaries into a form we
can just mmap() into memory? That'd scale a lot higher and more
dynamically?

Regards,

Andres
Hello,

Thank you for your comments.

On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> On 2018-02-07 19:28:29 +0300, Arthur Zakirov wrote:
> > +	{
> > +		{"max_shared_dictionaries_size", PGC_POSTMASTER, RESOURCES_MEM,
> > +			gettext_noop("Sets the maximum size of all text search dictionaries loaded into shared memory."),
> > +			gettext_noop("Currently controls only loading of Ispell dictionaries. "
> > +						 "If total size of simultaneously loaded dictionaries "
> > +						 "reaches the maximum allowed size then a new dictionary "
> > +						 "will be loaded into local memory of a backend."),
> > +			GUC_UNIT_KB,
> > +		},
> > +		&max_shared_dictionaries_size,
> > +		100 * 1024, 0, MAX_KILOBYTES,
> > +		NULL, NULL, NULL
> > +	},
>
> So this uses shared memory, allocated at server start? That doesn't
> seem right. Wouldn't it make more sense to have a
> 'num_shared_dictionaries' GUC, and then allocate them with dsm? Or
> even better, not have any such limit and use a dshash table to point
> to individual loaded tables?

The patch uses dsm and a dshash table already.

The 'max_shared_dictionaries_size' GUC was introduced after a
discussion with Tomas [1], to limit the amount of memory consumed by
loaded dictionaries and to prevent possible memory bloat. Its default
value is 100MB.

There was a 'shared_dictionaries' GUC before; it was introduced because
usual hash tables were used then, not dshash. I replaced the usual hash
tables with dshash, removed 'shared_dictionaries' and added
'max_shared_dictionaries_size'.

> Is there any chance we can instead convert dictionaries into a form
> we can just mmap() into memory? That'd scale a lot higher and more
> dynamically?

I think the new IspellDictData structure (in
0003-Store-ispell-structures-in-shmem-v5.patch) can be stored in a
binary file and mapped into memory already. But mmap() is not used in
this patch yet. I can do some experiments and make a prototype.

1 - https://www.postgresql.org/message-id/d12d9395-922c-64c9-c87d-dd0e1d31440e%402ndquadrant.com

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Wed, Feb 07, 2018 at 07:28:29PM +0300, Arthur Zakirov wrote:
> Here is a rebased version of the patch, due to changes in
> dict_ispell.c. The patch itself wasn't changed.

Here is a rebased version of the patch, due to changes within
pg_proc.h. I haven't implemented an mmap prototype yet, though.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello Andres,

On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> Is there any chance we can instead convert dictionaries into a form
> we can just mmap() into memory? That'd scale a lot higher and more
> dynamically?

To avoid misunderstanding, can you please elaborate on using mmap()?
The DSM approach looks simpler and requires less code. Also DSM may use
mmap() if I'm not mistaken.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> Hello Andres,
>
> On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
>> Is there any chance we can instead convert dictionaries into a form
>> we can just mmap() into memory? That'd scale a lot higher and more
>> dynamically?
>
> To avoid misunderstanding, can you please elaborate on using mmap()?
> The DSM approach looks simpler and requires less code. Also DSM may
> use mmap() if I'm not mistaken.

I think the mmap() idea is that you preprocess the dictionary, store
the result in a file, and then mmap it when needed, without the
expensive preprocessing.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 07, 2018 at 10:55:29AM +0100, Tomas Vondra wrote:
> On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> > Hello Andres,
> >
> > On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> >> Is there any chance we can instead convert dictionaries into a
> >> form we can just mmap() into memory? That'd scale a lot higher and
> >> more dynamically?
> >
> > To avoid misunderstanding, can you please elaborate on using
> > mmap()? The DSM approach looks simpler and requires less code. Also
> > DSM may use mmap() if I'm not mistaken.
>
> I think the mmap() idea is that you preprocess the dictionary, store
> the result in a file, and then mmap it when needed, without the
> expensive preprocessing.

Understood. I'm not against the mmap() approach, I just lack an
understanding of mmap()'s benefits... The current shared Ispell
approach requires preprocessing after a server restart, and the main
advantage of mmap() here is that it doesn't require preprocessing after
restarting.

Speaking about the implementation: it seems that the most appropriate
place to store preprocessed files is the 'pg_dynshmem' folder. The file
prefix could be 'ts_dict.', otherwise dsm_cleanup_for_mmap() will
remove the files.

I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(), but
maybe it's worth reusing them.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
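For reference, a minimal sketch of the mmap() side of this idea: map an
already-preprocessed dictionary file read-only into the backend. The
IspellDictData type and the path handling are assumptions from this
thread, and Windows would need CreateFileMapping() instead:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static IspellDictData *
    map_preprocessed_dict(const char *path)
    {
        int         fd;
        struct stat st;
        void       *p;

        if ((fd = open(path, O_RDONLY)) < 0)
            return NULL;
        if (fstat(fd, &st) < 0)
        {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping survives the close() */
        return (p == MAP_FAILED) ? NULL : (IspellDictData *) p;
    }

The pointer-free, offset-only layout discussed earlier is what makes
this possible: the file contents are usable directly at whatever
address the kernel picks.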
2018-03-07 12:55 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> On Wed, Mar 07, 2018 at 10:55:29AM +0100, Tomas Vondra wrote:
> > On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> > > Hello Andres,
> > >
> > > On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> > >> Is there any chance we can instead convert dictionaries into a
> > >> form we can just mmap() into memory? That'd scale a lot higher
> > >> and more dynamically?
> > >
> > > To avoid misunderstanding, can you please elaborate on using
> > > mmap()? The DSM approach looks simpler and requires less code.
> > > Also DSM may use mmap() if I'm not mistaken.
> >
> > I think the mmap() idea is that you preprocess the dictionary,
> > store the result in a file, and then mmap it when needed, without
> > the expensive preprocessing.
>
> Understood. I'm not against the mmap() approach, I just lack an
> understanding of mmap()'s benefits... The current shared Ispell
> approach requires preprocessing after a server restart, and the main
> advantage of mmap() here is that it doesn't require preprocessing
> after restarting.
>
> Speaking about the implementation: it seems that the most appropriate
> place to store preprocessed files is the 'pg_dynshmem' folder. The
> file prefix could be 'ts_dict.', otherwise dsm_cleanup_for_mmap()
> will remove the files.
>
> I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(),
> but maybe it's worth reusing them.
I don't think serialization to a file (mmap) makes much sense. But the
shared dictionary should be loaded every time, and should be released
every time if possible. Maybe there can be some background worker that
holds the dictionary in memory.
Regards
Pavel
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
On Wed, Mar 07, 2018 at 01:02:07PM +0100, Pavel Stehule wrote:
> > Understood. I'm not against the mmap() approach, I just lack an
> > understanding of mmap()'s benefits... The current shared Ispell
> > approach requires preprocessing after a server restart, and the
> > main advantage of mmap() here is that it doesn't require
> > preprocessing after restarting.
> >
> > Speaking about the implementation: it seems that the most
> > appropriate place to store preprocessed files is the 'pg_dynshmem'
> > folder. The file prefix could be 'ts_dict.', otherwise
> > dsm_cleanup_for_mmap() will remove the files.
> >
> > I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(),
> > but maybe it's worth reusing them.
>
> I don't think serialization to a file (mmap) makes much sense. But
> the shared dictionary should be loaded every time, and should be
> released every time if possible. Maybe there can be some background
> worker that holds the dictionary in memory.

Do you mean that a shared dictionary should be reloaded if its .affix
and .dict files were changed? IMHO we can store the last modification
timestamp of them in the preprocessed file, and then we can rebuild the
dictionary if the files were changed.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
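A sketch of the staleness check suggested above, assuming the
preprocessed file records the mtimes of its source files (names
illustrative):

    #include <stdbool.h>
    #include <sys/stat.h>

    static bool
    dict_files_changed(const char *afffile, time_t saved_aff_mtime,
                       const char *dictfile, time_t saved_dict_mtime)
    {
        struct stat st;

        /* treat a missing file as changed, forcing a rebuild */
        if (stat(afffile, &st) != 0 || st.st_mtime != saved_aff_mtime)
            return true;
        if (stat(dictfile, &st) != 0 || st.st_mtime != saved_dict_mtime)
            return true;
        return false;
    }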
2018-03-07 13:43 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> On Wed, Mar 07, 2018 at 01:02:07PM +0100, Pavel Stehule wrote:
> > > Understood. I'm not against the mmap() approach, I just lack an
> > > understanding of mmap()'s benefits... The current shared Ispell
> > > approach requires preprocessing after a server restart, and the
> > > main advantage of mmap() here is that it doesn't require
> > > preprocessing after restarting.
> > >
> > > Speaking about the implementation: it seems that the most
> > > appropriate place to store preprocessed files is the
> > > 'pg_dynshmem' folder. The file prefix could be 'ts_dict.',
> > > otherwise dsm_cleanup_for_mmap() will remove the files.
> > >
> > > I'm not sure about reusing dsm_impl_mmap() and
> > > dsm_impl_windows(), but maybe it's worth reusing them.
> >
> > I don't think serialization to a file (mmap) makes much sense. But
> > the shared dictionary should be loaded every time, and should be
> > released every time if possible. Maybe there can be some background
> > worker that holds the dictionary in memory.
>
> Do you mean that a shared dictionary should be reloaded if its .affix
> and .dict files were changed? IMHO we can store the last modification
> timestamp of them in the preprocessed file, and then we can rebuild
> the dictionary if the files were changed.
No, it is not necessary - there should just be commands (functions) to
preload a dictionary and unload a dictionary.
On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > Do you mean that a shared dictionary should be reloaded if its
> > .affix and .dict files were changed? IMHO we can store the last
> > modification timestamp of them in the preprocessed file, and then
> > we can rebuild the dictionary if the files were changed.
>
> No, it is not necessary - there should just be commands (functions)
> to preload a dictionary and unload a dictionary.

Oh, understood. Tomas suggested those commands too, earlier. I'll
implement them. But I think it is better to track the files'
modification time too. Because now, without the patch, users don't have
to call additional commands to refresh their dictionaries, so without
such tracking we'll make dictionary maintenance harder.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > > Do you mean that a shared dictionary should be reloaded if its
> > > .affix and .dict files were changed? IMHO we can store the last
> > > modification timestamp of them in the preprocessed file, and then
> > > we can rebuild the dictionary if the files were changed.
> >
> > No, it is not necessary - there should just be commands (functions)
> > to preload a dictionary and unload a dictionary.
>
> Oh, understood. Tomas suggested those commands too, earlier. I'll
> implement them. But I think it is better to track the files'
> modification time too. Because now, without the patch, users don't
> have to call additional commands to refresh their dictionaries, so
> without such tracking we'll make dictionary maintenance harder.
Postgres doesn't have any subsystem based on modification time, so I
don't see introducing this sensitivity as practical.
Regards
Pavel
2018-03-07 14:10 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
> 2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> > On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > > > Do you mean that a shared dictionary should be reloaded if its
> > > > .affix and .dict files were changed? IMHO we can store the last
> > > > modification timestamp of them in the preprocessed file, and
> > > > then we can rebuild the dictionary if the files were changed.
> > >
> > > No, it is not necessary - there should just be commands
> > > (functions) to preload a dictionary and unload a dictionary.
> >
> > Oh, understood. Tomas suggested those commands too, earlier. I'll
> > implement them. But I think it is better to track the files'
> > modification time too. Because now, without the patch, users don't
> > have to call additional commands to refresh their dictionaries, so
> > without such tracking we'll make dictionary maintenance harder.
>
> Postgres doesn't have any subsystem based on modification time, so I
> don't see introducing this sensitivity as practical.
Usually the shared dictionaries are used for complex language-based
full text search. The frequency of updates of these dictionaries is
lower than that of PostgreSQL updates. The czech dictionary has been
the same for 10 years.
Regards
Pavel
On Wed, Mar 07, 2018 at 02:12:32PM +0100, Pavel Stehule wrote:
> 2018-03-07 14:10 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
> > 2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> >> Oh, understood. Tomas suggested those commands too, earlier. I'll
> >> implement them. But I think it is better to track the files'
> >> modification time too. Because now, without the patch, users don't
> >> have to call additional commands to refresh their dictionaries, so
> >> without such tracking we'll make dictionary maintenance harder.
> >
> > Postgres doesn't have any subsystem based on modification time, so
> > I don't see introducing this sensitivity as practical.
>
> Usually the shared dictionaries are used for complex language-based
> full text search. The frequency of updates of these dictionaries is
> lower than that of PostgreSQL updates. The czech dictionary has been
> the same for 10 years.

Agree. In this case auto-reloading isn't an important feature here.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 03/07/2018 02:18 PM, Arthur Zakirov wrote:
> On Wed, Mar 07, 2018 at 02:12:32PM +0100, Pavel Stehule wrote:
>> Usually the shared dictionaries are used for complex language-based
>> full text search. The frequency of updates of these dictionaries is
>> lower than that of PostgreSQL updates. The czech dictionary has been
>> the same for 10 years.
>
> Agree. In this case auto-reloading isn't an important feature here.

Arthur, what are your plans with this patch in the current CF? It does
not seem to be moving towards RFC very much, and reworking the patch to
use mmap() seems like a quite significant change late in the CF. Which
means it's likely to cause the patch to get bumped to the next CF
(2018-09).

FWIW I am not quite sure if the mmap() approach is better than what was
implemented by the patch. I'm not sure how exactly it will behave under
memory pressure (AFAIK it goes through the page cache, which means
random parts of dictionaries might get evicted) or how well it is
supported on various platforms (say, Windows).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Tomas,
Arthur, what are your plans with this patch in the current CF?
I think the dsm-based approach is already in good shape and works nicely.
I've planned only to improve the documentation a little. Also it seems I should change the 0004 part; I found that the extension upgrade scripts may be made in the wrong way.
In my opinion the RELOAD and UNLOAD commands can be done in the next commitfest (2018-09).
Did you look at it? Do you have any comments on how the shared memory allocation and release functions are implemented?
It does not seem to be moving towards RFC very much, and reworking the
patch to use mmap() seems like a quite significant change late in the
CF. Which means it's likely to cause the patch to get bumped to the
next CF (2018-09).
Agreed. I have a draft version of the mmap-based approach which works on platforms with mmap. On Windows it is necessary to use another API (CreateFileMapping, etc). But this approach requires more work on handling the processed dictionary files (how to name them, when to remove them).
FWIW I am not quite sure if the mmap() approach is better than what was
implemented by the patch. I'm not sure how exactly it will behave under
memory pressure (AFAIK it goes through the page cache, which means random
parts of dictionaries might get evicted) or how well it is supported on
various platforms (say, Windows).
Yes, as I wrote, the mmap-based approach requires more work. The only benefit I see is that you don't need to process a dictionary again after a server restart. I'd vote for the dsm-based approach.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
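(For illustration only: a minimal POSIX-only sketch of the mmap idea discussed above. The function name is invented; Windows would need CreateFileMapping()/MapViewOfFile() instead, and the naming and cleanup of the processed files -- the hard part Arthur mentions -- is left out entirely.)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a preprocessed dictionary file read-only into this backend. */
    static void *
    map_dict_file(const char *path, size_t *size)
    {
        int         fd = open(path, O_RDONLY);
        struct stat st;
        void       *addr;

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) != 0)
        {
            close(fd);
            return NULL;
        }
        addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping survives the close */
        if (addr == MAP_FAILED)
            return NULL;
        *size = (size_t) st.st_size;
        return addr;
    }

The OS pages the mapped data in and out on its own, which is exactly the behavior questioned below with respect to memory pressure.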
On 03/17/2018 05:43 AM, Arthur Zakirov wrote: > Hello Tomas, > > Arthur, what are your plans with this patch in the current CF? > > > I think the dsm-based approach is already in good shape and works nicely. > I've planned only to improve the documentation a little. Also it seems I > should change the 0004 part; I found that the extension upgrade scripts may > be made in the wrong way. > In my opinion the RELOAD and UNLOAD commands can be done in the next > commitfest (2018-09). > Did you look at it? Do you have any comments on how the shared memory > allocation and release functions are implemented? > > > > It does not seem to be moving towards RFC very much, and reworking the > patch to use mmap() seems like a quite significant change late in the > CF. Which means it's likely to cause the patch to get bumped to the > next CF (2018-09). > > > Agreed. I have a draft version of the mmap-based approach which works on > platforms with mmap. On Windows it is necessary to use another API > (CreateFileMapping, etc). But this approach requires more work on > handling the processed dictionary files (how to name them, when to remove > them). > > > > FWIW I am not quite sure if the mmap() approach is better than what was > implemented by the patch. I'm not sure how exactly it will behave under > memory pressure (AFAIK it goes through the page cache, which means random > parts of dictionaries might get evicted) or how well it is supported on > various platforms (say, Windows). > > > Yes, as I wrote, the mmap-based approach requires more work. The only > benefit I see is that you don't need to process a dictionary again after > a server restart. I'd vote for the dsm-based approach. > I do agree with that. We have a working well-understood dsm-based solution, addressing the goals initially explained in this thread. I don't see a reason to stall this patch based on a mere assumption that the mmap-based approach might be magically better in some unknown aspects. It might be, but we may as well leave that as future work. I wonder how much of this patch would be affected by the switch from dsm to mmap? I guess the memory limit would get mostly irrelevant (mmap would rely on the OS to page the memory in/out depending on memory pressure), and so would the UNLOAD/RELOAD commands (because each backend would do its own mmap). In any case, I suggest polishing the dsm-based patch, and seeing if we can get that one into PG11. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote: > I do agree with that. We have a working well-understood dsm-based > solution, addressing the goals initially explained in this thread. Well, it's also awkward and manual to use. I do think that's something we've to pay attention to. > I wonder how much of this patch would be affected by the switch from dsm > to mmap? I guess the memory limit would get mostly irrelevant (mmap > would rely on the OS to page the memory in/out depending on memory > pressure), and so would the UNLOAD/RELOAD commands (because each backend > would do its own mmap). Those seem fairly major. Greetings, Andres Freund
Arthur Zakirov wrote: > I've planned only to improve the documentation a little. Also it seems I > should change the 0004 part; I found that the extension upgrade scripts may > be made in the wrong way. I've attached a new version of the patch. In this version I removed 0004-Update-tmplinit-arguments-v6.patch. In my opinion it handled extension upgrades in the wrong way. If I'm not mistaken, currently there is no way to upgrade a template's init function signature. And I didn't find a way to change init_method(internal) to init_method(internal, internal) within an extension's upgrade script. Therefore I added 0002-Change-tmplinit-argument-v7.patch. Now a DictInitData struct is passed to a template's init method. It contains the necessary data: dictoptions and dictid. And there is no need to change the method's signature. Other parts of the patch are the same, except that they use the DictInitData structure now. On Mon, Mar 19, 2018 at 01:52:41AM +0100, Tomas Vondra wrote: > I wonder how much of this patch would be affected by the switch from dsm > to mmap? I guess the memory limit would get mostly irrelevant (mmap > would rely on the OS to page the memory in/out depending on memory > pressure), and so would the UNLOAD/RELOAD commands (because each backend > would do its own mmap). I believe mmap requires completely rewriting the 0003 part of the patch and small changes in 0005. > In any case, I suggest polishing the dsm-based patch, and seeing if we can > get that one into PG11. Yes, we have more time in future commitfests if the dsm-based patch isn't approved. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Mon, 19 Mar 2018 14:06:50 +0300 Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > > I believe mmap requires completely rewriting the 0003 part of the patch > and small changes in 0005. > > > In any case, I suggest polishing the dsm-based patch, and seeing if we > > can get that one into PG11. > > Yes, we have more time in future commitfests if the dsm-based patch isn't > approved. > Hi, I'm not sure about the mmap approach; it would just bring other problems. I like the dsm approach because it doesn't invent any new files in the database, whereas the mmap approach will possibly require a new folder in the data directory and management of a bunch of new files, with additional issues related to pg_upgrade etc. Also, with the dsm approach, if someone needs to update dictionaries then he (or his package manager) can just replace the files and be done with it. -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 03/19/2018 02:34 AM, Andres Freund wrote: > Hi, > > On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote: >> I do agree with that. We have a working well-understood dsm-based >> solution, addressing the goals initially explained in this thread. > > Well, it's also awkward and manual to use. I do think that's > something we've to pay attention to. > Awkward in what sense? I don't think the manual aspect is an issue. Currently we have no way to reload the dictionary, except for restarting all the backends. I don't see that as a particularly convenient solution. Also, this is pretty much how the shared_ispell extension works, although you might argue that was more due to the limitation of how shared memory could be used in extensions before DSM was introduced. In any case, I've never heard complaints about this aspect of the extension. There are two things that might be automated - reloading of dictionaries and evicting them when hitting the memory limit. I have tried to implement that in the shared_ispell extension but it's a bit more complicated than it looks. For example, it seems obvious to reload the dictionary when the file timestamp changes. But in fact there are three files - dict, affixes, stopwords. So will you reload when a single file changes? All of them? Keep in mind that the new version of the dictionary may use different affixes, so a reload at the wrong moment may produce broken results. > >> I wonder how much of this patch would be affected by the switch >> from dsm to mmap? I guess the memory limit would get mostly >> irrelevant (mmap would rely on the OS to page the memory in/out >> depending on memory pressure), and so would the UNLOAD/RELOAD >> commands (because each backend would do its own mmap). > > Those seem fairly major. > I'm not sure I'd say those are major. And you might also see the lack of these capabilities as negative points for the mmap approach. So, I'm not at all convinced the mmap approach is actually better than the dsm one. And if we come up with a good way to automate some of these tasks, I don't see why that would be possible with mmap and not with dsm. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
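(To make the three-files problem above concrete, a hedged sketch of what the timestamp test would have to do; all names here are hypothetical, not from the patch. Note that seeing one changed file says nothing about whether the other files are already consistent with it.)

    #include <stdbool.h>
    #include <sys/stat.h>

    /* one dictionary can depend on several files: .dict, .affix, stopwords */
    typedef struct DictFileStamp
    {
        const char *path;
        time_t      mtime;          /* mtime recorded when the dict was loaded */
    } DictFileStamp;

    static bool
    dict_needs_reload(const DictFileStamp stamps[], int nfiles)
    {
        for (int i = 0; i < nfiles; i++)
        {
            struct stat st;

            if (stat(stamps[i].path, &st) != 0)
                return true;        /* file vanished: force a reload */
            if (st.st_mtime != stamps[i].mtime)
                return true;        /* this file changed, but nothing tells us
                                     * whether the other files have already been
                                     * updated to match it */
        }
        return false;
    }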
On 2018-03-19 14:52:34 +0100, Tomas Vondra wrote: > On 03/19/2018 02:34 AM, Andres Freund wrote: > > Hi, > > > > On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote: > >> I do agree with that. We have a working well-understood dsm-based > >> solution, addressing the goals initially explained in this thread. > > > > Well, it's also awkward and manual to use. I do think that's > > something we've to pay attention to. > > > > Awkward in what sense? You've to manually configure a setting that can only be set at server start. You can't set it as big as necessary because it might use up memory better used for other things. It needs the full space for dictionaries even if the majority of it never will be needed. All of those aren't needed in an mmap world. > So, I'm not at all convinced the mmap approach is actually better than > the dsm one. And if we come up with a good way to > automate some of these tasks, I don't see why that would be possible > with mmap and not with dsm. To me it seems we'll end up needing a heck of a lot more code than the OS already implements if we do it ourselves. Greetings, Andres Freund
On 03/19/2018 07:07 PM, Andres Freund wrote: > On 2018-03-19 14:52:34 +0100, Tomas Vondra wrote: >> On 03/19/2018 02:34 AM, Andres Freund wrote: >>> Hi, >>> >>> On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote: >>>> I do agree with that. We have a working well-understood dsm-based >>>> solution, addressing the goals initially explained in this thread. >>> >>> Well, it's also awkward and manual to use. I do think that's >>> something we've to pay attention to. >>> >> >> Awkward in what sense? > > You've to manually configure a setting that can only be set at server > start. You can't set it as big as necessary because it might use up > memory better used for other things. It needs the full space for > dictionaries even if the majority of it never will be needed. All of > those aren't needed in an mmap world. > Which is not quite true, because that's not what the patch does. Each dictionary is loaded into a separate dsm segment when needed, which is then stored in a dhash table. So most of what you wrote is not really true - the patch does not pre-allocate the space, and the setting might be set even after server start (it's not defined like that currently, but that should be trivial to change). > >> So, I'm not at all convinced the mmap approach is actually better >> than the dsm one. And if we come up with a good way >> to automate some of these tasks, I don't see why that would be >> possible with mmap and not with dsm. > > To me it seems we'll end up needing a heck of a lot more code than > the OS already implements if we do it ourselves. > Like what? Which features do you expect to need much more code? The automated reloading will need a fairly small amount of code - the main issue is deciding when to reload, and as I mentioned before that's more complicated than you seem to believe. In fact, it may not even be possible - there's no way to decide if all files are already updated. Currently we kinda ignore that, on the assumption that dictionaries change only rarely. We may do the same thing and reload the dict if at least one file changes. In any case, the amount of code is trivial. In fact, it may be more complicated in the mmap case - how do you update a dictionary that is already mapped to multiple processes? The eviction is harder - I'll give you that. But then again, I'm not sure the mmap approach is really what we want here - it seems better to evict the whole dictionary than some random pages from many of them. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
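(A rough sketch of the per-dictionary DSM scheme Tomas describes above. The entry layout loosely follows the patch's naming but is assumed here, the real code keeps entries in a dshash table, and locking and error handling are omitted.)

    /* assumed entry layout for a dictionary tracked in shared memory */
    typedef struct TsearchDictEntry
    {
        Oid         db_id;
        Oid         dict_id;
        dsm_handle  segment_handle; /* 0 used here to mean "not created yet" */
        Size        dict_size;
    } TsearchDictEntry;

    static void *
    dict_shmem_location(TsearchDictEntry *entry,
                        const void *built_dict, Size dict_size)
    {
        dsm_segment *seg;

        if (entry->segment_handle == 0)
        {
            /* first use anywhere: copy the built dictionary into a segment */
            seg = dsm_create(dict_size, 0);
            memcpy(dsm_segment_address(seg), built_dict, dict_size);
            dsm_pin_segment(seg);       /* survive with zero attachments */
            entry->segment_handle = dsm_segment_handle(seg);
            entry->dict_size = dict_size;
        }
        else
            seg = dsm_attach(entry->segment_handle);

        dsm_pin_mapping(seg);           /* keep mapped until backend exit */
        return dsm_segment_address(seg);
    }

Nothing is pre-allocated: memory is consumed per dictionary, on first use, which is the point Tomas makes against Andres' objection.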
On Mon, Mar 19, 2018 at 07:40:54PM +0100, Tomas Vondra wrote: > > > On 03/19/2018 07:07 PM, Andres Freund wrote: > > You've to manually configure a setting that can only be set at server > > start. You can't set it as big as necessary because it might use up > > memory better used for other things. It needs the full space for > > dictionaries even if the majority of it never will be needed. All of > > those aren't needed in an mmap world. > > > > Which is not quite true, because that's not what the patch does. > > Each dictionary is loaded into a separate dsm segment when needed, which > is then stored in a dhash table. So most of what you wrote is not really > true - the patch does not pre-allocate the space, and the setting might > be set even after server start (it's not defined like that currently, > but that should be trivial to change). Oh, that's true. I had planned to fix it but somehow I forgot to allow changing the max_shared_dictionaries_size GUC via pg_reload_conf(). I'll fix it and will send a new version of the patch. > > To me it seems we'll end up needing a heck of a lot more code than > > the OS already implements if we do it ourselves. > > > > Like what? Which features do you expect to need much more code? > > The automated reloading will need a fairly small amount of code - the > main issue is deciding when to reload, and as I mentioned before that's > more complicated than you seem to believe. In fact, it may not even be > possible - there's no way to decide if all files are already updated. > Currently we kinda ignore that, on the assumption that dictionaries > change only rarely. We may do the same thing and reload the dict if at > least one file changes. In any case, the amount of code is trivial. > > In fact, it may be more complicated in the mmap case - how do you update > a dictionary that is already mapped to multiple processes? > > The eviction is harder - I'll give you that. But then again, I'm not > sure the mmap approach is really what we want here - it seems better to > evict the whole dictionary than some random pages from many of them. Agreed. The mmap approach requires the same code plus code to handle the cache files which will be mapped into memory. With the mmap approach we need to solve the same issues we face now, and more. Also we need to somehow automatically reload dictionaries in both cases. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Hi Arthur, I went through the patch - just skimming through the diffs, will do more testing tomorrow. Here are a few initial comments. 1) max_shared_dictionaries_size / PGC_POSTMASTER I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it can't be changed after server start. That seems like a fairly useful thing to do (e.g. increase the limit while the server is running), and after looking at the code I think it shouldn't be difficult to change. The other thing I'd suggest is handling "-1" as "no limit". 2) max_shared_dictionaries_size / size of number Some of the comments dealing with the GUC treat it as a number of dictionaries (instead of a size). I suppose that's due to how the original patch was implemented. 3) Assert(max_shared_dictionaries_size); I'd say that assert is not very clear - it should be Assert(max_shared_dictionaries_size > 0); or something along the lines. It's also a good idea to add a comment explaining the assert, say /* we can only get here when shared dictionaries are enabled */ Assert(max_shared_dictionaries_size > 0); 4) I took the liberty of rewording some of the docs/comments. See the attached diffs, that should apply on top of 0003 and 0004 patches. Please, treat those as mere suggestions. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
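(For reference, a sketch of how such a config_int entry might look in guc.c with PGC_SIGHUP and -1 handled as "no limit"; the boot value and description strings are illustrative only, not the patch's actual text.)

    {
        {"max_shared_dictionaries_size", PGC_SIGHUP, RESOURCES_MEM,
            gettext_noop("Sets the maximum size of all text search "
                         "dictionaries loaded into shared memory."),
            gettext_noop("-1 means no limit, 0 disables loading "
                         "dictionaries into shared memory."),
            GUC_UNIT_KB
        },
        &max_shared_dictionaries_size,
        100 * 1024,                     /* illustrative boot value: 100MB */
        -1, MAX_KILOBYTES,
        NULL, NULL, NULL
    },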
Attachment
Hello, On Mon, Mar 19, 2018 at 08:50:46PM +0100, Tomas Vondra wrote: > Hi Arthur, > > I went through the patch - just skimming through the diffs, will do more > testing tomorrow. Here are a few initial comments. Thank you for the review! > 1) max_shared_dictionaries_size / PGC_POSTMASTER > > I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it > can't be changed after server start. That seems like a fairly useful > thing to do (e.g. increase the limit while the server is running), and > after looking at the code I think it shouldn't be difficult to change. max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a check of the new value which disallows setting it to zero if there are loaded dictionaries, and disallows decreasing the maximum allowed size below the size already loaded. > The other thing I'd suggest is handling "-1" as "no limit". I added the ability to set '-1'. I fixed some comments and the documentation. > 2) max_shared_dictionaries_size / size of number > > Some of the comments dealing with the GUC treat it as a number of > dictionaries (instead of a size). I suppose that's due to how the > original patch was implemented. Fixed. Should be good now. > 3) Assert(max_shared_dictionaries_size); > > I'd say that assert is not very clear - it should be > > Assert(max_shared_dictionaries_size > 0); > > or something along the lines. It's also a good idea to add a comment > explaining the assert, say > > /* we can only get here when shared dictionaries are enabled */ > Assert(max_shared_dictionaries_size > 0); I fixed the assert and added the comment. I extended the assert so that it also takes the -1 value into account. > 4) I took the liberty of rewording some of the docs/comments. See the > attached diffs, that should apply on top of 0003 and 0004 patches. > Please, treat those as mere suggestions. I applied your diffs and added changes for max_shared_dictionaries_size. Please find the attached new version of the patch. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On 03/20/2018 02:11 PM, Arthur Zakirov wrote: > Hello, > > On Mon, Mar 19, 2018 at 08:50:46PM +0100, Tomas Vondra wrote: >> Hi Arthur, >> >> I went through the patch - just skimming through the diffs, will do more >> testing tomorrow. Here are a few initial comments. > > Thank you for the review! > >> 1) max_shared_dictionaries_size / PGC_POSTMASTER >> >> I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it >> can't be changed after server start. That seems like a fairly useful >> thing to do (e.g. increase the limit while the server is running), and >> after looking at the code I think it shouldn't be difficult to change. > > max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a check > of the new value which disallows setting it to zero if there are loaded > dictionaries, and disallows decreasing the maximum allowed size below the > size already loaded. > I wonder if these restrictions are needed? I mean, why not allow setting max_shared_dictionaries_size below the size of the loaded dictionaries? Of course, on the one hand those restrictions seem sensible. On the other hand, perhaps in some cases it would be useful to allow violating them? I mean, why not simply disable loading of new dictionaries when (max_shared_dictionaries_size < loaded_size) Maybe I'm over-thinking this though. It's probably safer and less surprising to enforce the restrictions. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 20, 2018 at 09:30:15PM +0100, Tomas Vondra wrote: > On 03/20/2018 02:11 PM, Arthur Zakirov wrote: > > max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a check > > of the new value which disallows setting it to zero if there are loaded > > dictionaries, and disallows decreasing the maximum allowed size below the > > size already loaded. > > > > I wonder if these restrictions are needed? I mean, why not allow setting > max_shared_dictionaries_size below the size of the loaded dictionaries? > > Of course, on the one hand those restrictions seem sensible. On the other > hand, perhaps in some cases it would be useful to allow violating them? > > I mean, why not simply disable loading of new dictionaries when > > (max_shared_dictionaries_size < loaded_size) > > Maybe I'm over-thinking this though. It's probably safer and less > surprising to enforce the restrictions. Hm, yes, in some cases this check may be over-engineering. I thought that it was reasonable and safer in the v7 patch. But there are similar GUCs, wal_keep_segments and max_wal_size, which don't do additional checks. And people are fine with them. So I removed that check from the variable. Please find the attached new version of the patch. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Wed, Mar 21, 2018 at 12:00:52PM +0300, Arthur Zakirov wrote: > On Tue, Mar 20, 2018 at 09:30:15PM +0100, Tomas Vondra wrote: > > I wonder if these restrictions are needed? I mean, why not allow setting > > max_shared_dictionaries_size below the size of the loaded dictionaries? > > > > Of course, on the one hand those restrictions seem sensible. On the other > > hand, perhaps in some cases it would be useful to allow violating them? > > > > I mean, why not simply disable loading of new dictionaries when > > > > (max_shared_dictionaries_size < loaded_size) > > > > Maybe I'm over-thinking this though. It's probably safer and less > > surprising to enforce the restrictions. > > Hm, yes, in some cases this check may be over-engineering. I thought that > it was reasonable and safer in the v7 patch. But there are similar GUCs, > wal_keep_segments and max_wal_size, which don't do additional checks. > And people are fine with them. So I removed that check from the variable. > > Please find the attached new version of the patch. I forgot to fix the regression tests for max_shared_dictionaries_size. Also I'm not confident about using pg_reload_conf() in regression tests; I haven't found where pg_reload_conf() is used in tests. So I removed the max_shared_dictionaries_size tests for now. Sorry for the noise. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > [ v10 patch versions ] I took a quick look through this. I agree with the comments about mmap-ability not being something we should insist on now, and maybe not ever. However, in order to keep our options open, it seems like we should minimize the amount of API we expose that's based on the current implementation. That leads me to the following thoughts: * I cannot imagine a use-case for setting max_shared_dictionaries_size to anything except "unlimited". If it's not that, and you exceed it, then subsequent backends load private copies of the dictionary, making your memory situation rapidly worse not better. I think we should lose that GUC altogether and just load dictionaries automatically. * Similarly, I see no point in a "sharable" option on individual dictionaries, especially when there's only one allowed setting for most dictionary types. Let's lose that too. * And that leads us to not particularly need a view telling which dictionaries are loaded, either. It's just an implementation detail that users don't need to worry about. This does beg the question of whether we need a way to flush dictionary contents that's short of restarting the server (or short of dropping and recreating the dictionary). I'm not sure, but even if we do, none of the above is necessary for that. I do think it's required that changing the dictionary's options with ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's happening with this patch, I don't see where. (It might work to use the combination of dictionary OID and TID of the dictionary's pg_ts_dict tuple as the lookup key for shared dictionaries. Oh, and have you thought about the possibility of conflicting OIDs in different DBs? Probably the database OID has to be part of the key, as well.) Also, the scheme for releasing the dictionary DSM during RemoveTSDictionaryById is uncertain and full of race conditions: the DROP might roll back later, or someone might come along and start using the dictionary (causing a fresh DSM load) before the DROP commits and makes the dictionary invisible to other sessions. I don't think that either of those are necessarily fatal objections, but there needs to be some commentary there explaining what happens. BTW, I was going to complain that this patch alters the API for dictionary template init functions without any documentation updates; but then I realized that there isn't any documentation to update. That pretty well sucks, but I suppose it's not the job of this patch to improve that situation. Still, you could spend a bit more effort on the commentary in ts_public.h in 0002, because that commentary is as close to an API spec as we've got. regards, tom lane
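(In sketch form, the lookup key Tom is describing -- roughly the patch's DictPointerData; the exact field layout here is assumed, not taken from the patch.)

    /* enough fields to tell dictionaries apart across databases and
     * across ALTER TEXT SEARCH DICTIONARY */
    typedef struct DictPointerData
    {
        Oid             db_id;      /* database OID; dictionary OIDs can
                                     * collide across databases */
        Oid             dict_id;    /* dictionary OID */
        ItemPointerData tid;        /* TID of the pg_ts_dict tuple */
        TransactionId   xmin;       /* xmin of the pg_ts_dict tuple; changes
                                     * whenever the row is rewritten */
    } DictPointerData;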
On 3/24/18 9:56 PM, Tom Lane wrote: > Arthur Zakirov <a.zakirov@postgrespro.ru> writes: >> [ v10 patch versions ] > > I took a quick look through this. I agree with the comments about > mmap-ability not being something we should insist on now, and maybe > not ever. However, in order to keep our options open, it seems like > we should minimize the amount of API we expose that's based on the > current implementation. That leads me to the following thoughts: > > * I cannot imagine a use-case for setting max_shared_dictionaries_size > to anything except "unlimited". If it's not that, and you exceed it, > then subsequent backends load private copies of the dictionary, making > your memory situation rapidly worse not better. I think we should lose > that GUC altogether and just load dictionaries automatically. > Introduction of that limit is likely my fault. It came from an extension I wrote a long time ago, but back then it was a necessity because we did not have DSM. So in retrospect I agree with you - it's not particularly useful and we should ditch it. Arthur, let this be a lesson for you! You have to start fighting against bogus feature requests from other people ;-) > * Similarly, I see no point in a "sharable" option on individual > dictionaries, especially when there's only one allowed setting for > most dictionary types. Let's lose that too. > I'm not so sure. Imagine you have a small number of dictionaries that are used frequently, and then many that are used only once in a while. A good example is a system handling documents in various languages, serving "local" customers in a few local languages most of the time, but then once in a while there's a request in another language. In that case it makes sense to keep the frequently used ones in shared memory all the time, and the rest load as needed (and throw it away when the backend disconnects). Which is exactly what the 'shareable' option is about ... > * And that leads us to not particularly need a view telling which > dictionaries are loaded, either. It's just an implementation detail > that users don't need to worry about. > Not so sure about this either. Of course, if we remove the memory limit, it will be predictable which dictionaries are loaded in shared memory and which are in the backends. > This does beg the question of whether we need a way to flush dictionary > contents that's short of restarting the server (or short of dropping and > recreating the dictionary). I'm not sure, but even if we do, none of > the above is necessary for that. > Ummm, I don't follow. Why would flushing a dictionary be similar to restarting a server? It's certainly simpler to do a RELOAD on an existing dictionary than having to do DROP+CREATE. I would not be surprised if DROP+CREATE was significantly harder process-wise for many admins (I mean, more approvals). > > I do think it's required that changing the dictionary's options with > ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's > happening with this patch, I don't see where. (It might work to use > the combination of dictionary OID and TID of the dictionary's pg_ts_dict > tuple as the lookup key for shared dictionaries. Oh, and have you > thought about the possibility of conflicting OIDs in different DBs? > Probably the database OID has to be part of the key, as well.) > Not sure. 
> Also, the scheme for releasing the dictionary DSM during > RemoveTSDictionaryById is uncertain and full of race conditions: > the DROP might roll back later, or someone might come along and > start using the dictionary (causing a fresh DSM load) before the > DROP commits and makes the dictionary invisible to other sessions. > I don't think that either of those are necessarily fatal objections, > but there needs to be some commentary there explaining what happens. > Actually, I think that's an issue - such race condition might easily leak the shared memory forever (because the new dictionary will get a different OID etc.). It probably is not happening very often, because dictionaries are not dropped very often. But it needs fixing I think. > BTW, I was going to complain that this patch alters the API for > dictionary template init functions without any documentation updates; > but then I realized that there isn't any documentation to update. > That pretty well sucks, but I suppose it's not the job of this patch > to improve that situation. Still, you could spend a bit more effort on > the commentary in ts_public.h in 0002, because that commentary is as > close to an API spec as we've got. > Yeah :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 3/24/18 9:56 PM, Tom Lane wrote: >> Also, the scheme for releasing the dictionary DSM during >> RemoveTSDictionaryById is uncertain and full of race conditions: >> the DROP might roll back later, or someone might come along and >> start using the dictionary (causing a fresh DSM load) before the >> DROP commits and makes the dictionary invisible to other sessions. >> I don't think that either of those are necessarily fatal objections, >> but there needs to be some commentary there explaining what happens. > Actually, I think that's an issue - such race condition might easily > leak the shared memory forever (because the new dictionary will get a > different OID etc.). It probably is not happening very often, because > dictionaries are not dropped very often. But it needs fixing I think. My thought was (a) the ROLLBACK case is ok, because the next use of the dictionary will reload it, and (b) the reload-concurrently-with- DROP case is annoying, because indeed it leaks, but the window is small and it probably won't be an issue in practice. We would need to be sure that the DSM segment goes away at postmaster restart, but given that I think it'd be tolerable. Of course it'd be better not to have the race, but I see no easy way to prevent it -- do you? regards, tom lane
On 03/25/2018 06:18 AM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 3/24/18 9:56 PM, Tom Lane wrote: >>> Also, the scheme for releasing the dictionary DSM during >>> RemoveTSDictionaryById is uncertain and full of race conditions: >>> the DROP might roll back later, or someone might come along and >>> start using the dictionary (causing a fresh DSM load) before the >>> DROP commits and makes the dictionary invisible to other sessions. >>> I don't think that either of those are necessarily fatal objections, >>> but there needs to be some commentary there explaining what happens. > >> Actually, I think that's an issue - such race condition might easily >> leak the shared memory forever (because the new dictionary will get a >> different OID etc.). It probably is not happening very often, because >> dictionaries are not dropped very often. But it needs fixing I think. > > My thought was (a) the ROLLBACK case is ok, because the next use of > the dictionary will reload it, and (b) the reload-concurrently-with- > DROP case is annoying, because indeed it leaks, but the window is small > and it probably won't be an issue in practice. We would need to be > sure that the DSM segment goes away at postmaster restart, but given > that I think it'd be tolerable. Of course it'd be better not to have > the race, but I see no easy way to prevent it -- do you? > Unfortunately no :( For a moment I thought that perhaps we could make it a responsibility of the last user of the dictionary - set a flag in the shared memory, and the last user would remove it. But the trouble is how to decide who's the last one? Even with some simple reference counting, we probably don't know if that's the really last one. FWIW this is where the view listing dictionaries loaded into shared memory would be helpful - you'd at least know there's a dictionary, wasting memory. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > FWIW this is where the view listing dictionaries loaded into shared > memory would be helpful - you'd at least know there's a dictionary, > wasting memory. Well, that's only because we failed to make the implementation transparent :-(. But it's not unlikely that an mmap-based implementation would be simply incapable of supporting such a view: the knowledge of whether a particular file is mapped in would be pretty much process-local, I think. So I'd really rather we don't add that. Also, while these dictionaries are indeed kind of large relative to our traditional view of shared memory, if they're in DSM segments that the kernel can swap out then I really suspect that nobody would much care if a few such segments had been leaked. I find it hard to imagine a use-case where DROP race conditions would lead us to leak so many that it becomes a serious problem. Maybe I lack imagination. regards, tom lane
On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote: > Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > > [ v10 patch versions ] Thank you for the review, Tom! Tomas Vondra wrote: > Tom Lane wrote: >> * I cannot imagine a use-case for setting max_shared_dictionaries_size >> to anything except "unlimited". If it's not that, and you exceed it, >> then subsequent backends load private copies of the dictionary, making >> your memory situation rapidly worse not better. I think we should lose >> that GUC altogether and just load dictionaries automatically. > > Introduction of that limit is likely my fault. It came from an > extension I wrote a long time ago, but back then it was a necessity > because we did not have DSM. So in retrospect I agree with you - it's > not particularly useful and we should ditch it. > > Arthur, let this be a lesson for you! You have to start fighting against > bogus feature requests from other people ;-) Yeah, in this sense max_shared_dictionaries_size is pointless. I'll remove it then :). > * Similarly, I see no point in a "sharable" option on individual > dictionaries, especially when there's only one allowed setting for > most dictionary types. Let's lose that too. I think the "Shareable" option could be useful if a shared dictionary's building time were much longer than a non-shared dictionary's building time. It is slightly longer because of the additional memcpy(), but it isn't noticeable, I think. So it is worth removing it. > * And that leads us to not particularly need a view telling which > dictionaries are loaded, either. It's just an implementation detail > that users don't need to worry about. If all dictionaries will be shareable then this view could be removed. Unfortunately I think it can't help with leaked segments, I didn't find a way to iterate dshash entries. That's why pg_ts_shared_dictionaries() scans pg_ts_dict table instead of scanning dshash table. > I do think it's required that changing the dictionary's options with > ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's > happening with this patch, I don't see where. (It might work to use > the combination of dictionary OID and TID of the dictionary's pg_ts_dict > tuple as the lookup key for shared dictionaries. Oh, and have you > thought about the possibility of conflicting OIDs in different DBs? > Probably the database OID has to be part of the key, as well.) Yes unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a dictionary. TID can help here. I thought about using XID too when I started to work on RELOAD command. But I'm not sure that it is a good idea, anyway XID isn't needed in current version. > Also, the scheme for releasing the dictionary DSM during > RemoveTSDictionaryById is uncertain and full of race conditions: > the DROP might roll back later, or someone might come along and > start using the dictionary (causing a fresh DSM load) before the > DROP commits and makes the dictionary invisible to other sessions. > I don't think that either of those are necessarily fatal objections, > but there needs to be some commentary there explaining what happens. I missed this case. As you wrote below, the ROLLBACK case is ok. But I don't have a solution for the second case for now. If I can't solve it I'll add additional comments in RemoveTSConfigurationById() and maybe in the documentation if it's appropriate. 
> BTW, I was going to complain that this patch alters the API for > dictionary template init functions without any documentation updates; > but then I realized that there isn't any documentation to update. > That pretty well sucks, but I suppose it's not the job of this patch > to improve that situation. Still, you could spend a bit more effort on > the commentary in ts_public.h in 0002, because that commentary is as > close to an API spec as we've got. I'll fix the comments. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Sun, Mar 25, 2018 at 06:45:08AM +0200, Tomas Vondra wrote: > FWIW this is where the view listing dictionaries loaded into shared > memory would be helpful - you'd at least know there's a dictionary, > wasting memory. Unfortunately, it seems that this view can't help with listing leaked segments. I didn't find a way to list dshash entries. Currently pg_ts_shared_dictionaries() scans the pg_ts_dict table and gets a dshash item using dictId. In the case of leaked dictionaries we don't know their identifiers. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote: >> * And that leads us to not particularly need a view telling which >> dictionaries are loaded, either. It's just an implementation detail >> that users don't need to worry about. > If all dictionaries will be shareable then this view could be removed. > Unfortunately I think it can't help with leaked segments, I didn't find > a way to iterate dshash entries. That's why pg_ts_shared_dictionaries() > scans pg_ts_dict table instead of scanning dshash table. If you're scanning pg_ts_dict, what happens with dictionaries belonging to other databases? They won't be visible in your local copy of pg_ts_dict. Between that and the inability to find leaked segments, I'm not seeing that this has much use-case. >> (It might work to use >> the combination of dictionary OID and TID of the dictionary's pg_ts_dict >> tuple as the lookup key for shared dictionaries. Oh, and have you >> thought about the possibility of conflicting OIDs in different DBs? >> Probably the database OID has to be part of the key, as well.) > Yes unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a > dictionary. TID can help here. I thought about using XID too when I > started to work on RELOAD command. But I'm not sure that it is a good > idea, anyway XID isn't needed in current version. Actually, existing practice is to check both xmin and tid; see for example where plpgsql checks if a cached function data structure still matches the pg_proc row, pl_comp.c around line 175 in HEAD. The other PLs do it similarly I think. I'm not sure offhand just how much that changes the risks of a false match compared to testing only one of these fields, but I'd recommend conforming to the way it's done elsewhere. regards, tom lane
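(A sketch of the freshness test Tom points at, modeled on what pl_comp.c does for cached pg_proc rows; the entry layout is assumed, not the patch's actual code.)

    HeapTuple   tup = SearchSysCache1(TSDICTOID, ObjectIdGetDatum(dictId));

    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for text search dictionary %u",
             dictId);

    if (entry->key.xmin == HeapTupleHeaderGetRawXmin(tup->t_data) &&
        ItemPointerEquals(&entry->key.tid, &tup->t_self))
    {
        /* same pg_ts_dict row as when the shared copy was built: reuse it */
    }
    else
    {
        /* the row was rewritten (ALTER, or DROP and re-CREATE): reload */
    }

    ReleaseSysCache(tup);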
On Sun, Mar 25, 2018 at 12:18:10AM -0400, Tom Lane wrote: > My thought was (a) the ROLLBACK case is ok, because the next use of > the dictionary will reload it, and (b) the reload-concurrently-with- > DROP case is annoying, because indeed it leaks, but the window is small > and it probably won't be an issue in practice. We would need to be > sure that the DSM segment goes away at postmaster restart, but given > that I think it'd be tolerable. Of course it'd be better not to have > the race, but I see no easy way to prevent it -- do you? I'm not sure that I understood the second case correctly. Can cache invalidation help in this case? I'm not very confident in my knowledge of cache invalidation. It seems to me that InvalidateTSCacheCallBack() should release the segment after commit. But the cache isn't invalidated if a backend is terminated after a dictionary reload. on_shmem_exit() could help, but we need a list of leaked dictionaries for that. P.S. I think it isn't right to release all dictionary segments in InvalidateTSCacheCallBack(). Otherwise any DROP can release all segments. It would be better to release only the specific dictionary. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Sun, Mar 25, 2018 at 02:28:29PM -0400, Tom Lane wrote: > Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > > If all dictionaries will be shareable then this view could be removed. > > Unfortunately I think it can't help with leaked segments, I didn't find > > a way to iterate dshash entries. That's why pg_ts_shared_dictionaries() > > scans pg_ts_dict table instead of scanning dshash table. > > If you're scanning pg_ts_dict, what happens with dictionaries belonging > to other databases? They won't be visible in your local copy of > pg_ts_dict. Between that and the inability to find leaked segments, > I'm not seeing that this has much use-case. Indeed, scanning pg_ts_dict is the wrong way here. And pg_ts_shared_dictionaries() is definitely broken. > > Yes unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a > > dictionary. TID can help here. I thought about using XID too when I > > started to work on RELOAD command. But I'm not sure that it is a good > > idea, anyway XID isn't needed in current version. > > Actually, existing practice is to check both xmin and tid; see for example > where plpgsql checks if a cached function data structure still matches the > pg_proc row, pl_comp.c around line 175 in HEAD. The other PLs do it > similarly I think. I'm not sure offhand just how much that changes the > risks of a false match compared to testing only one of these fields, but > I'd recommend conforming to the way it's done elsewhere. Thank you for pointing that out! I think it shouldn't be hard to use both xmin and tid. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > On Sun, Mar 25, 2018 at 12:18:10AM -0400, Tom Lane wrote: >> My thought was (a) the ROLLBACK case is ok, because the next use of >> the dictionary will reload it, and (b) the reload-concurrently-with- >> DROP case is annoying, because indeed it leaks, but the window is small >> and it probably won't be an issue in practice. We would need to be >> sure that the DSM segment goes away at postmaster restart, but given >> that I think it'd be tolerable. Of course it'd be better not to have >> the race, but I see no easy way to prevent it -- do you? > I'm not sure that I understood the second case correctly. Can cache > invalidation help in this case? I'm not very confident in my knowledge of > cache invalidation. It seems to me that InvalidateTSCacheCallBack() should > release the segment after commit. "Release after commit" sounds like a pretty dangerous design to me, because a release necessarily implies some kernel calls, which could fail. We can't afford to inject steps that might fail into post-commit cleanup (because it's too late to recover by failing the transaction). It'd be better to do cleanup while searching for a dictionary to use. I assume the DSM infrastructure already has some solution for getting rid of DSM segments when the last interested process disconnects, so maybe you could piggyback on that somehow. regards, tom lane
On Mon, Mar 26, 2018 at 11:27:48AM -0400, Tom Lane wrote: > Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > > I'm not sure that I understood the second case correclty. Can cache > > invalidation help in this case? I don't have confident knowledge of cache > > invalidation. It seems to me that InvalidateTSCacheCallBack() should > > release segment after commit. > > "Release after commit" sounds like a pretty dangerous design to me, > because a release necessarily implies some kernel calls, which could > fail. We can't afford to inject steps that might fail into post-commit > cleanup (because it's too late to recover by failing the transaction). > It'd be better to do cleanup while searching for a dictionary to use. > > I assume the DSM infrastructure already has some solution for getting > rid of DSM segments when the last interested process disconnects, > so maybe you could piggyback on that somehow. Yes, there is dsm_pin_mapping() for this. But it is necessary to keep a segment even if there are no attached processes. From 0003: + /* Remain attached until end of postmaster */ + dsm_pin_segment(seg); + /* Remain attached until end of session */ + dsm_pin_mapping(seg); -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Please find the attached new version of the patch. I got rid of the 0005 and 0006 parts. The max_shared_dictionaries_size variable, the Shareable option and the pg_ts_shared_dictionaries view are gone. On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote: > I do think it's required that changing the dictionary's options with > ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's > happening with this patch, I don't see where. (It might work to use > the combination of dictionary OID and TID of the dictionary's pg_ts_dict > tuple as the lookup key for shared dictionaries. Oh, and have you > thought about the possibility of conflicting OIDs in different DBs? > Probably the database OID has to be part of the key, as well.) The database OID, the dictionary OID, TID and XMIN are now used as the lookup key. > Also, the scheme for releasing the dictionary DSM during > RemoveTSDictionaryById is uncertain and full of race conditions: > the DROP might roll back later, or someone might come along and > start using the dictionary (causing a fresh DSM load) before the > DROP commits and makes the dictionary invisible to other sessions. > I don't think that either of those are necessarily fatal objections, > but there needs to be some commentary there explaining what happens. The dictionary's DSM segment now stays alive until the postmaster terminates. But when the dictionary is dropped or altered, the previous (invalid) segment is unpinned. The segment itself is released when all backends unpin the mapping in lookup_ts_parser_cache() or disconnect. The problem arises when a dictionary was used by some process before being dropped or altered, isn't used afterwards, and the process lives a very long time. In this situation the mapping isn't unpinned and the segment isn't released. The other problem is that the TsearchDictEntry isn't removed if ts_dict_shmem_release() wasn't called. This may happen after dropping the dictionary. > BTW, I was going to complain that this patch alters the API for > dictionary template init functions without any documentation updates; > but then I realized that there isn't any documentation to update. > That pretty well sucks, but I suppose it's not the job of this patch > to improve that situation. Still, you could spend a bit more effort on > the commentary in ts_public.h in 0002, because that commentary is as > close to an API spec as we've got. I improved the commentary in ts_public.h a little bit. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Here is the new version of the patch. Now RemoveTSDictionaryById() and AlterTSDictionary() unpin the dictionary DSM segment. So if all attached backends disconnect, the allocated DSM segments will be released. lookup_ts_dictionary_cache() may unpin the DSM mapping for all invalid dictionary cache entries. I added xmax to DictPointerData. It is used as part of the lookup key now too. It helps to reload a dictionary after a rolled-back DROP command. There was a bug in ts_dict_shmem_location(); I fixed it. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Hello all,
I'd like to add a new optional function named fini to the text search template, in addition to init() and lexize(). It will be called by RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will call ts_dict_shmem_release().
It doesn't change the segment-leaking situation, but I think it makes the text search API more transparent.
I'll update the existing documentation. And I think I can add text search API documentation in the 2018-09 commitfest, since Tom noticed that it doesn't exist.
Any thoughts?
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 03/31/2018 12:42 PM, Arthur Zakirov wrote: > Hello all, > > I'd like to add a new optional function named fini to the text search > template, in addition to init() and lexize(). It will be called by > RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will > call ts_dict_shmem_release(). > > It doesn't change the segment-leaking situation, but I think it makes > the text search API more transparent. > If it doesn't actually solve the problem, why add it? I don't see a point in adding functions for the sake of transparency, when it does not in fact serve any use cases. Can't we handle the segment-leaking by adding some sort of tombstone? For example, imagine that instead of removing the hash table entry we mark it as 'dropped'. And after that, after the lookup we would know the dictionary was removed, and the backends would load the dictionary into their private memory. Of course, this could mean we end up having many tombstones in the hash table. But those tombstones would be tiny, making it less painful than potentially leaking much more memory for the dictionaries. Also, I wonder if we might actually remove the dictionaries after a while, e.g. based on XID. Imagine that we note the XID of the transaction removing the dictionary, or perhaps XID of the most recent running transaction. Then we could use this to decide if all running transactions actually see the DROP, and we could remove the tombstone. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
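(A sketch of the tombstone idea, with the flag and the fallback helper invented for illustration; the dshash calls are the real API, the rest is assumed.)

    typedef struct TsearchDictEntry
    {
        DictPointerData key;
        dsm_handle      segment_handle;
        bool            dropped;        /* tombstone: set when DROP commits */
    } TsearchDictEntry;

    /* in the lookup path */
    entry = dshash_find(dict_table, &key, false);
    if (entry != NULL && entry->dropped)
    {
        /* the dictionary is gone: fall back to a backend-local copy */
        dshash_release_lock(dict_table, entry);
        return load_dict_into_local_memory(dictId);     /* hypothetical */
    }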
Tomas Vondra wrote:
On 03/31/2018 12:42 PM, Arthur Zakirov wrote:
> Hello all,
>
> I'd like to add a new optional function named fini to the text search
> template, in addition to init() and lexize(). It will be called by
> RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will
> call ts_dict_shmem_release().
>
> It doesn't change the segment-leaking situation, but I think it makes
> the text search API more transparent.
>
If it doesn't actually solve the problem, why add it? I don't see a
point in adding functions for the sake of transparency, when it does not
in fact serve any use cases.
It doesn't solve the problem. But it brings more clarity: if a dictionary requested a shared location then it should release/unpin it. There is no such scenario yet, but someone might want to release not only the shared segment but also other private data.
Can't we handle the segment-leaking by adding some sort of tombstone?
It is interesting that there are such tombstones already, without the patch. TSDictionaryCacheEntry entries aren't deleted after DROP, they are just marked isvalid = false.
For example, imagine that instead of removing the hash table entry we
mark it as 'dropped'. And after that, after the lookup we would know the
dictionary was removed, and the backends would load the dictionary into
their private memory.
Of course, this could mean we end up having many tombstones in the hash
table. But those tombstones would be tiny, making it less painful than
potentially leaking much more memory for the dictionaries.
Actually, it isn't guaranteed now that the hash table entry will be removed, even if refcnt is 0. So I think I should remove refcnt, and entries won't be removed.
There are no big problems with leaking now. Memory may leak only if a dictionary was dropped or altered, there is no text search workload anymore, and the backend is still alive. This is because the next use of text search functions will unpin segments used before for invalid dictionaries (isvalid == false). Also, the segment is unpinned if the backend terminates. The segment is destroyed when all interested processes unpin it (as Tom noticed), and the hash table entry becomes a tombstone.
I hope the description is clear.
Also, I wonder if we might actually remove the dictionaries after a
while, e.g. based on XID. Imagine that we note the XID of the
transaction removing the dictionary, or perhaps XID of the most recent
running transaction. Then we could use this to decide if all running
transactions actually see the DROP, and we could remove the tombstone.
Maybe autovacuum should work here too :) That is a joke, of course. I'm not very familiar with how dead tuples are removed, but I think this is a similar case.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Thu, Mar 29, 2018 at 02:03:07AM +0300, Arthur Zakirov wrote: > Here is the new version of the patch. Please find the attached new version of the patch. I removed refcnt because it is useless: it doesn't guarantee that a hash table entry will be removed. I also fixed a bug: dsm_unpin_segment() could be called twice if a transaction which called it was aborted and another transaction then calls ts_dict_shmem_release(). I added segment_ispinned to fix it. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Tue, Mar 27, 2018 at 8:19 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: >> I assume the DSM infrastructure already has some solution for getting >> rid of DSM segments when the last interested process disconnects, >> so maybe you could piggyback on that somehow. > > Yes, there is dsm_pin_mapping() for this. But it is necessary to keep a > segment even if there are no attached processes. From 0003: > > + /* Remain attached until end of postmaster */ > + dsm_pin_segment(seg); > + /* Remain attached until end of session */ > + dsm_pin_mapping(seg); I don't quite understand the problem you're trying to solve here, but: 1. Unless dsm_pin_segment() is called, a DSM segment will automatically be removed when there are no remaining mappings. 2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped when the currently-in-scope resource owner is cleaned up, like at the end of the query. If it is called, then the mapping will stick around until the backend exits. If you pin the mapping or the segment and later no longer want it pinned, there are dsm_unpin_mapping() and dsm_unpin_segment() functions available, too. So it seems like what you might want to do is pin the segment when it's created, and then unpin it if it's stale/obsolete. The latter won't remove it immediately, but will once all the mappings are gone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
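(Robert's points 1 and 2, restated as a sketch of the intended segment lifecycle; variable names are arbitrary.)

    /* creation: pin the segment so it survives with zero attachments */
    dsm_segment *seg = dsm_create(size, 0);
    dsm_pin_segment(seg);
    dsm_handle   handle = dsm_segment_handle(seg);

    /* another backend: attach and keep the mapping past the query */
    dsm_segment *seg2 = dsm_attach(handle);
    dsm_pin_mapping(seg2);      /* else unmapped at resource-owner cleanup */

    /* later, when the dictionary is known to be stale: drop the extra
     * reference; the segment is actually destroyed only after the last
     * remaining mapping goes away */
    dsm_unpin_segment(handle);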
Hello, On Tue, May 15, 2018 at 05:02:43PM -0400, Robert Haas wrote: > On Tue, Mar 27, 2018 at 8:19 AM, Arthur Zakirov > <a.zakirov@postgrespro.ru> wrote: > > Yes, there is dsm_pin_mapping() for this. But it is necessary to keep a > > segment even if there are no attached processes. From 0003: > > > > + /* Remain attached until end of postmaster */ > > + dsm_pin_segment(seg); > > + /* Remain attached until end of session */ > > + dsm_pin_mapping(seg); > > I don't quite understand the problem you're trying to solve here, but: > > 1. Unless dsm_pin_segment() is called, a DSM segment will > automatically be removed when there are no remaining mappings. > > 2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped > when the currently-in-scope resource owner is cleaned up, like at the > end of the query. If it is called, then the mapping will stick around > until the backend exits. I tried to solve the case where a DSM segment remains mapped even after a dictionary was dropped. It may happen in the following situation: Backend 1: =# select ts_lexize('english_shared', 'test'); -- The dictionary is loaded into DSM, the segment and the mapping are pinned ... -- Call ts_lexize() from backend 2 below =# drop text search dictionary english_shared; -- The segment and the mapping are unpinned, see ts_dict_shmem_release() Backend 2: =# select ts_lexize('english_shared', 'test'); -- The dictionary is taken from DSM, the mapping is pinned ... -- The dictionary was dropped by backend 1, but the mapping is still pinned As you can see, the DSM is still pinned by backend 2. Later I fixed it by checking whether we need to unpin segments. In the current version of the patch do_ts_dict_shmem_release() is called in lookup_ts_dictionary_cache(). It unpins segments if the text search cache was invalidated. It unpins all segments, but I think that is OK since text search changes should be infrequent. > If you pin the mapping or the segment and later no longer want it > pinned, there are dsm_unpin_mapping() and dsm_unpin_segment() > functions available, too. So it seems like what you might want to do > is pin the segment when it's created, and then unpin it if it's > stale/obsolete. The latter won't remove it immediately, but will once > all the mappings are gone. Yes, dsm_unpin_mapping() and dsm_unpin_segment() will be called when the dictionary is dropped or altered in the current version of the patch. I described the approach above. In sum, I think the problem is mostly solved. Backend 2 unpins the segment in the next ts_lexize() call. But if backend 2 doesn't call ts_lexize() (or another TS function) anymore, the segment will remain mapped. It is the only problem I see for now. I hope the description is clear. I attached the rebased patch. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Wed, May 16, 2018 at 7:36 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: >> I don't quite understand the problem you're trying to solve here, but: >> >> 1. Unless dsm_pin_segment() is called, a DSM segment will >> automatically be removed when there are no remaining mappings. >> >> 2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped >> when the currently-in-scope resource owner is cleaned up, like at the >> end of the query. If it is called, then the mapping will stick around >> until the backend exits. > > I tried to solve the case when DSM segment remains mapped even a > dictionary was dropped. It may happen in the following situation: > > Backend 1: > > =# select ts_lexize('english_shared', 'test'); > -- The dictionary is loaded into DSM, the segment and the mapping is > pinned > ... > -- Call ts_lexize() from backend 2 below > =# drop text search dictionary english_shared; > -- The segment and the mapping is unpinned, see ts_dict_shmem_release() > > Backend 2: > > =# select ts_lexize('english_shared', 'test'); > -- The dictionary got from DSM, the mapping is pinned > ... > -- The dictionary was dropped by backend 1, but the mapping still is > pinned Yeah, there's really nothing we can do about that (except switch from processes to threads). There's no way for one process to force another process to unmap something. As you've observed, you can get it to be dropped eventually, but not immediately. > In sum, I think the problem is mostly solved. Backend 2 unpins the > segment in next ts_lexize() call. But if backend 2 doesn't call > ts_lexize() (or other TS function) anymore the segment will remain mapped. > It is the only problem I see for now. Maybe you could use CacheRegisterSyscacheCallback to get a callback when the backend notices that a DROP has occurred. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
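A rough sketch of that suggestion; dict_inval_callback() and its body are hypothetical, but CacheRegisterSyscacheCallback() and the TSDICTOID syscache are real. Note that the callback fires on any invalidation of that cache, not only on DROP:

#include "postgres.h"
#include "utils/inval.h"
#include "utils/syscache.h"

/* Hypothetical callback: mark matching entries for release at the next
 * dictionary-cache lookup rather than unmapping anything right here. */
static void
dict_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
{
    /* e.g. set isvalid = false on affected backend-local cache entries */
}

void
register_dict_invalidation(void)
{
    CacheRegisterSyscacheCallback(TSDICTOID, dict_inval_callback, (Datum) 0);
}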
On Wed, May 16, 2018 at 09:33:46AM -0400, Robert Haas wrote: > > In sum, I think the problem is mostly solved. Backend 2 unpins the > > segment in next ts_lexize() call. But if backend 2 doesn't call > > ts_lexize() (or other TS function) anymore the segment will remain mapped. > > It is the only problem I see for now. > > Maybe you could use CacheRegisterSyscacheCallback to get a callback > when the backend notices that a DROP has occurred. Yes, that was the first approach: DSM segments were unpinned in InvalidateTSCacheCallBack(), which is registered using CacheRegisterSyscacheCallback(). I don't have deep knowledge of the guts of invalidation callbacks. It seems that there is a problem with it. Tom pointed out above: > > I'm not sure that I understood the second case correclty. Can cache > > invalidation help in this case? I don't have confident knowledge of cache > > invalidation. It seems to me that InvalidateTSCacheCallBack() should > > release segment after commit. > > "Release after commit" sounds like a pretty dangerous design to me, > because a release necessarily implies some kernel calls, which could > fail. We can't afford to inject steps that might fail into post-commit > cleanup (because it's too late to recover by failing the transaction). > It'd be better to do cleanup while searching for a dictionary to use. But it is possible that I misunderstood his note. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Wed, May 16, 2018 at 4:42 PM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > I haven't deep knowledge about guts of invalidation callbacks. It seems > that there is problem with it. Tom pointed above: > >> > I'm not sure that I understood the second case correclty. Can cache >> > invalidation help in this case? I don't have confident knowledge of cache >> > invalidation. It seems to me that InvalidateTSCacheCallBack() should >> > release segment after commit. >> >> "Release after commit" sounds like a pretty dangerous design to me, >> because a release necessarily implies some kernel calls, which could >> fail. We can't afford to inject steps that might fail into post-commit >> cleanup (because it's too late to recover by failing the transaction). >> It'd be better to do cleanup while searching for a dictionary to use. > > But it is possible that I misunderstood his note. I think you and Tom have misunderstood each other somehow. If you look at CommitTransaction(), you will see a comment that says: * This is all post-commit cleanup. Note that if an error is raised here, * it's too late to abort the transaction. This should be just * noncritical resource releasing. Between that point and the end of that function, we shouldn't do anything that throws an error, because the transaction is already committed and it's too late to change our mind. But if session A drops an object, session B is not going to get a callback to InvalidateTSCacheCallBack at that point. It's going to happen sometime in the middle of the transaction, like when it next tries to lock a relation or something. So Tom's complaint is irrelevant in that scenario. Also, there is no absolute prohibition on kernel calls in post-commit cleanup, or in no-fail code in general. For example, the RESOURCE_RELEASE_AFTER_LOCKS phase of resowner cleanup calls FileClose(). That's actually completely alarming when you really think about it, because one of the documented return values for close() is EIO, which certainly represents a very dangerous kind of failure -- see nearby threads about fsync-safety. Transaction abort acquires and releases numerous LWLocks, which can result in kernel calls that could fail. We're OK with that because, in practice, it never happens. Unmapping a DSM segment is probably about as safe as acquiring and releasing an LWLock, maybe safer. On my MacBook, the only documented return value for munmap is EINVAL, and any such error would indicate a PostgreSQL bug (or a kernel bug, or a cosmic ray hit). I checked a Linux system; things there are less clear, because mmap and munmap share a single man page, and mmap can fail for all kinds of reasons. But very few of the listed error codes look like things that could legitimately happen during munmap. Also, if munmap did fail (or shmdt/shmctl if using System V shared memory), it would be reported as a WARNING, not an ERROR, so we'd still be sorta OK. I think the only real question here is whether it's safe, at a high level, to drop the object at time T0 and have various backends drop the mapping at unpredictable later times T1, T2, ... all greater than T0. Generally, one wants to remove all references to an object before the object itself, which in this case we can't. Assuming that we can convince ourselves that that much is OK, I don't see why using a syscache callback to help ensure that the mappings are blown away in an at-least-somewhat-timely fashion is worse than any other approach.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... Assuming that we can > convince ourselves that that much is OK, I don't see why using a > syscache callback to help ensure that the mappings are blown away in > an at-least-somewhat-timely fashion is worse than any other approach. I think the point you've not addressed is that "syscache callback occurred" does not equate to "object was dropped". Can the code survive having this occur at any invalidation point? (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.) regards, tom lane
On Thu, May 17, 2018 at 09:57:59AM -0400, Robert Haas wrote: > I think you and Tom have misunderstood each other somehow. If you > look at CommitTransaction(), you will see a comment that says: Oh, I see. You are right. > Also, there is no absolute prohibition on kernel calls in post-commit > cleanup, or in no-fail code in general. Thank you for the explanation! The current approach depends on syscache callbacks anyway. Backend 2 (from the example above) knows whether it is necessary to unpin segments after a syscache callback was called. Tom pointed out below that callbacks occur on various events. So I think I should check the current approach too using CLOBBER_CACHE_ALWAYS. It could show some problems in the current patch. Then if everything is OK I think I'll check another approach (unmapping in the TS syscache callback) using CLOBBER_CACHE_ALWAYS. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... Assuming that we can >> convince ourselves that that much is OK, I don't see why using a >> syscache callback to help ensure that the mappings are blown away in >> an at-least-somewhat-timely fashion is worse than any other approach. > > I think the point you've not addressed is that "syscache callback > occurred" does not equate to "object was dropped". Can the code > survive having this occur at any invalidation point? > (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.) Well, I'm not advocating for a lack of testing, and CLOBBER_CACHE_ALWAYS testing is a good idea. However, I suspect that calling dsm_detach() from a syscache callback should be fine. Obviously there will be trouble if the surrounding code is still using that mapping, but that would be a bug at some higher level, like using an object without locking it. And there will be trouble if you register an on_dsm_detach callback that does something strange, but the ones that the core code installs (when you use shm_mq, for example) should be safe. And there will be trouble if you're not careful about memory contexts, because someplace you probably need to remember that you detached from that DSM so you don't try to do it again, and you'd better be sure you have the right context selected when updating your data structures. But it all seems pretty solvable. I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think the point you've not addressed is that "syscache callback >> occurred" does not equate to "object was dropped". Can the code >> survive having this occur at any invalidation point? >> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.) > Well, I'm not advocating for a lack of testing, and > CLOBBER_CACHE_ALWAYS testing is a good idea. However, I suspect that > calling dsm_detach() from a syscache callback should be fine. > Obviously there will be trouble if the surrounding code is still using > that mapping, but that would be a bug at some higher level, like using > an object without locking it. No, you're clearly not getting the point. You could have an absolutely airtight exclusive lock of any description whatsoever, and that would provide no guarantee at all that you don't get a cache flush callback. It's only a cache, not a catalog, and it can get flushed for any reason or no reason. (That's why we have pin counts on catcache and relcache entries, rather than assuming that locking the corresponding object is enough.) So I think it's highly likely that unmapping in a syscache callback is going to lead quickly to SIGSEGV. The only way it would not is if we keep the shared dictionary mapped only in short straight-line code segments that never do any other catalog accesses ... which seems awkward, inefficient, and error-prone. Do we actually need to worry about unmapping promptly on DROP TEXT DICTIONARY? It seems like the only downside of not doing that is that we'd leak some address space until process exit. If you were thrashing dictionaries at some unreasonable rate on a 32-bit host, you might eventually run some sessions out of address space; but that doesn't seem like a situation that's so common that we need fragile coding to avoid it. regards, tom lane
On Thu, May 17, 2018 at 1:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think the point you've not addressed is that "syscache callback >>> occurred" does not equate to "object was dropped". Can the code >>> survive having this occur at any invalidation point? >>> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.) > >> Well, I'm not advocating for a lack of testing, and >> CLOBBER_CACHE_ALWAYS testing is a good idea. However, I suspect that >> calling dsm_detach() from a syscache callback should be fine. >> Obviously there will be trouble if the surrounding code is still using >> that mapping, but that would be a bug at some higher level, like using >> an object without locking it. > > No, you're clearly not getting the point. You could have an absolutely > airtight exclusive lock of any description whatsoever, and that would > provide no guarantee at all that you don't get a cache flush callback. > It's only a cache, not a catalog, and it can get flushed for any reason > or no reason. (That's why we have pin counts on catcache and relcache > entries, rather than assuming that locking the corresponding object is > enough.) So I think it's highly likely that unmapping in a syscache > callback is going to lead quickly to SIGSEGV. The only way it would not > is if we keep the shared dictionary mapped only in short straight-line > code segments that never do any other catalog accesses ... which seems > awkward, inefficient, and error-prone. Yeah, that's true, but again, you can work around that problem. A DSM mapping is fundamentally not that different from a backend-private memory allocation. If you can avoid freeing memory while you're referencing it -- as the catcache and the syscache clearly do -- you can avoid it here, too. > Do we actually need to worry about unmapping promptly on DROP TEXT > DICTIONARY? It seems like the only downside of not doing that is that > we'd leak some address space until process exit. If you were thrashing > dictionaries at some unreasonable rate on a 32-bit host, you might > eventually run some sessions out of address space; but that doesn't seem > like a situation that's so common that we need fragile coding to avoid it. I'm not sure what the situation is here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 17, 2018 at 10:18:56AM -0400, Tom Lane wrote: > I think the point you've not addressed is that "syscache callback > occurred" does not equate to "object was dropped". Can the code > survive having this occur at any invalidation point? > (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.) Thank you for the idea of testing with CLOBBER_CACHE_ALWAYS. I built postgres with it and ran the regression tests. I tested both approaches. At first glance they passed the tests. There are no concurrent tests for the text search feature with two or more connections. Maybe it would be useful to add such tests. I did it manually, but it would be better to have a script. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Thu, May 17, 2018 at 02:14:07PM -0400, Robert Haas wrote: > On Thu, May 17, 2018 at 1:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Do we actually need to worry about unmapping promptly on DROP TEXT > > DICTIONARY? It seems like the only downside of not doing that is that > > we'd leak some address space until process exit. If you were thrashing > > dictionaries at some unreasonable rate on a 32-bit host, you might > > eventually run some sessions out of address space; but that doesn't seem > > like a situation that's so common that we need fragile coding to avoid it. > > I'm not sure what the situation is here. I think this case may arise when you continuously create and drop a lot of dictionaries; different connections work with them concurrently, and some connection stops using text search at some point, so its pinned segments won't be unpinned. But I'm not sure this is a realistic case. Text search configuration changes should be very infrequent (as the commentary on InvalidateTSCacheCallBack notes). -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Wed, May 16, 2018 at 02:36:33PM +0300, Arthur Zakirov wrote: > ... I attached the rebased patch. I attached a new version of the patch. I found a bug when the CompoundAffix, SuffixNodes, PrefixNodes, and DictNodes of the IspellDictData structure are empty. Now they have a terminating entry and therefore always contain at least one node entry. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Thu, Jun 14, 2018 at 11:40:17AM +0300, Arthur Zakirov wrote: > I attached new version of the patch. The patch still applies to HEAD. I moved it to the next commitfest. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 01.10.2018 12:22, Arthur Zakirov wrote: > On Thu, Jun 14, 2018 at 11:40:17AM +0300, Arthur Zakirov wrote: >> I attached new version of the patch. > > The patch still applies to HEAD. I moved it to the next commitfest. > Here is the rebased patch. I also updated the copyright in ts_shared.h and ts_shared.c. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Hello Arthur, I've looked at the patch today, and in general it seems quite solid to me. I do have a couple of minor points: 1) I think the comments need more work. Instead of describing all the individual changes here, I've outlined those improvements in attached patches (see the attached "tweaks" patches). Some of it is formatting, minor rewording or larger changes. Some comments are rather redundant (e.g. the one before calls to release the DSM segment). 2) It's not quite clear to me why we need DictInitData, which simply combines DictPointerData and list of options. It seems as if the only point is to pass a single parameter to the init function, but is it worth it? Why not get rid of DictInitData entirely and pass two parameters instead? 3) I find it a bit cumbersome that before each ts_dict_shmem_release call we construct a dummy DictPointerData value. Why not pass individual parameters and construct the struct in the function? 4) The reference to max_shared_dictionaries_size is obsolete, because there's no such limit anymore. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Fix-ispell-memory-handling.patch
- 0002-Change-tmplinit-argument.patch
- 0003-Change-tmplinit-argument-tweaks.patch
- 0004-Retrieve-shared-location-for-dict.patch
- 0005-Retrieve-shared-location-for-dict-tweak.patch
- 0006-Store-ispell-in-shared-location.patch
- 0007-Store-ispell-in-shared-location-tweaks.patch
Hello Tomas, On 16.01.2019 03:23, Tomas Vondra wrote: > I've looked at the patch today, and in general is seems quite solid to > me. I do have a couple of minor points > > 1) I think the comments need more work. Instead of describing all the > individual changes here, I've outlined those improvements in attached > patches (see the attached "tweaks" patches). Some of it is formatting, > minor rewording or larger changes. Some comments are rather redundant > (e.g. the one before calls to release the DSM segment). Thank you! > 2) It's not quite clear to me why we need DictInitData, which simply > combines DictPointerData and list of options. It seems as if the only > point is to pass a single parameter to the init function, but is it > worth it? Why not to get rid of DictInitData entirely and pass two > parameters instead? Initially the init method had two parameters. But in the v7 patch I added the DictInitData struct instead of two parameters (the list of options and DictPointerData): https://www.postgresql.org/message-id/20180319110648.GA32319%40zakirov.localdomain I don't have a way to change a template's init method from init_method(internal) to init_method(internal, internal) in an extension's upgrade script. If I'm not mistaken we need new syntax here, like ALTER TEXT SEARCH TEMPLATE. Thoughts? > 3) I find it a bit cumbersome that before each ts_dict_shmem_release > call we construct a dummy DickPointerData value. Why not to pass > individual parameters and construct the struct in the function? Agreed, it may look too verbose. I'll change it. > 4) The reference to max_shared_dictionaries_size is obsolete, because > there's no such limit anymore. Yeah, I'll fix it. > /* XXX not really a pointer, so the name is misleading */ I think we don't need the DictPointerData struct anymore, because only the ts_dict_shmem_release function needs it (see comments above), and we only need it for the hash search. I'll move all fields of DictPointerData to the TsearchDictKey struct. > XXX "supported" is not the same as "all ispell dicts behave like that". I'll reword the sentence. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
I attached the files of the new version of the patch; I applied your tweaks. > XXX All dictionaries, but only when there's invalid dictionary? I've made a little optimization: I introduced hashvalue into TSDictionaryCacheEntry. Now only the DSM of altered or dropped dictionaries is released. > > /* XXX not really a pointer, so the name is misleading */ > > I think we don't need DictPointerData struct anymore, because only > ts_dict_shmem_release function needs it (see comments above) and we only > need it to hash search. I'll move all fields of DictPointerData to > TsearchDictKey struct. I was wrong: DictInitData also needs DictPointerData. I didn't remove DictPointerData; I renamed it to DictEntryData. I hope that it is a more appropriate name. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
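For illustration, a rough sketch of that hashvalue optimization; the entry layout and helper below are guesses rather than the patch's actual code, while GetSysCacheHashValue1() is the real helper for computing a syscache hash for a given key:

#include "postgres.h"
#include "utils/syscache.h"

/* Hypothetical cache entry, per the description above; hashvalue would
 * be GetSysCacheHashValue1(TSDICTOID, ObjectIdGetDatum(dictId)). */
typedef struct TSDictionaryCacheEntry
{
    Oid     dictId;
    uint32  hashvalue;  /* syscache hash of this entry's pg_ts_dict tuple */
    bool    isvalid;
} TSDictionaryCacheEntry;

/*
 * Sketch of the invalidation-callback body: only entries whose stored
 * hashvalue matches the invalidated tuple are marked, so only the DSM of
 * altered or dropped dictionaries is released later.  A zero hashvalue
 * conventionally means "invalidate everything".
 */
static void
mark_dict_entry(TSDictionaryCacheEntry *entry, uint32 hashvalue)
{
    if (hashvalue == 0 || entry->hashvalue == hashvalue)
        entry->isvalid = false;
}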
On 1/17/19 3:15 PM, Arthur Zakirov wrote: > I attached files of new version of the patch, I applied your tweaks. > >> XXX All dictionaries, but only when there's invalid dictionary? > > I've made a little optimization. I introduced hashvalue into > TSDictionaryCacheEntry. Now released only DSM of altered or dropped > dictionaries. > >> > /* XXX not really a pointer, so the name is misleading */ >> >> I think we don't need DictPointerData struct anymore, because only >> ts_dict_shmem_release function needs it (see comments above) and we only >> need it to hash search. I'll move all fields of DictPointerData to >> TsearchDictKey struct. > > I was wrong, DictInitData also needs DictPointerData. I didn't remove > DictPointerData, I renamed it to DictEntryData. Hope that it is a more > appropriate name. > Thanks. I've reviewed v17 today and I haven't discovered any new issues so far. If everything goes fine and no one protests, I plan to get it committed over the next week or so. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote: > On 1/17/19 3:15 PM, Arthur Zakirov wrote: > > I attached files of new version of the patch, I applied your tweaks. > > > >> XXX All dictionaries, but only when there's invalid dictionary? > > > > I've made a little optimization. I introduced hashvalue into > > TSDictionaryCacheEntry. Now released only DSM of altered or dropped > > dictionaries. > > > >> > /* XXX not really a pointer, so the name is misleading */ > >> > >> I think we don't need DictPointerData struct anymore, because only > >> ts_dict_shmem_release function needs it (see comments above) and we only > >> need it to hash search. I'll move all fields of DictPointerData to > >> TsearchDictKey struct. > > > > I was wrong, DictInitData also needs DictPointerData. I didn't remove > > DictPointerData, I renamed it to DictEntryData. Hope that it is a more > > appropriate name. > > > > Thanks. I've reviewed v17 today and I haven't discovered any new issues > so far. If everything goes fine and no one protests, I plan to get it > committed over the next week or so. There doesn't seem to be any docs about what's needed to be able to take advantage of shared dicts, and how to prevent them from permanently taking up a significant share of memory. Greetings, Andres Freund
On 1/20/19 11:21 PM, Andres Freund wrote: > On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote: >> On 1/17/19 3:15 PM, Arthur Zakirov wrote: >>> I attached files of new version of the patch, I applied your tweaks. >>> >>>> XXX All dictionaries, but only when there's invalid dictionary? >>> >>> I've made a little optimization. I introduced hashvalue into >>> TSDictionaryCacheEntry. Now released only DSM of altered or dropped >>> dictionaries. >>> >>>> > /* XXX not really a pointer, so the name is misleading */ >>>> >>>> I think we don't need DictPointerData struct anymore, because only >>>> ts_dict_shmem_release function needs it (see comments above) and we only >>>> need it to hash search. I'll move all fields of DictPointerData to >>>> TsearchDictKey struct. >>> >>> I was wrong, DictInitData also needs DictPointerData. I didn't remove >>> DictPointerData, I renamed it to DictEntryData. Hope that it is a more >>> appropriate name. >>> >> >> Thanks. I've reviewed v17 today and I haven't discovered any new issues >> so far. If everything goes fine and no one protests, I plan to get it >> committed over the next week or so. > > There doesn't seem to be any docs about what's needed to be able to take > advantage of shared dicts, and how to prevent them from permanently > taking up a significant share of memory. > Yeah, those are good points. I agree the comments might be clearer, but essentially ispell dictionaries are shared and everything else is not. As for the memory consumption / unloading dicts - I agree that's something we need to address. There used to be a way to specify a memory limit and the ability to unload dictionaries explicitly, but both features have been ditched. The assumption was that UNLOAD would be introduced later, but that does not seem to have happened. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21.01.2019 02:43, Tomas Vondra wrote: > On 1/20/19 11:21 PM, Andres Freund wrote: >> On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote: >>> Thanks. I've reviewed v17 today and I haven't discovered any new issues >>> so far. If everything goes fine and no one protests, I plan to get it >>> committed over the next week or so. >> >> There doesn't seem to be any docs about what's needed to be able to take >> advantage of shared dicts, and how to prevent them from permanently >> taking up a significant share of memory. >> > > Yeah, those are good points. I agree the comments might be clearer, but > essentially ispell dictionaries are shared and everything else is not. > > As for the memory consumption / unloading dicts - I agree that's > something we need to address. There used to be a way to specify memory > limit and ability to unload dictionaries explicitly, but both features > have been ditched. The assumption was that UNLOAD would be introduced > later, but that does not seem to have happened. I'll try to implement the syntax you suggested earlier: ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD The main point here is that UNLOAD/RELOAD can't release the memory immediately, because some other backend may pin a DSM. The second point we should consider (I think) is how we know which dictionary should be unloaded. There was such a function earlier, but it was removed. But what about adding information to psql's "\dFd" command output? It could be a column which shows whether a dictionary is loaded. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 1/21/19 12:51 PM, Arthur Zakirov wrote: > On 21.01.2019 02:43, Tomas Vondra wrote: >> On 1/20/19 11:21 PM, Andres Freund wrote: >>> On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote: >>>> Thanks. I've reviewed v17 today and I haven't discovered any new issues >>>> so far. If everything goes fine and no one protests, I plan to get it >>>> committed over the next week or so. >>> >>> There doesn't seem to be any docs about what's needed to be able to take >>> advantage of shared dicts, and how to prevent them from permanently >>> taking up a significant share of memory. >>> >> >> Yeah, those are good points. I agree the comments might be clearer, but >> essentially ispell dictionaries are shared and everything else is not. >> >> As for the memory consumption / unloading dicts - I agree that's >> something we need to address. There used to be a way to specify memory >> limit and ability to unload dictionaries explicitly, but both features >> have been ditched. The assumption was that UNLOAD would be introduced >> later, but that does not seem to have happened. > > I'll try to implement the syntax, you suggested earlier: > > ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD > > The main point here is that UNLOAD/RELOAD can't release the memory > immediately, because some other backend may pin a DSM. > > The second point we should consider (I think) - how do we know which > dictionary should be unloaded. There was such function earlier, which > was removed. But what about adding an information in the "\dFd" psql's > command output? It could be a column which shows is a dictionary loaded. > The UNLOAD capability is probably a good start, but it's entirely manual and I wonder if it's putting too much burden on the user. I mean, the user has to realize the dictionaries are using a lot of shared memory, has to decide which to unload, and then has to do UNLOAD on it. That's not quite straightforward, especially if there's no way to determine which dictionaries are currently loaded and how much memory they use :-( Of course, the problem is not exactly new - we don't show dictionaries already loaded into private memory. The only thing we have is "unload" capability by closing the connection. OTOH the memory consumption should be much lower thanks to using shared memory. So I think the patch is an improvement even in this regard. I wonder if we could devise some simple cache eviction policy. We don't have any memory limit GUC anymore, but maybe we could unload dictionaries that were unused for a sufficient amount of time (a couple of minutes or so). Of course, the question is when exactly it would happen (it seems far too expensive to invoke on each dict access, and it should happen even when the dicts are not accessed at all). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
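For what it's worth, a minimal sketch of what such a time-based policy might look like; the struct and the five-minute constant are assumptions, while GetCurrentTimestamp() and TimestampDifferenceExceeds() are the real timestamp helpers:

#include "postgres.h"
#include "utils/timestamp.h"

#define DICT_IDLE_TIMEOUT_MS    (5 * 60 * 1000) /* assumed: 5 minutes */

/* Hypothetical bookkeeping: when was this dictionary last used? */
typedef struct SharedDictUsage
{
    TimestampTz last_used;
} SharedDictUsage;

/* Called on each dictionary access; cheap, just a clock read. */
static inline void
dict_touch(SharedDictUsage *u)
{
    u->last_used = GetCurrentTimestamp();
}

/* The eviction test; when and where to run it is the open question. */
static bool
dict_is_unused(SharedDictUsage *u)
{
    return TimestampDifferenceExceeds(u->last_used,
                                      GetCurrentTimestamp(),
                                      DICT_IDLE_TIMEOUT_MS);
}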
On 21.01.2019 17:56, Tomas Vondra wrote: > On 1/21/19 12:51 PM, Arthur Zakirov wrote: >> I'll try to implement the syntax, you suggested earlier: >> >> ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD >> >> The main point here is that UNLOAD/RELOAD can't release the memory >> immediately, because some other backend may pin a DSM. >> >> The second point we should consider (I think) - how do we know which >> dictionary should be unloaded. There was such function earlier, which >> was removed. But what about adding an information in the "\dFd" psql's >> command output? It could be a column which shows is a dictionary loaded. >> > ...The only thing we have is "unload" capability by closing the > connection... BTW, even if the connection was closed and there are no other connections, a dictionary still remains "loaded". That is because dsm_pin_segment() is called when the dictionary is loaded into DSM. > ... > I wonder if we could devise some simple cache eviction policy. We don't > have any memory limit GUC anymore, but maybe we could use unload > dictionaries that were unused for sufficient amount of time (a couple of > minutes or so). Of course, the question is when exactly would it happen > (it seems far too expensive to invoke on each dict access, and it should > happen even when the dicts are not accessed at all). Yes, I thought about such a feature too. Agreed, it could be expensive, since we need to scan the pg_ts_dict table to get the list of dictionaries (we can't scan a dshash_table). I don't have a good solution yet. I just had the thought of bringing back max_shared_dictionaries_size. Then we can unload dictionaries (and scan the pg_ts_dict table) that were last accessed a long time ago once we reach the size limit. We can't set an exact size limit, since we can't release the memory immediately. So max_shared_dictionaries_size could be renamed to shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has unlimited space for dictionaries. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Mon, Jan 21, 2019 at 19:42, Arthur Zakirov <a.zakirov@postgrespro.ru>: > > On 21.01.2019 17:56, Tomas Vondra wrote: > > I wonder if we could devise some simple cache eviction policy. We don't > > have any memory limit GUC anymore, but maybe we could use unload > > dictionaries that were unused for sufficient amount of time (a couple of > > minutes or so). Of course, the question is when exactly would it happen > > (it seems far too expensive to invoke on each dict access, and it should > > happen even when the dicts are not accessed at all). > > Yes, I thought about such feature too. Agree, it could be expensive > since we need to scan pg_ts_dict table to get list of dictionaries (we > can't scan dshash_table). > > I haven't a good solution yet. I just had a thought to return > max_shared_dictionaries_size. Then we can unload dictionaries (and scan > the pg_ts_dict table) that were accessed a lot time ago if we reached > the size limit. > We can't set exact size limit since we can't release the memory > immediately. So max_shared_dictionaries_size can be renamed to > shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has > unlimited space for dictionaries. I want to propose cleaning up segments during vacuum/autovacuum. I'm not aware of the policy on cleaning up objects other than relations during vacuum/autovacuum. Could it be a good idea? Vacuum might unload dictionaries when the total size of loaded dictionaries exceeds a threshold. When that happens, vacuum scans the loaded dictionaries and unloads (unpins segments and removes hash table entries) those dictionaries which aren't mapped into any backend process anymore (possible because dsm_pin_segment() was called). max_shared_dictionaries_size can be renamed to shared_dictionaries_cleanup_threshold. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 1/22/19 7:36 PM, Arthur Zakirov wrote: > Mon, Jan 21, 2019 at 19:42, Arthur Zakirov <a.zakirov@postgrespro.ru>: >> >> On 21.01.2019 17:56, Tomas Vondra wrote: >>> I wonder if we could devise some simple cache eviction policy. We don't >>> have any memory limit GUC anymore, but maybe we could use unload >>> dictionaries that were unused for sufficient amount of time (a couple of >>> minutes or so). Of course, the question is when exactly would it happen >>> (it seems far too expensive to invoke on each dict access, and it should >>> happen even when the dicts are not accessed at all). >> >> Yes, I thought about such feature too. Agree, it could be expensive >> since we need to scan pg_ts_dict table to get list of dictionaries (we >> can't scan dshash_table). >> >> I haven't a good solution yet. I just had a thought to return >> max_shared_dictionaries_size. Then we can unload dictionaries (and scan >> the pg_ts_dict table) that were accessed a lot time ago if we reached >> the size limit. >> We can't set exact size limit since we can't release the memory >> immediately. So max_shared_dictionaries_size can be renamed to >> shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has >> unlimited space for dictionaries. > > I want to propose to clean up segments during vacuum/autovacuum. I'm not > aware of the politics of cleaning up objects besides relations during > vacuum/autovacuum. Could be it a good idea? > I doubt that's a good idea, for a couple of reasons. For example, would it be bound to autovacuum on a particular object or would it happen as part of each vacuum run? If the dict cleanup happens only when vacuuming a particular object, then which one? If it happens on each autovacuum run, then that may easily be far too frequent (it essentially makes the cases with too frequent autovacuum runs even worse). But also, what happens when there is only minimal write activity and thus no regular autovacuum runs? Surely we should still do the dict cleanup. > Vacuum might unload dictionaries when total size of loaded dictionaries > exceeds a threshold. When it happens vacuum scans loaded dictionaries and > unloads (unpins segments and removes hash table entries) those dictionaries > which isn't mapped to any backend process (it happens because > dsm_pin_segment() is called) anymore. > Then why tie that to autovacuum at all? Why not just make it part of loading the dictionary? > max_shared_dictionaries_size can be renamed to > shared_dictionaries_cleanup_threshold. > That really depends on what exactly the threshold does. If it only triggers cleanup but does not enforce maximum amount of memory used by dictionaries, then this name seems OK. If it ensures a maximum amount of memory, the max_..._size name would be better. I think there are essentially two ways: (a) Define the max amount of memory available for shared dictionaries, and come up with an eviction algorithm. This will be tricky, because when the frequently-used dictionaries need a bit more memory than the limit, this will result in thrashing (evict+load over and over). (b) Define what "unused" means for dictionaries, and unload dictionaries that become unused. For example, we could track timestamp of the last time each dict was used, and decide that dictionaries unused for 5 or more minutes are unused. And evict those. The advantage of (b) is that it adapts automatically, more or less. When you have a bunch of frequently used dictionaries, the amount of shared memory increases.
If you stop using them, it decreases after a while. And rarely used dicts won't force eviction of the frequently used ones. cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 22.01.2019 22:17, Tomas Vondra wrote: > On 1/22/19 7:36 PM, Arthur Zakirov wrote: >> max_shared_dictionaries_size can be renamed to >> shared_dictionaries_cleanup_threshold. > > That really depends on what exactly the threshold does. If it only > triggers cleanup but does not enforce maximum amount of memory used by > dictionaries, then this name seems OK. If it ensures max amount of > memory, the max_..._size name would be better. Yep, I thought about the first approach. > I think there are essentially two ways: > > (a) Define max amount of memory available for shared dictionarires, and > come up with an eviction algorithm. This will be tricky, because when > the frequently-used dictionaries need a bit more memory than the limit, > this will result in trashing (evict+load over and over). > > (b) Define what "unused" means for dictionaries, and unload dictionaries > that become unused. For example, we could track timestamp of the last > time each dict was used, and decide that dictionaries unused for 5 or > more minutes are unused. And evict those. > > The advantage of (b) is that it adopts automatically, more or less. When > you have a bunch of frequently used dictionaries, the amount of shared > memory increases. If you stop using them, it decreases after a while. > And rarely used dicts won't force eviction of the frequently used ones. Thanks for sharing your ideas, Tomas. Unfortunately I won't manage to develop a new version of the patch before the end of the commitfest due to lack of time. I'll think about the second approach. Tracking the timestamp of the last time a dict was used may be difficult, though, and may slow down FTS... I moved the patch to the next commitfest. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 01.02.2019 12:09, Arthur Zakirov wrote: > Thanks for sharing your ideas, Tomas. Unfortunately I won't manage to > develop new version of the patch till the end of the commitfest due to > lack of time. I'll think about the second approach. Tracking timestamp > of the last time a dict was used may be difficult though and may slow > down FTS... > > I move the path to the next commitfest. Oh, it seems it can't be moved to the next commitfest from the "Waiting on Author" status. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Tue, Jan 22, 2019 at 2:17 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I think there are essentially two ways: > > (a) Define max amount of memory available for shared dictionarires, and > come up with an eviction algorithm. This will be tricky, because when > the frequently-used dictionaries need a bit more memory than the limit, > this will result in trashing (evict+load over and over). > > (b) Define what "unused" means for dictionaries, and unload dictionaries > that become unused. For example, we could track timestamp of the last > time each dict was used, and decide that dictionaries unused for 5 or > more minutes are unused. And evict those. > > The advantage of (b) is that it adopts automatically, more or less. When > you have a bunch of frequently used dictionaries, the amount of shared > memory increases. If you stop using them, it decreases after a while. > And rarely used dicts won't force eviction of the frequently used ones. +1 for (b). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-02-01 09:40:44 -0500, Robert Haas wrote: > On Tue, Jan 22, 2019 at 2:17 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > I think there are essentially two ways: > > > > (a) Define max amount of memory available for shared dictionarires, and > > come up with an eviction algorithm. This will be tricky, because when > > the frequently-used dictionaries need a bit more memory than the limit, > > this will result in trashing (evict+load over and over). > > > > (b) Define what "unused" means for dictionaries, and unload dictionaries > > that become unused. For example, we could track timestamp of the last > > time each dict was used, and decide that dictionaries unused for 5 or > > more minutes are unused. And evict those. > > > > The advantage of (b) is that it adopts automatically, more or less. When > > you have a bunch of frequently used dictionaries, the amount of shared > > memory increases. If you stop using them, it decreases after a while. > > And rarely used dicts won't force eviction of the frequently used ones. > > +1 for (b). This patch has been waiting on author for two weeks, the commitfest has ended, and there's substantial work needed. Therefore I'm marking the patch as returned with feedback. Please resubmit a new version, once the feedback has been addressed. Greetings, Andres Freund
Hello, I've created a new commitfest entry, since the previous entry was closed with status "Returned with feedback": https://commitfest.postgresql.org/22/2007/ I attached a new version of the patch. There are changes only in 0003-Retrieve-shared-location-for-dict-v18.patch. I added a reference counter to the shared hash table's dictionary entries. It is necessary to avoid memory bloat: shared hash table entries must be deleted if there are a lot of ALTER and DROP TEXT SEARCH DICTIONARY commands. The previous version of the patch released unused DSM segments but left shared hash table entries untouched. There was a refcnt before: https://www.postgresql.org/message-id/20180403115720.GA7450%40zakirov.localdomain But I didn't fully understand how on_dsm_detach() works. On 22.01.2019 22:17, Tomas Vondra wrote: > I think there are essentially two ways: > > (a) Define max amount of memory available for shared dictionarires, and > come up with an eviction algorithm. This will be tricky, because when > the frequently-used dictionaries need a bit more memory than the limit, > this will result in trashing (evict+load over and over). > > (b) Define what "unused" means for dictionaries, and unload dictionaries > that become unused. For example, we could track timestamp of the last > time each dict was used, and decide that dictionaries unused for 5 or > more minutes are unused. And evict those. > > The advantage of (b) is that it adopts automatically, more or less. When > you have a bunch of frequently used dictionaries, the amount of shared > memory increases. If you stop using them, it decreases after a while. > And rarely used dicts won't force eviction of the frequently used ones. I'm working on the (b) approach. I thought about a priority queue structure. There is no such ready-made structure in the PostgreSQL sources except binaryheap.c, but it isn't designed for concurrent algorithms. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Wed, Feb 20, 2019 at 9:33 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > I'm working on the (b) approach. I thought about a priority queue > structure. There no such ready structure within PostgreSQL sources > except binaryheap.c, but it isn't for concurrent algorithms. I don't see why you need a priority queue or, really, any other fancy data structure. It seems like all you need to do is somehow set it up so that a backend which doesn't use a dictionary for a while will dsm_detach() the segment. Eventually an unused dictionary will have no remaining references and will go away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21.02.2019 15:45, Robert Haas wrote: > On Wed, Feb 20, 2019 at 9:33 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: >> I'm working on the (b) approach. I thought about a priority queue >> structure. There no such ready structure within PostgreSQL sources >> except binaryheap.c, but it isn't for concurrent algorithms. > > I don't see why you need a priority queue or, really, any other fancy > data structure. It seems like all you need to do is somehow set it up > so that a backend which doesn't use a dictionary for a while will > dsm_detach() the segment. Eventually an unused dictionary will have > no remaining references and will go away. Hm, I didn't think of it that way. I agree that using a new data structure is overengineering. In the current patch all DSM segments are pinned (that is, dsm_pin_segment() is called). So a dictionary lives in shared memory even if nobody has a reference to it. I thought about periodically scanning the shared hash table and unpinning old and unused dictionaries. But this approach needs a sequential scan facility for dshash. Happily, there is a patch from Kyotaro-san (the v16-0001-sequential-scan-for-dshash.patch part): https://www.postgresql.org/message-id/20190221.160555.191280262.horiguchi.kyotaro@lab.ntt.co.jp Your approach looks simpler. It is only necessary to periodically scan the dictionaries' cache hash table and to not call dsm_pin_segment() when a DSM segment is initialized. It also means that a dictionary stays loaded in DSM only while there is a backend attached to the dictionary's DSM segment. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On Thu, Feb 21, 2019 at 8:28 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > Your approach looks simpler. It is necessary just to periodically scan > dictionaries' cache hash table and not call dsm_pin_segment() when a DSM > segment initialized. It also means that a dictionary is loaded into DSM > only while there is a backend which attached the dictionary's DSM. Right. I think that having a central facility that tries to decide whether a dictionary should be kept in shared memory, e.g. based on a cache size parameter, isn't likely to work well. The problem is that if we make a decision that a dictionary should be evicted because it's causing us to exceed the cache size threshold, then we have no way to implement that decision. We can't force other backends to remove the mapping immediately, nor can we really bound the time before they respond to a request to unmap it. They might be in the middle of using it. So I think it's better to have each backend locally make a decision about when that particular backend no longer needs the dictionary, and then let the system automatically clean up the ones that are needed by nobody. Perhaps a better approach still would be to do what Andres proposed back in March: #> Is there any chance we can instead can convert dictionaries into a form #> we can just mmap() into memory? That'd scale a lot higher and more #> dynamicallly? The current approach inherently involves double-buffering: you've got the filesystem cache containing the data read from disk, and then the DSM containing the converted form of the data. Having something that you could just mmap() would avoid that, plus it would become a lot less critical to keep the mappings around. You could probably just have individual queries mmap() it for as long as they need it and then tear out the mapping when they finish executing; keeping the mappings across queries likely wouldn't be too important in this case. The downside is that you'd probably need to teach resowner.c about mappings created via mmap() so that you don't leak mappings on an abort, but that's probably not a crazy difficult problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Perhaps a better approach still would be to do what Andres proposed > back in March: > #> Is there any chance we can instead can convert dictionaries into a form > #> we can just mmap() into memory? That'd scale a lot higher and more > #> dynamicallly? That seems awfully attractive. I was about to question whether we could assume that mmap() works everywhere, but it's required by SUSv2 ... and if anybody has anything sufficiently lame for it not to work, we could fall back on malloc-a-hunk-of-memory-and-read-in-the-file. We'd need a bunch of work to design a position-independent binary representation for dictionaries, and then some tool to produce disk files containing that, so this isn't exactly a quick route to a solution. On the other hand, it isn't sounding like the current patch is getting close to committable either. (Actually, I guess you need a PI representation of a dictionary to put it in a DSM either, so presumably that part of the work is done already; although we might also wish for architecture independence of the disk files, which we probably don't have right now.) regards, tom lane
On February 21, 2019 10:08:00 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Robert Haas <robertmhaas@gmail.com> writes: >> Perhaps a better approach still would be to do what Andres proposed >> back in March: > >> #> Is there any chance we can instead can convert dictionaries into a >form >> #> we can just mmap() into memory? That'd scale a lot higher and >more >> #> dynamicallly? > >That seems awfully attractive. I was about to question whether we >could >assume that mmap() works everywhere, but it's required by SUSv2 ... and >if anybody has anything sufficiently lame for it not to work, we could >fall back on malloc-a-hunk-of-memory-and-read-in-the-file. > >We'd need a bunch of work to design a position-independent binary >representation for dictionaries, and then some tool to produce disk >files >containing that, so this isn't exactly a quick route to a solution. >On the other hand, it isn't sounding like the current patch is getting >close to committable either. > >(Actually, I guess you need a PI representation of a dictionary to >put it in a DSM either, so presumably that part of the work is >done already; although we might also wish for architecture independence >of the disk files, which we probably don't have right now.) That's what I was pushing for ages ago... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 21.02.2019 19:13, Robert Haas wrote: > So I think it's better to have each backend locally make a decision > about when that particular backend no longer needs the dictionary, and > then let the system automatically clean up the ones that are needed by > nobody. Yep, it wouldn't be hard to implement. > Perhaps a better approach still would be to do what Andres proposed > back in March: > > #> Is there any chance we can instead can convert dictionaries into a form > #> we can just mmap() into memory? That'd scale a lot higher and more > #> dynamicallly? > > The current approach inherently involves double-buffering: you've got > the filesystem cache containing the data read from disk, and then the > DSM containing the converted form of the data. Having something that > you could just mmap() would avoid that, plus it would become a lot > less critical to keep the mappings around. You could probably just > have individual queries mmap() it for as long as they need it and then > tear out the mapping when they finish executing; keeping the mappings > across queries likely wouldn't be too important in this case. > > The downside is that you'd probably need to teach resowner.c about > mappings created via mmap() so that you don't leak mappings on an > abort, but that's probably not a crazy difficult problem. It seems to me Tom and Andres also vote for the mmap() approach. I think I need to look closely at mmap(). I've labeled the patch as 'v13'. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 25.02.2019 14:33, Arthur Zakirov wrote: > It seems to me Tom and Andres also vote for the mmap() approach. I think > I need to look closely at the mmap(). > > I've labeled the patch as 'v13'. Unfortunately I haven't come up with a new patch yet, so I marked the entry as "Returned with feedback" for now. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Hello hackers, On 25.02.2019 14:33, Arthur Zakirov wrote: >> The current approach inherently involves double-buffering: you've got >> the filesystem cache containing the data read from disk, and then the >> DSM containing the converted form of the data. Having something that >> you could just mmap() would avoid that, plus it would become a lot >> less critical to keep the mappings around. You could probably just >> have individual queries mmap() it for as long as they need it and then >> tear out the mapping when they finish executing; keeping the mappings >> across queries likely wouldn't be too important in this case. >> >> The downside is that you'd probably need to teach resowner.c about >> mappings created via mmap() so that you don't leak mappings on an >> abort, but that's probably not a crazy difficult problem. > > It seems to me Tom and Andres also vote for the mmap() approach. I think > I need to look closely at the mmap(). > > I've labeled the patch as 'v13'. I've attached a new version of the patch. Note that it is in a WIP state for now and there are unresolved issues, which are listed at the end of the email. The patch implements a simple approach using mmap(). I also want to be sure that I'm going in the right direction, so feel free to send feedback. On every dispell_init() call, Postgres checks whether there is a shared dictionary file in the pg_shdict directory; if there is, it calls mmap(). If there is no such file, it compiles the dictionary, writes it to the file, and calls mmap(). dispell_lexize() works with the already mmap'ed dictionary. So it doesn't mmap() for each individual query, as Robert proposed above, because that approach halves performance (I tested with ts_lexize() calls via pgbench). Tests ----- As in: https://www.postgresql.org/message-id/20180124172039.GA11210%40zakirov.localdomain I performed tests. There are no big differences in the numbers, except that files are now created in the pg_shdict directory: czech_hunspell - 9.2 MB file english_hunspell - 1.9 MB file french_hunspell - 4.6 MB file TODO ---- - Improve the documentation and comments. - Eliminate shared dictionary files after DROP/ALTER calls. It is necessary to come up with some fancy file name. For now it is just the OID of the dictionary. It would be possible to add the database OID, xmin, or xmax to the file name. - We can't remove the file right away after DROP/ALTER. Is it a good idea to use autovacuum here? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
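For reference, a minimal standalone sketch of the load path described above, outside of Postgres; map_dict_file() is an invented name, and the position-independent on-disk format is taken as given. If the file is missing, the caller would compile the dictionary, write the file, and retry:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an already-compiled dictionary file read-only; returns NULL if the
 * file doesn't exist (or on error), so the caller can build it first. */
static void *
map_dict_file(const char *path, size_t *size)
{
    struct stat st;
    void       *base;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }
    base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (base == MAP_FAILED)
        return NULL;
    *size = (size_t) st.st_size;
    return base;                /* munmap(base, *size) when done */
}

Because the mapping is read-only and file-backed, every backend shares the same physical pages through the page cache, which is exactly the "no double-buffering" property discussed above.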
Is 0001 a bugfix? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 5, 2019 at 8:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Is 0001 a bugfix? Yep, it is indeed a bugfix and can be applied independently. The fix allocates temporary strings using the temporary context Conf->buildCxt. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company