Thread: [PROPOSAL] Shared Ispell dictionaries

[PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello, hackers!

Introduction
------------

I'm going to implement a patch which will store Ispell dictionaries in shared memory.

There is an extension shared_ispell [1], developed by Tomas Vondra. But it is a bad candidate for inclusion in
contrib, because it has to know a lot about the IspellDict struct in order to copy it into shared memory.

Why
---

A shared Ispell dictionary gives the following improvements:
- less memory is consumed - currently an Ispell dictionary is loaded into memory for every backend, and some
dictionaries require more than 100MB
- there is no overhead during the first call of a full text search function (such as to_tsvector() or to_tsquery())

Implementation
--------------

It is necessary to change all structures related to IspellDict: SPNode, AffixNode, AFFIX, CMPDAffix, and IspellDict
itself. For this reason none of them should use pointers. The other structures are used only during dictionary building.

It would be good to store the StopList struct in shared memory too.

All fields of the IspellDict struct which are used only during dictionary building will be moved into a new
IspellDictBuild struct to decrease the needed shared memory size. They will be released with buildCxt.

Each dictionary will be stored in its own DSM segment. Structures for regular expressions won't be stored in shared
memory; they are compiled for every backend.
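
A rough sketch of the per-dictionary DSM handling (store_dict_in_dsm is a
hypothetical helper name, not the patch's actual code):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Copy an already built, pointer-free dictionary into its own segment. */
    static dsm_handle
    store_dict_in_dsm(const void *dict_data, Size dict_size)
    {
        dsm_segment *seg = dsm_create(dict_size, 0);

        memcpy(dsm_segment_address(seg), dict_data, dict_size);
        dsm_pin_segment(seg);       /* survive detach of the creating backend */
        dsm_pin_mapping(seg);       /* stay mapped for this backend's lifetime */

        /* other backends can dsm_attach() using this handle */
        return dsm_segment_handle(seg);
    }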

The patch will be ready and added into the 2018-03 commitfest.

Thank you for your attention. Any thoughts?


1 - github.com/tvondra/shared_ispell or github.com/postgrespro/shared_ispell

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Alvaro Herrera
Date:
Arthur Zakirov wrote:

> Implementation
> --------------
> 
> It is necessary to change all structures related to IspellDict:
> SPNode, AffixNode, AFFIX, CMPDAffix, and IspellDict itself. For this
> reason none of them should use pointers. The other structures are used
> only during dictionary building.

So what are you going to use instead?

> It would be good to store the StopList struct in shared memory too.

Sure (probably a separate patch though).

> All fields of the IspellDict struct, which are used only during dictionary
> building, will be moved into a new IspellDictBuild struct to decrease the
> needed shared memory size. They will be released with buildCxt.
>
> Each dictionary will be stored in its own dsm segment.

All that sounds reasonable.

> The patch will be ready and added into the 2018-03 commitfest.

So this will be a large patch not submitted to 2018-01?  Depending on
size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
too late.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Pavel Stehule
Date:


2017-12-26 17:55 GMT+01:00 Alvaro Herrera <alvherre@alvh.no-ip.org>:
Arthur Zakirov wrote:

> Implementation
> --------------
>
> It is necessary to change all structures related to IspellDict:
> SPNode, AffixNode, AFFIX, CMPDAffix, and IspellDict itself. For this
> reason none of them should use pointers. The other structures are used
> only during dictionary building.

So what are you going to use instead?

> It would be good to store the StopList struct in shared memory too.

Sure (probably a separate patch though).

> All fields of the IspellDict struct, which are used only during dictionary
> building, will be moved into a new IspellDictBuild struct to decrease the
> needed shared memory size. They will be released with buildCxt.
>
> Each dictionary will be stored in its own dsm segment.

All that sounds reasonable.

> The patch will be ready and added into the 2018-03 commitfest.

So this will be a large patch not submitted to 2018-01?  Depending on
size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
too late.

Tomas had some workable patches related to this topic

Regards

Pavel

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Thank you for your feedback.

On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> So what are you going to use instead?

For example, AffixNode and AffixNodeData represent the prefix tree of an
affix list. They are currently accessed through the Suffix and Prefix
pointers of the IspellDict struct. Instead, all affix nodes should be placed
into an array and accessed by offset: the Suffix array goes first, the
Prefix array after it. AffixNodeData will access its child nodes by an
offset too.

The AffixNodeData struct has an array of pointers to AFFIX structs. This
array, with all the AFFIX data, can be stored within AffixNodeData. Or
AffixNodeData can have an array of indexes into a single AFFIX array,
which is stored within IspellDict before or after Suffix and Prefix.

The same goes for the prefix tree of the word list, represented by the
SPNode struct. It might be stored as an array after the Prefix array.

AffixData and CompoundAffix arrays go after them.

To allocate IspellDict in this case it is necessary to calculate the needed
memory size. I think the arrays mentioned above will be built first and then
memcpy'ed into IspellDict, if that doesn't take too much time. A sketch of
the offset-based layout is below.
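
A minimal sketch of such a pointer-free node (field and macro names are
illustrative only, not taken from the patch):

    /*
     * All nodes live in one contiguous allocation. Children are referenced
     * by byte offsets from the start of the dictionary instead of pointers,
     * so the block stays valid wherever it is mapped.
     */
    typedef struct AffixNodeData
    {
        uint32      val;            /* character for this tree level */
        uint32      naff;           /* number of affixes attached here */
        uint32      affoffset;      /* offset of this node's AFFIX indexes */
        uint32      nodeoffset;     /* offset of the child node, 0 if none */
    } AffixNodeData;

    /* Resolve an offset relative to the start of the dictionary block. */
    #define NodeArrayGet(dict, offset) \
        ((offset) == 0 ? NULL : (void *) ((char *) (dict) + (offset)))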

Hope it makes sense and is reasonable.

> 
> So this will be a large patch not submitted to 2018-01?  Depending on
> size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
> too late.
> 

Oh, I see. I will try to prepare the patch while 2018-01 is open.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Alvaro Herrera
Date:
Arthur Zakirov wrote:

> On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> > So what are you going to use instead?
>
> [ ... ]
>
> To allocate IspellDict in this case it is necessary to calculate the needed
> memory size. I think the arrays mentioned above will be built first and then
> memcpy'ed into IspellDict, if that doesn't take too much time.

OK, that sounds sensible at first blush.  If there are many processes
concurrently doing text searches, then the amount of memory saved may
be large enough to justify the additional processing (more so if it's
just one more memcpy cycle).

I hope that there is some way to cope with the ispell data changing
underneath -- maybe you'll need some sort of RCU?

> > So this will be a large patch not submitted to 2018-01?  Depending on
> > size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
> > too late.
> 
> Oh, I see. I will try to prepare the patch while 2018-01 is open.

It isn't necessary that the patch you present to 2018-01 is final and
complete (so don't kill yourself to achieve that) -- a preliminary patch
that reviewers can comment on is enough, as long as the final patch you
present to 2018-03 is not *too* different.  But any medium-large patch
whose first post is to the last commitfest of a cycle is likely to be
thrown out to the next cycle's first commitfest very quickly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
> 
> Tomas had some workable patches related to this topic
> 

Tomas, do you plan to propose it?

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello, hackers,

On Tue, Dec 26, 2017 at 07:48:27PM +0300, Arthur Zakirov wrote:
> The patch will be ready and added into the 2018-03 commitfest.
> 

I attached the patch itself.

0001-Fix-ispell-memory-handling.patch:

Some strings are allocated via compact_palloc0(), but they are not
persistent, so they should be allocated in a temporary memory context.
Also, a couple of strings are not released if the .aff file has the new format.

0002-Retreive-shmem-location-for-ispell.patch:

Adds the ispell_shmem_location() function, which looks up the location of a
dictionary using its .dict and .aff file names. If the location hasn't been
allocated in DSM earlier, it is allocated. A shared hash table is used to
search for the location.

The maximum number of hash table elements is NUM_DICTIONARIES=20 for now. It
would be better to use a GUC variable. Also, if the number of elements
reaches the limit, it would be good to fall back to the backend's local
memory instead of shared memory. A sketch of the lookup flow is below.
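
The lookup flow could be sketched like this (TsearchDictEntry, TsearchLock,
dict_table and build_dict_segment are assumed names, not the patch's exact
code):

    typedef struct
    {
        char        dictfile[MAXPGPATH];    /* hash key, part 1 */
        char        afffile[MAXPGPATH];     /* hash key, part 2 */
        dsm_handle  handle;                 /* segment holding the built dict */
    } TsearchDictEntry;

    static void *
    ispell_shmem_location(const char *dictfile, const char *afffile)
    {
        TsearchDictEntry key;
        TsearchDictEntry *entry;
        bool        found;

        memset(&key, 0, sizeof(key));
        strlcpy(key.dictfile, dictfile, MAXPGPATH);
        strlcpy(key.afffile, afffile, MAXPGPATH);

        LWLockAcquire(TsearchLock, LW_EXCLUSIVE);
        entry = hash_search(dict_table, &key, HASH_ENTER, &found);
        if (!found)
            entry->handle = build_dict_segment(dictfile, afffile);
        LWLockRelease(TsearchLock);

        /* map the segment into this backend and return its address */
        return dsm_segment_address(dsm_attach(entry->handle));
    }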

0003-Store-ispell-structures-in-shmem.patch:

Introduces the IspellDictBuild and IspellDictData structures and removes the
IspellDict structure. IspellDictBuild is used while building the dictionary
within the dispell_build() function, if the dictionary hasn't been allocated
in DSM earlier. IspellDictBuild has a pointer to the IspellDictData
structure, which will be filled with persistent data.

After building the dictionary, IspellDictData is copied into its DSM
location and the temporary data of IspellDictBuild is released.

All prefix trees are stored as flat arrays now. Those arrays are allocated
and managed via the NodeArray struct. The required node can be retrieved by
node offset. The AffixData and Affix arrays have an additional offset array
to retrieve an element by index.

The Affix field (array of AFFIX) of IspellDictBuild is persistent data too,
but it is constructed as a temporary array first, because the Affix array
needs to be sorted via qsort() within NISortAffixes().

So IspellDictData stores (see the sketch below):
- AffixData - array of strings, accessed via AffixDataOffset
- Affix - array of AFFIX, accessed via AffixOffset
- DictNodes, PrefixNodes, SuffixNodes - prefix trees as plain arrays
- CompoundAffix - array of CMPDAffix, accessed sequentially
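
That layout might look roughly like the following (a sketch; the exact field
names in the patch may differ):

    /* One contiguous, pointer-free block; each section is located by its
     * offset from the start of the data area. */
    typedef struct IspellDictData
    {
        Size        AffixDataOffset;        /* string array + offset table */
        Size        AffixOffset;            /* AFFIX array + offset table */
        Size        DictNodesOffset;        /* flattened word prefix tree */
        Size        PrefixNodesOffset;      /* flattened prefix-affix tree */
        Size        SuffixNodesOffset;      /* flattened suffix-affix tree */
        Size        CompoundAffixOffset;    /* CMPDAffix array, scanned in order */
        char        data[FLEXIBLE_ARRAY_MEMBER];
    } IspellDictData;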

I had to remove compact_palloc0(), added by Pavel in
3e5f9412d0a818be77c974e5af710928097b91f3. The Ispell dictionary doesn't need
such allocation anymore; it was used to allocate a lot of small chunks. I
will definitely check the performance of the Czech dictionary.


There are still issues to resolve:
- add the GUC variable for the hash table limit
- fix bugs
- improve comments
- performance testing

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sun, Dec 31, 2017 at 06:28:13PM +0300, Arthur Zakirov wrote:
> 
> There are still issues to resolve:
> - add the GUC variable for the hash table limit
> - fix bugs
> - improve comments
> - performance testing
> 

Here is the second version of the patch.

0002-Retreive-shmem-location-for-ispell-v2.patch:

Fixed some bugs and added the GUC variable "shared_dictionaries".

Added documentation for it. I'm not sure about the order of configuration
parameters in section "19.4.1. Memory". Now "shared_dictionaries" goes after
"shared_buffers". Maybe it would be good to make a patch which sorts the
parameters in alphabetical order?

0003-Store-ispell-structures-in-shmem-v2.patch:

Fixed some bugs; regression tests pass now. I added more comments and fixed
old ones. I also tested with Hunspell dictionaries [1]. They work well too.

Results of performance testing of Ispell and Hunspell dictionaries will
be ready soon.


1 - github.com/postgrespro/hunspell_dicts

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
Hi Arthur,

Sorry for the delay, I somehow missed this thread ...

On 12/27/2017 10:20 AM, Arthur Zakirov wrote:
> On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>>
>> Tomas had some workable patches related to this topic
>>
> 
> Tomas, do you plan to propose it?
> 

I believe Pavel was referring to this extension:

    https://github.com/tvondra/shared_ispell

I wasn't going to submit that as an in-core solution, but I'm happy you're
making improvements in that direction. I'll take a look at your patch
shortly.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Thank you for your answer.

On Mon, Jan 08, 2018 at 06:12:37PM +0100, Tomas Vondra wrote:
> 
> I believe Pavel was referring to this extension:
> 
>     https://github.com/tvondra/shared_ispell
> 

Oh, understood.

> I wasn't going to submit that as in-core solution, but I'm happy you're
> making improvements in that direction. I'll take a look at your patch
> shortly.
> 

There is the second version of the patch above. But I've noticed a performance regression in ts_lexize() and I will
try to find where the overhead hides.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
Hi Arthur,

I've done some initial review of the patch today, and here are some
thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with
MemoryContextStrdup, but I see spell.c already has two macros to
allocate memory in the buildCxt. What about adding tmpstrdup to copy a
string into the context? I admit this is mostly nitpicking though.
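
Something along these lines, next to the existing build-context macros in
spell.c (a sketch; tmpstrdup is only the suggested name):

    /* existing allocators tied to the dictionary build context */
    #define tmpalloc(sz)   MemoryContextAlloc(Conf->buildCxt, (sz))
    #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))

    /* suggested companion: copy a string into the build context */
    #define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))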


0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear it's a maximum number of
something, just like "max_parallel_workers" and other such GUCs. When I
first saw "shared_dictionaries" in the patch I thought it's a list of
dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not
necessarily required for v1, but perhaps useful for shaping it).


1) Why do we actually need the limit? Is it really necessary / useful?

When I wrote shared_ispell back in 2012, all we had were fixed segments
allocated at start, and so similar limits were a built-in restriction.
But after the DSM stuff was introduced I imagined it would not be necessary.

I realize the current implementation requires that, because the hash
table is still created in an old-style memory context (and only the
dictionaries are in DSM segments).

But that seems fairly straightforward to fix by maintaining the hash
table in a separate DSM segment too. So lookup of the dictionary DSM
would have to first check what the current hash table segment is, and
then continue as now.

I'm not sure if dynahash can live in a DSM segment, but we already have
a hash table that supports that in dshash.c (which is also concurrent,
although I'm not sure if that's a major advantage for this use case).
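
For reference, a dictionary registry on top of dshash.c might be set up like
this (a sketch; the key/entry layout and the tranche id are assumptions):

    #include "lib/dshash.h"

    typedef struct
    {
        Oid         dictid;     /* key: the dictionary's OID */
        dsm_handle  handle;     /* where the preprocessed dict lives */
    } DictEntry;

    static const dshash_parameters dict_table_params = {
        sizeof(Oid),                    /* key size */
        sizeof(DictEntry),              /* entry size */
        dshash_memcmp,                  /* compare keys by memcmp */
        dshash_memhash,                 /* hash keys by contents */
        LWTRANCHE_FIRST_USER_DEFINED    /* placeholder tranche id */
    };

    /* one backend: table = dshash_create(area, &dict_table_params, NULL);
     * the others:  table = dshash_attach(area, &dict_table_params,
     *                                    table_handle, NULL); */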


2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current limit
may not be the droid we're looking for, for a couple of reasons.

Firstly, currently it only matters during startup, when the dynahash is
created. So to change the limit (e.g. to increase it) you actually have
to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the values to get proper behavior. For example,
it's not using the value directly but some higher value of 2^N form. Which
means the limit may not be enforced immediately when hitting the GUC value,
but unexpectedly somewhat later.

And finally, I believe this is log-worthy - right now the dictionary
load silently switches to backend memory (thus incurring all the parsing
overhead). This certainly deserves at least a log message.

Actually, I'm not sure "number of dictionaries" is a particularly useful
limit in the first place - that's not a number I really care about. But
I do care about amount of memory consumed by the loaded dictionaries.

So I do suggest adding such "max memory for shared dictionaries" limit.
I'm not sure we can enforce it strictly, because when deciding where to
load the dict we haven't parsed it yet and so don't know how much memory
will be required. But I believe a lazy check should be fine (load it,
and if we exceeded the total memory disable loading additional ones).
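
That lazy check could be as simple as this (illustrative names only;
build_dictionary, shared_dicts and copy_dict_to_dsm are assumptions):

    /* After parsing we finally know the size; only then decide placement. */
    static void *
    load_dictionary(const char *dictfile, const char *afffile)
    {
        void       *dict;
        Size        dict_size = build_dictionary(dictfile, afffile, &dict);

        if (shared_dicts->total_size + dict_size >
            max_shared_dictionaries_size * 1024L)
        {
            elog(LOG, "text search dictionary \"%s\" loaded into backend "
                 "memory: shared dictionary memory limit exceeded", dictfile);
            return dict;        /* keep the backend-local copy */
        }

        shared_dicts->total_size += dict_size;
        return copy_dict_to_dsm(dict, dict_size);
    }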


3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter if it's the number of
dictionaries or memory used by them). How do I resolve that without
restarting the database? How do I unload a dictionary (which may be
unused) from shared memory?

    ALTER TEXT SEARCH DICTIONARY x UNLOAD


4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the
files, or something like that). How do I reload the dictionary? Do I
have to restart the server, DROP/CREATE everything again, or what?

What about instead having something like this:

    ALTER TEXT SEARCH DICTIONARY x RELOAD


5) Actually, how do I list currently loaded dictionaries (and how much
memory they use in the shared memory)?


6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may
be loaded into shared memory, and which should be always loaded into
local backend memory. That is, something like

    CREATE TEXT SEARCH DICTIONARY x (
        TEMPLATE = ispell,
        DictFile = czech,
        AffFile = czech,
        StopWords = czech,
        SharedMemory = true/false (default: false)
    );

because otherwise the dictionaries will compete for shared memory, and
it's unclear which of them will get loaded. For a server with a single
application that may not be a huge issue, but think about servers shared
by multiple applications, etc.

In the extension this was achieved kinda explicitly by definition of a
separate 'shared_ispell' template, but if you modify the current one
that won't work, of course.


7) You mentioned you had to get rid of the compact_palloc0 - can you
elaborate a bit why that was necessary? Also, when benchmarking the
impact of this make sure to measure not only the time but also memory
consumption.

In fact, that was the main reason why Pavel implemented it in 2010,
because the czech dictionary takes quite a bit of memory, and without
the shared memory a copy was kept in every backend.

Of course, maybe that would be mostly irrelevant thanks to this patch
(due to changes to the representation and keeping just a single copy).


8) One more thing - I've noticed that the hash table uses this key:

    typedef struct
    {
        char        dictfile[MAXPGPATH];
        char        afffile[MAXPGPATH];
    } TsearchDictKey;

That is, full paths to the two files, and I'm not sure that's a very
good idea. Firstly, it's a bit wasteful (1kB per path). But more
importantly it means all dictionaries referencing the same files will
share the same chunk of shared memory - not only within a single
database, but across the whole cluster. That may lead to surprising
behavior, because e.g. unloading a dictionary in one database will
affect dictionaries in all other databases referencing the same files.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

Thank you Tomas for your review.

On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

I agree about tmpstrdup(). It will be consistent with tmpalloc().

> 1) Why do we actually need the limit? Is it really necessary / useful?
> ...
> I realize the current implementation requires that, because the hash
> table is still created in an old-style memory context (and only the
> dictionaries are in DSM segments).

Yes indeed. I tried to implement dynahash via DSM, but I failed. It
seems to me that dynahash can work only in an old-style memory context.

> I'm not sure if dynahash can live in a DSM segment, but we already have
> a hash table that supports that in dshash.c (which is also concurrent,
> although I'm not sure if that's a major advantage for this use case).

Thank you a lot for pointing me to dshash.c. I think this is just what we
need. I will try to use it in the new version of the patch.

> 2) Do we actually want/need some limits? Which ones?
> ...
> And finally, I believe this is log-worthy - right now the dictionary
> load silently switches to backend memory (thus incurring all the parsing
> overhead). This certainly deserves at least a log message.

I think such a log message may be useful, so I will add it too.

> So I do suggest adding such "max memory for shared dictionaries" limit.
> I'm not sure we can enforce it strictly, because when deciding where to
> load the dict we haven't parsed it yet and so don't know how much memory
> will be required. But I believe a lazy check should be fine (load it,
> and if we exceeded the total memory disable loading additional ones).

With dshash in DSM it seems that the shared_dictionaries GUC variable is not
needed anymore. I agree that another GUC variable (for example,
max_shared_dictionaries_size) may be useful. But maybe it's worth checking
the size of a dictionary only after actually compiling it? We could do the
following (see the sketch after this list):
- within ispell_shmem_location() build a dictionary using the callback
  function
- the callback function returns its size; if the dictionary doesn't fit
  into the remaining shared space, ispell_shmem_location() just returns a
  pointer to the palloc'ed and compiled dictionary without creating a
  DSM segment.
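
The callback contract might look like this (a sketch; the names and the
exact signature are assumptions, not the final API):

    /* Build callback: parses the files, fills *dict with a contiguous,
     * relocatable dictionary, and reports its size in bytes. */
    typedef Size (*ispell_build_callback) (const char *dictfile,
                                           const char *afffile,
                                           void **dict);

    /* ispell_shmem_location() would then decide placement:
     *   size = callback(dictfile, afffile, &dict);
     *   if it fits into the shared budget, copy dict into a new DSM
     *   segment and register it; otherwise return the backend-local
     *   palloc'ed copy as-is. */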

> 3) How do I unload a dictionary from the shared memory?
> ...
>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
>     ALTER TEXT SEARCH DICTIONARY x RELOAD

I think this syntax will be very useful not only for Ispell but for other
dictionaries too. The init_function of a text search template could return
pointers to C functions which unload and reload dictionaries. This approach
doesn't require changing the catalog by adding additional functions to the
template [1]. A sketch of this is below.

If the init_function of a template doesn't return such pointers, then the
template doesn't support unloading or reloading, and the UNLOAD and RELOAD
commands should throw an error if a user calls them for such a template.
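
For example, the init function might hand back an optional vtable of
callbacks (entirely hypothetical; nothing like this exists today):

    /* Optional operations a text search template can expose. */
    typedef struct TSTemplateOps
    {
        void        (*unload) (Oid dictid);    /* drop the shared copy */
        void        (*reload) (Oid dictid);    /* rebuild from source files */
    } TSTemplateOps;

    /* dispell_init() could fill and return this alongside its dictionary
     * data; UNLOAD/RELOAD would raise an error when the pointers are NULL. */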

> 5) Actually, how do I list currently loaded dictionaries (and how much
> memory they use in the shared memory)?

This may be very useful too. Such a function could be called
pg_get_shared_dictionaries().

> 6) What other restrictions would be useful?
> ...
>     CREATE TEXT SEARCH DICTIONARY x (
>         TEMPLATE = ispell,
>         DictFile = czech,
>         AffFile = czech,
>         StopWords = czech,
>         SharedMemory = true/false (default: false)
>     );

Hm, I didn't think about such an option. It would be a very simple way for a
user to control shared dictionaries.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

As I understand from commit 3e5f9412d0a818be77c974e5af710928097b91f3,
compact_palloc0() reduces the overhead of many pallocs for small chunks of
data. The persistent data of the patch should not suffer from this overhead,
because persistent data is allocated in big chunks.

But now I realize that we could keep compact_palloc0() for small chunks of
temporary data. So it may be worth preserving compact_palloc0().

> 8) One more thing - I've noticed that the hash table uses this key:
> ...
> That is, full paths to the two files, and I'm not sure that's a very
> good idea. Firstly, it's a bit wasteful (1kB per path). But more
> importantly it means all dictionaries referencing the same files will
> share the same chunk of shared memory - not only within a single
> database, but across the whole cluster. That may lead to surprising
> behavior, because e.g. unloading a dictionary in one database will
> affect dictionaries in all other databases referencing the same files.

Hm, indeed. It's worth using only file names instead of full paths. And it
is a good idea to use more information besides file names. It could be the
OID of the database and maybe the OID of the namespace, because a
dictionary can be created in different schemas.

I think your proposals may be implemented in several patches, so they can
be applied independently but consistently. I suppose I will prepare a new
version of the patch with fixes and with an initial design of the new
functions and commands soon.


1 - https://www.postgresql.org/docs/current/static/sql-createtstemplate.html

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 01/13/2018 04:22 PM, Arthur Zakirov wrote:
> Hello,
> 
> Thank you Tomas for your review.
> 
> On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
>> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
>> string into the context? I admit this is mostly nitpicking though.
> 
> ... snip ...
>> 8) One more thing - I've noticed that the hash table uses this key:
>> ...
>> That is, full paths to the two files, and I'm not sure that's a very
>> good idea. Firstly, it's a bit wasteful (1kB per path). But more
>> importantly it means all dictionaries referencing the same files will
>> share the same chunk of shared memory - not only within a single
>> database, but across the whole cluster. That may lead to surprising
>> behavior, because e.g. unloading a dictionary in one database will
>> affect dictionaries in all other databases referencing the same files.
> 
> Hm, indeed. It's worth using only file names instead of full paths. And it
> is a good idea to use more information besides file names. It could be the
> OID of the database and maybe the OID of the namespace, because a
> dictionary can be created in different schemas.
> 

I doubt using filenames (without the directory paths) solves anything,
really. The keys still have to be MAXPGPATH because someone could create
a very long filename. But I don't think memory consumption is such a big
deal, really. With 1000 dictionaries it's still just ~2MB of data, which
is negligible compared to the amount of memory saved by sharing the
dictionaries.

Not sure if we really need to add the database/schema OIDs. I mentioned
the unexpected consequences (cross-db sharing) but maybe that's a
feature we should keep (it reduces memory usage). So perhaps this should
be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
dictionary with other databases?

Aren't we overengineering this?

> I think your proposals may be implemented in several patches, so they can
> be applied independently but consistently. I suppose I will prepare a new
> version of the patch with fixes and with an initial design of the new
> functions and commands soon.
> 

Yes, splitting patches into smaller, more focused bits is a good idea.

BTW the current patch fails to document the dictionary sharing. It only
mentions it when describing the shared_dictionaries GUC. IMHO the right
place for additional details is

https://www.postgresql.org/docs/10/static/textsearch-dictionaries.html


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
> Not sure if we really need to add the database/schema OIDs. I mentioned
> the unexpected consequences (cross-db sharing) but maybe that's a
> feature we should keep (it reduces memory usage). So perhaps this should
> be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
> dictionary with other databases?
> 
> Aren't we overengineering this?

Another related problem I've noticed is a memory leak: when a dictionary is
loaded and then dropped, it won't be unloaded. I see several approaches:

1 - Use the OID of the dictionary itself as the key instead of dictfile and
afffile. When the dictionary is dropped it can easily be unloaded if it was
loaded. Implementation should be easy, but the drawback is more memory
consumption.
2 - Use a reference counter with cross-db sharing. When the dictionary is
loaded the counter increases. When all records referencing the loaded
dictionary are dropped, it will be unloaded.
3 - Use reference counters without cross-db sharing, to avoid possible
confusion. Here dictfile, afffile and the database OID would be used as the
key.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 01/15/2018 08:02 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
>> Not sure if we really need to add the database/schema OIDs. I mentioned
>> the unexpected consequences (cross-db sharing) but maybe that's a
>> feature we should keep (it reduces memory usage). So perhaps this should
>> be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
>> dictionary with other databases?
>>
>> Aren't we overengineering this?
> 
> Another related problem I've noticed is a memory leak: when a
> dictionary is loaded and then dropped, it won't be unloaded.
> 

Good point.

> I see several approaches:
> 1 - Use the OID of the dictionary itself as the key instead of dictfile and
> afffile. When the dictionary is dropped it can easily be unloaded if it was
> loaded. Implementation should be easy, but the drawback is more memory
> consumption.
> 2 - Use a reference counter with cross-db sharing. When the dictionary is
> loaded the counter increases. When all records referencing the loaded
> dictionary are dropped, it will be unloaded.
> 3 - Use reference counters without cross-db sharing, to avoid possible
> confusion. Here dictfile, afffile and the database OID would be used as the
> key.
> 

I think you're approaching the problem from the wrong direction, hence
asking the wrong question.

I think the primary question is "Do we want to share dictionaries across
databases?" and the answer will determine which of the three options is
the right one.

Another important consideration is the complexity of the patch. In fact, I
suggest making it your goal to keep the initial patch as simple as possible.
If something is "nice to have" it may wait for v2.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they can
> be applied independently but consistently. I suppose I will prepare a new
> version of the patch with fixes and with an initial design of the new
> functions and commands soon.

I attached a new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. A hash table is created during
the first call of a text search function in an instance. The hash table uses
the OID of a dictionary instead of file names, so there is no cross-db
sharing at all.

Added the max_shared_dictionaries_size GUC instead of shared_dictionaries.
In the current version it can be set only at server start. If a dictionary
is allocated in a backend's memory instead of shared memory, then a LOG
message is raised which includes the OID of the dictionary.

Fixed the memory leak. When a dictionary is removed or the dictionary cache
is invalidated, ts_dict_shmem_release() is called. It unpins the mapping of
a dictionary; if the reference count reaches zero, the DSM segment is
unpinned too, so the allocated shared memory is released by Postgres. A
sketch of this logic is below.
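
In outline (a sketch with assumed names - DictEntry, refcnt - rather than
the patch's exact code):

    /* Called on DROP of the dictionary or on cache invalidation. */
    static void
    ts_dict_shmem_release(DictEntry *entry, dsm_segment *seg)
    {
        /* this backend no longer uses its mapping of the dictionary */
        dsm_detach(seg);

        /* last reference in the cluster: allow the segment to go away */
        if (--entry->refcnt == 0)
            dsm_unpin_segment(entry->handle);
    }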

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (and tmplinit in general) now has a second argument, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries, including contrib extensions, because of the second argument for tmplinit.

tmplinit has the following signature now:

dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added the pg_ts_shared_dictionaries() function and the
pg_ts_shared_dictionaries system view. They return a list of dictionaries
currently in shared memory, with the columns:
- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added the SharedMemory option for the Ispell dictionary template. It is true
by default, because I think it would be good that people don't have to do
anything to get dictionaries allocated in shared memory.

Setting SharedMemory=false via ALTER TEXT SEARCH DICTIONARY doesn't have an
immediate effect. That is because ALTER doesn't force invalidation of the
dictionary cache, if I'm not mistaken.

> 3) How do I unload a dictionary from the shared memory?
> ...
>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
>     ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use the functions
ts_unload() and ts_reload() instead of new syntax. We already have text
search functions like ts_lexize() and ts_debug(), and it is better to keep
consistency. I think there are two approaches for ts_unload():

- use DSM's pin and unpin methods and the invalidation callback, as was done
for the memory leak fix. The drawback is that it won't have an immediate
effect, because the DSM segment will be released only when all backends
unpin their DSM mapping.

- use DSA and the dsa_free() method. As far as I understand, dsa_free()
frees the allocated memory immediately. But it requires more work, because
we will need some more locks. For instance, what happens when someone calls
ts_lexize() and someone else calls dsa_free() at the same time?
 

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need for compact_palloc0() anymore. Tests
show that the Czech dictionary doesn't consume more memory after the patch.

Tests
-----

I've measured the creation time of the dictionaries on my 64-bit machine.
You can get them from [1]. Here master is
434e6e1484418c55561914600de9e180fc408378. I've measured the French
dictionary too because it has an even bigger affix file than the Czech
dictionary.

With patch:
czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:
czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):
czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 619280 used

Here the French dictionary uses more backend memory because it has a big
affix file. Regular expression structures are still stored in backend
memory.

Master (backend's memory):
czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 chunks); 11420352 used

You can see that the dictionaries now take almost half as much memory.

pgbench with select only script:

SELECT ts_lexize('czech_hunspell', 'slon');
patch: 30431 TPS
master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant'):
patch: 35029 TPS
master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
patch: 22264 TPS
master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
Hi,

On 01/24/2018 06:20 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
>> I think your proposals may be implemented in several patches, so
>> they can be applyed independently but consistently. I suppose I
>> will prepare new version of the patch with fixes and with initial
>> design of new functions and commands soon.
> 
> I attached new version of the patch.
> 

Thanks. I don't have time to review/test this before FOSDEM, but a
couple of comments regarding some of the points you mentioned.

>> 3) How do I unload a dictionary from the shared memory?
>> ...
>>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>>
>> 4) How do I reload a dictionary?
>> ...
>>     ALTER TEXT SEARCH DICTIONARY x RELOAD
> 
> I thought about it. And it seems to me that we can use functions 
> ts_unload() and ts_reload() instead of new syntax. We already have
> text search functions like ts_lexize() and ts_debug(), and it is
> better to keep consistency.

This argument seems a bit strange. Both ts_lexize() and ts_debug() are
operating on text values, and are meant to be executed as functions from
SQL - particularly ts_lexize(). It's hard to imagine this implemented as
DDL commands.

The unload/reload is something that operates on a database object
(dictionary), which already has create/drop/alter DDL. So it seems
somewhat natural to treat unload/reload as another DDL action.

Taken to an extreme, this argument would essentially mean we should not
have any DDL commands because we have SQL functions.

That being said, I'm not particularly attached to having this DDL now.
Implementing it seems straight-forward (particularly when we already
have the stuff implemented as functions), and some of the other open
questions seem more important to tackle now.

> I think there are two approaches for ts_unload():
> - use DSM's pin and unpin methods and the invalidation callback, as was
> done for the memory leak fix. The drawback is that it won't have an
> immediate effect, because the DSM segment will be released only when all
> backends unpin their DSM mapping.
> - use DSA and the dsa_free() method. As far as I understand, dsa_free()
> frees the allocated memory immediately. But it requires more work, because
> we will need some more locks. For instance, what happens when someone
> calls ts_lexize() and someone else calls dsa_free() at the same time?


No opinion on this yet, I have to think about it for a bit and look at
the code first.

>> 7) You mentioned you had to get rid of the compact_palloc0 - can you
>> elaborate a bit why that was necessary? Also, when benchmarking the
>> impact of this make sure to measure not only the time but also memory
>> consumption.
> 
> It seems to me that there is no need for compact_palloc0() anymore. Tests
> show that the Czech dictionary doesn't consume more memory after the
> patch.
> 

That's interesting. I'll do some additional tests to verify the finding.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
2018-01-24 20:57 GMT+03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
Thanks. I don't have time to review/test this before FOSDEM, but a
couple of comments regarding some of the points you mentioned.

Thank you for your thoughts.
 
> I thought about it. And it seems to me that we can use functions
> ts_unload() and ts_reload() instead of new syntax. We already have
> text search functions like ts_lexize() and ts_debug(), and it is
> better to keep consistency.

This argument seems a bit strange. Both ts_lexize() and ts_debug() are
operating on text values, and are meant to be executed as functions from
SQL - particularly ts_lexize(). It's hard to imagine this implemented as
DDL commands.

The unload/reload is something that operates on a database object
(dictionary), which already has create/drop/alter DDL. So it seems
somewhat natural to treat unload/reload as another DDL action.

Taken to an extreme, this argument would essentially mean we should not
have any DDL commands because we have SQL functions.

That being said, I'm not particularly attached to having this DDL now.
Implementing it seems straight-forward (particularly when we already
have the stuff implemented as functions), and some of the other open
questions seem more important to tackle now.

I understand your opinion. I don't have a strong opinion on the subject yet.
And I agree that these can be implemented as future improvements for shared
dictionaries.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Ildus Kurbangaliev
Date:
On Wed, 24 Jan 2018 20:20:41 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

Hi, I did some review of the patch.

In 0001 there are a few lines where only the indentation has changed.

0002:
- TsearchShmemSize - calculating size using hash_estimate_size seems
redundant since you use DSA hash now.
- ts_dict_shmem_release - LWLockAcquire in the beginning makes no
  sense, since dict_table couldn't change anyway.

0003:
- ts_dict_shmem_location could return IspellDictData, it makes more
  sense.

0006:
It's very subjective, but I think it would be nicer to call the option
Shared (as a property of the dictionary) or UseSharedMemory; a boolean
option called SharedMemory sounds weird.

Overall the patches look good, and all tests passed. I tried to break it in
a few places where I thought it could be unsafe, but did not succeed.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

Thank you for your review! Good catches.

On Thu, Jan 25, 2018 at 03:26:46PM +0300, Ildus Kurbangaliev wrote:
> In 0001 there are a few lines where only the indentation has changed.

Fixed.

> 0002:
> - TsearchShmemSize - calculating size using hash_estimate_size seems
> redundant since you use DSA hash now.

Fixed. True, there is no need in hash_estimate_size anymore.

> - ts_dict_shmem_release - LWLockAcquire in the beginning makes no
>   sense, since dict_table couldn't change anyway.

Fixed. In an earlier version tsearch_ctl was used here, but I forgot to remove the LWLockAcquire call.

> 0003:
> - ts_dict_shmem_location could return IspellDictData, it makes more
>   sense.

I assume that ts_dict_shmem_location() can be used by various types of
dictionaries, not only by Ispell. So void * is more suitable here.

> 0006:
> It's very subjective, but I think it would nicer to call option as
> Shared (as property of dictionary) or UseSharedMemory, the boolean
> option called SharedMemory sounds weird.

Agreed. In our offline conversation we settled on Shareable, that is, a
dictionary which can be shared. It may be more appropriate because setting
Shareable=true doesn't guarantee that a dictionary will be allocated in
shared memory, due to the max_shared_dictionaries_size GUC.

Attached new version of the patch.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, Jan 25, 2018 at 07:51:58PM +0300, Arthur Zakirov wrote:
> Attached new version of the patch.

Here is a rebased version of the patch due to changes in dict_ispell.c.
The patch itself wasn't changed.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:
Hi,

On 2018-02-07 19:28:29 +0300, Arthur Zakirov wrote:
> +    {
> +        {"max_shared_dictionaries_size", PGC_POSTMASTER, RESOURCES_MEM,
> +            gettext_noop("Sets the maximum size of all text search dictionaries loaded into shared memory."),
> +            gettext_noop("Currently controls only loading of Ispell dictionaries. "
> +                         "If total size of simultaneously loaded dictionaries "
> +                         "reaches the maximum allowed size then a new dictionary "
> +                         "will be loaded into local memory of a backend."),
> +            GUC_UNIT_KB,
> +        },
> +        &max_shared_dictionaries_size,
> +        100 * 1024, 0, MAX_KILOBYTES,
> +        NULL, NULL, NULL
> +    },

So this uses shared memory, allocated at server start?  That doesn't
seem right. Wouldn't it make more sense to have a
'num_shared_dictionaries' GUC, and then allocate them with dsm? Or even
better, not have any such limit and use a dshash table to point to the
individual loaded tables?

Is there any chance we can instead convert dictionaries into a form
we can just mmap() into memory?  That'd scale a lot higher and more
dynamically?

Regards,

Andres


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

Thank you for your comments.

On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2018-02-07 19:28:29 +0300, Arthur Zakirov wrote:
> > +    {
> > +        {"max_shared_dictionaries_size", PGC_POSTMASTER, RESOURCES_MEM,
> > +            gettext_noop("Sets the maximum size of all text search dictionaries loaded into shared memory."),
> > +            gettext_noop("Currently controls only loading of Ispell dictionaries. "
> > +                         "If total size of simultaneously loaded dictionaries "
> > +                         "reaches the maximum allowed size then a new dictionary "
> > +                         "will be loaded into local memory of a backend."),
> > +            GUC_UNIT_KB,
> > +        },
> > +        &max_shared_dictionaries_size,
> > +        100 * 1024, 0, MAX_KILOBYTES,
> > +        NULL, NULL, NULL
> > +    },
> 
> So this uses shared memory, allocated at server start?  That doesn't
> seem right. Wouldn't it make more sense to have a
> 'num_shared_dictionaries' GUC, and then allocate them with dsm? Or even
> better, not have any such limit and use a dshash table to point to the
> individual loaded tables?

The patch already uses DSM and a dshash table. The
'max_shared_dictionaries_size' GUC was introduced after discussion with
Tomas [1], to limit the amount of memory consumed by loaded dictionaries and
to prevent possible memory bloat. Its default value is 100MB.

There was a 'shared_dictionaries' GUC before; it was introduced because
regular hash tables were used then, not dshash. I replaced the regular hash
tables with dshash, removed 'shared_dictionaries' and added
'max_shared_dictionaries_size'.

> Is there any chance we can instead convert dictionaries into a form
> we can just mmap() into memory?  That'd scale a lot higher and more
> dynamically?

I think the new IspellDictData structure (in 0003-Store-ispell-structures-in-shmem-v5.patch)
can already be stored in a binary file and mapped into memory. But
mmap() is not used in this patch yet.

I can do some experiments and make a prototype.


1 - https://www.postgresql.org/message-id/d12d9395-922c-64c9-c87d-dd0e1d31440e%402ndquadrant.com

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Feb 07, 2018 at 07:28:29PM +0300, Arthur Zakirov wrote:
> Here is rebased version of the patch due to changes into dict_ispell.c.
> The patch itself wasn't changed.

Here is a rebased version of the patch due to changes within pg_proc.h.
I haven't implemented an mmap prototype yet, though.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello Andres,

On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> Is there any chance we can instead convert dictionaries into a form
> we can just mmap() into memory?  That'd scale a lot higher and more
> dynamically?

To avoid misunderstanding, can you please elaborate on using mmap()? The
DSM approach looks simpler and requires less code. Also, DSM may
use mmap() itself, if I'm not mistaken.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> Hello Andres,
> 
> On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
>> Is there any chance we can instead convert dictionaries into a form
>> we can just mmap() into memory?  That'd scale a lot higher and more
>> dynamically?
> 
> To avoid misunderstanding, can you please elaborate on using mmap()? The
> DSM approach looks simpler and requires less code. Also, DSM may
> use mmap() itself, if I'm not mistaken.
> 

I think the mmap() idea is that you preprocess the dictionary, store the
result in a file, and then mmap it when needed, without the expensive
preprocessing.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Mar 07, 2018 at 10:55:29AM +0100, Tomas Vondra wrote:
> On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> > Hello Andres,
> > 
> > On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> >> Is there any chance we can instead convert dictionaries into a form
> >> we can just mmap() into memory?  That'd scale a lot higher and more
> >> dynamically?
> > 
> > To avoid misunderstanding, can you please elaborate on using mmap()? The
> > DSM approach looks simpler and requires less code. Also, DSM may
> > use mmap() itself, if I'm not mistaken.
> > 
> 
> I think the mmap() idea is that you preprocess the dictionary, store the
> result in a file, and then mmap it when needed, without the expensive
> preprocessing.

Understood. I'm not against the mmap() approach, I just lack an
understanding of the mmap() benefits... The current shared Ispell approach
requires preprocessing after a server restart, and the main advantage of
mmap() here is that it doesn't require preprocessing after restarting.

Speaking about the implementation.

It seems that the most appropriate place to store the preprocessed files is
the 'pg_dynshmem' folder. The file prefix could be 'ts_dict.'; otherwise
dsm_cleanup_for_mmap() will remove them.

I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(), but
maybe it's worth reusing them. A sketch of the mmap() idea is below.
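
For illustration, mapping a preprocessed dictionary file could look like
this (a POSIX sketch of the prototype idea; the file format and path
handling are assumptions):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a previously serialized dictionary file read-only. */
    static void *
    map_preprocessed_dict(const char *path, size_t *size)
    {
        int         fd = open(path, O_RDONLY);
        struct stat st;
        void       *addr;

        if (fd < 0)
            return NULL;            /* caller falls back to rebuilding */
        if (fstat(fd, &st) < 0)
        {
            close(fd);
            return NULL;
        }

        addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping survives the close */

        *size = (size_t) st.st_size;
        return (addr == MAP_FAILED) ? NULL : addr;
    }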

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Pavel Stehule
Date:


2018-03-07 12:55 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
On Wed, Mar 07, 2018 at 10:55:29AM +0100, Tomas Vondra wrote:
> On 03/07/2018 09:55 AM, Arthur Zakirov wrote:
> > Hello Andres,
> >
> > On Thu, Mar 01, 2018 at 08:31:49PM -0800, Andres Freund wrote:
> >> Is there any chance we can instead can convert dictionaries into a form
> >> we can just mmap() into memory?  That'd scale a lot higher and more
> >> dynamicallly?
> >
> > To avoid misunderstanding can you please elaborate on using mmap()? The
> > DSM approach looks like more simple and requires less code. Also DSM may
> > use mmap() if I'm not mistaken.
> >
>
> I think the mmap() idea is that you preprocess the dictionary, store the
> result in a file, and then mmap it when needed, without the expensive
> preprocessing.

Understood. I'm not against the mmap() approach, I just lack an
understanding of the mmap() benefits... The current shared Ispell approach
requires preprocessing after a server restart, and the main advantage of
mmap() here is that it doesn't require preprocessing after restarting.

Speaking about the implementation.

It seems that the most appropriate place to store the preprocessed files is
the 'pg_dynshmem' folder. The file prefix could be 'ts_dict.'; otherwise
dsm_cleanup_for_mmap() will remove them.

I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(), but
maybe it's worth reusing them.

I don't think serialization to a file (mmap) makes much sense. But the shared dictionary should be loaded every time, and should be released every time if possible. Maybe there can be some background worker that holds the dictionary in memory.

Regards

Pavel
 

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Mar 07, 2018 at 01:02:07PM +0100, Pavel Stehule wrote:
> > Understood. I'm not against the mmap() approach, I just lack an
> > understanding of the mmap() benefits... The current shared Ispell
> > approach requires preprocessing after a server restart, and the main
> > advantage of mmap() here is that it doesn't require preprocessing after
> > restarting.
> >
> > Speaking about the implementation.
> >
> > It seems that the most appropriate place to store the preprocessed files
> > is the 'pg_dynshmem' folder. The file prefix could be 'ts_dict.';
> > otherwise dsm_cleanup_for_mmap() will remove them.
> >
> > I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(), but
> > maybe it's worth reusing them.
> >
> 
> I don't think serialization to a file (mmap) makes much sense. But the
> shared dictionary should be loaded every time, and should be released every
> time if possible. Maybe there can be some background worker that holds the
> dictionary in memory.

Do you mean that a shared dictionary should be reloaded if its .affix
and .dict files were changed? IMHO we can store their last modification
timestamps in the preprocessed file, and then we can rebuild the
dictionary if the files were changed.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Pavel Stehule
Date:


2018-03-07 13:43 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
On Wed, Mar 07, 2018 at 01:02:07PM +0100, Pavel Stehule wrote:
> > Understood. I'm not against the mmap() approach, I just lack an
> > understanding of the mmap() benefits... The current shared Ispell
> > approach requires preprocessing after a server restart, and the main
> > advantage of mmap() here is that it doesn't require preprocessing after
> > restarting.
> >
> > Speaking about the implementation.
> >
> > It seems that the most appropriate place to store the preprocessed files
> > is the 'pg_dynshmem' folder. The file prefix could be 'ts_dict.';
> > otherwise dsm_cleanup_for_mmap() will remove them.
> >
> > I'm not sure about reusing dsm_impl_mmap() and dsm_impl_windows(), but
> > maybe it's worth reusing them.
> >
>
> I don't think serialization to a file (mmap) makes much sense. But the
> shared dictionary should be loaded every time, and should be released every
> time if possible. Maybe there can be some background worker that holds the
> dictionary in memory.

Do you mean that a shared dictionary should be reloaded if its .affix
and .dict files were changed? IMHO we can store their last modification
timestamps in the preprocessed file, and then we can rebuild the
dictionary if the files were changed.

No, it is not necessary - there should just be commands (functions) to preload and unload a dictionary.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > Do you mean that a shared dictionary should be reloaded if its .affix
> > and .dict files were changed? IMHO we can store their last modification
> > timestamps in the preprocessed file, and then we can rebuild the
> > dictionary if the files were changed.
> >
> 
> No, it is not necessary - there should just be commands (functions) to
> preload and unload a dictionary.

Oh, understood. Tomas suggested those commands earlier too. I'll
implement them. But I think it is better to track file modification times
too. Because now, without the patch, users don't have to call additional
commands to refresh their dictionaries, so without such tracking we'll
make dictionary maintenance harder.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Pavel Stehule
Date:


2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > Do you mean that a shared dictionary should be reloaded if its .affix
> > and .dict files were changed? IMHO we can store their last modification
> > timestamps in the preprocessed file, and then we can rebuild the
> > dictionary if the files were changed.
> >
>
> No, it is not necessary - just there should be commands (functions)  for
> preload dictiory and unload dictionary.

Oh understood. Tomas suggested those commands too earlier. I'll
implement them. But I think it is better to track files modification time
too. Because now, without the patch, users don't have to call additional
commands to refresh their dictionaries, so without such tracking we'll
made dictionaries maintenance harder.

Postgres doesn't have any subsystem based on modification time, so I don't see introducing this sensitivity as practical.

Regards

Pavel

 

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Pavel Stehule
Date:


2018-03-07 14:10 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:


2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
On Wed, Mar 07, 2018 at 01:47:25PM +0100, Pavel Stehule wrote:
> > Do you mean that a shared dictionary should be reloaded if its .affix
> > and .dict files were changed? IMHO we can store their last modification
> > timestamps in a preprocessed file, and then rebuild the dictionary if
> > the files change.
> >
>
> No, it is not necessary - there should just be commands (functions) to
> preload and unload a dictionary.

Oh, understood. Tomas suggested those commands earlier too. I'll
implement them. But I think it is better to track file modification times
as well. Right now, without the patch, users don't have to call additional
commands to refresh their dictionaries, so without such tracking we would
make dictionary maintenance harder.

Postgres doesn't have any subsystem based on modification time, so I don't see introducing this sensitivity as practical.

Usually shared dictionaries are used for complex language-based full text search. These dictionaries are updated less frequently than PostgreSQL itself. The Czech dictionary has been the same for 10 years.

Regards

Pavel
 

Regards

Pavel

 

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Mar 07, 2018 at 02:12:32PM +0100, Pavel Stehule wrote:
> 2018-03-07 14:10 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
> > 2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
> >> Oh, understood. Tomas suggested those commands earlier too. I'll
> >> implement them. But I think it is better to track file modification
> >> times as well. Right now, without the patch, users don't have to call
> >> additional commands to refresh their dictionaries, so without such
> >> tracking we would make dictionary maintenance harder.
> >>
> >
> > Postgres doesn't have any subsystem based on modification time, so I
> > don't see introducing this sensitivity as practical.
> >
> 
> Usually shared dictionaries are used for complex language-based full
> text search. These dictionaries are updated less frequently than
> PostgreSQL itself. The Czech dictionary has been the same for 10 years.

Agreed. In that case auto reloading isn't an important feature here.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 03/07/2018 02:18 PM, Arthur Zakirov wrote:
> On Wed, Mar 07, 2018 at 02:12:32PM +0100, Pavel Stehule wrote:
>> 2018-03-07 14:10 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
>>> 2018-03-07 13:58 GMT+01:00 Arthur Zakirov <a.zakirov@postgrespro.ru>:
>>>> Oh, understood. Tomas suggested those commands earlier too. I'll
>>>> implement them. But I think it is better to track file modification
>>>> times as well. Right now, without the patch, users don't have to call
>>>> additional commands to refresh their dictionaries, so without such
>>>> tracking we would make dictionary maintenance harder.
>>>>
>>>
>>> Postgres doesn't have any subsystem based on modification time, so I
>>> don't see introducing this sensitivity as practical.
>>>
>>
>> Usually shared dictionaries are used for complex language-based full
>> text search. These dictionaries are updated less frequently than
>> PostgreSQL itself. The Czech dictionary has been the same for 10 years.
> 
> Agreed. In that case auto reloading isn't an important feature here.
> 

Arthur, what are your plans with this patch in the current CF?

It does not seem to be moving towards RFC very much, and reworking the
patch to use mmap() seems like quite a significant change late in the
CF, which means it's likely to cause the patch to get bumped to the
next CF (2018-09).

FWIW I am not quite sure if the mmap() approach is better than what was
implemented by the patch. I'm not sure how exactly will it behave under
memory pressure (AFAIK it goes through page cache, which means random
parts of dictionaries might get evicted) or how well is it supported on
various platforms (say, Windows).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello Tomas,

Arthur, what are your plans with this patch in the current CF?

I think the dsm-based approach is in good shape already and works nicely.
I've planned only to improve the documentation a little. Also, it seems I should change the 0004 part; I found that the extension upgrade scripts may have been done in the wrong way.
In my opinion the RELOAD and UNLOAD commands can be done in the next commitfest (2018-09).
Did you look at it? Do you have any comments on how the shared memory allocation and releasing functions are implemented?
 

It does not seem to be moving towards RFC very much, and reworking the
patch to use mmap() seems like quite a significant change late in the
CF, which means it's likely to cause the patch to get bumped to the
next CF (2018-09).

Agreed. I have a draft version of the mmap-based approach which works on platforms with mmap. On Windows it is necessary to use another API (CreateFileMapping, etc). But this approach requires more work on handling preprocessed dictionary files (how to name them, when to remove them).
 

FWIW I am not quite sure if the mmap() approach is better than what was
implemented by the patch. I'm not sure how exactly will it behave under
memory pressure (AFAIK it goes through page cache, which means random
parts of dictionaries might get evicted) or how well is it supported on
various platforms (say, Windows).

Yes, as I wrote, the mmap-based approach requires more work. The only benefit I see is that you don't need to preprocess a dictionary after a server restart.  I'd vote for the dsm-based approach.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 03/17/2018 05:43 AM, Arthur Zakirov wrote:
> Hello Tomas,
> 
>     Arthur, what are your plans with this patch in the current CF?
> 
> 
> I think the dsm-based approach is in good shape already and works nicely.
> I've planned only to improve the documentation a little. Also, it seems I
> should change the 0004 part; I found that the extension upgrade scripts
> may have been done in the wrong way.
> In my opinion the RELOAD and UNLOAD commands can be done in the next
> commitfest (2018-09).
> Did you look at it? Do you have any comments on how the shared memory
> allocation and releasing functions are implemented?
>  
> 
> 
>     It does not seem to be moving towards RFC very much, and reworking the
>     patch to use mmap() seems like quite a significant change late in the
>     CF, which means it's likely to cause the patch to get bumped to the
>     next CF (2018-09).
> 
> 
> Agreed. I have a draft version of the mmap-based approach which works on
> platforms with mmap. On Windows it is necessary to use another API
> (CreateFileMapping, etc). But this approach requires more work on
> handling preprocessed dictionary files (how to name them, when to remove them).
>  
> 
> 
>     FWIW I am not quite sure if the mmap() approach is better than what was
>     implemented by the patch. I'm not sure how exactly will it behave under
>     memory pressure (AFAIK it goes through page cache, which means random
>     parts of dictionaries might get evicted) or how well is it supported on
>     various platforms (say, Windows).
> 
> 
> Yes, as I wrote, the mmap-based approach requires more work. The only
> benefit I see is that you don't need to preprocess a dictionary after a
> server restart.  I'd vote for the dsm-based approach.
> 

I do agree with that. We have a working well-understood dsm-based
solution, addressing the goals initially explained in this thread.

I don't see a reason to stall this patch based on a mere assumption that
the mmap-based approach might be magically better in some unknown
aspects. It might be, but we may as well leave that as a future work.

I wonder how much of this patch would be affected by the switch from dsm
to mmap? I guess the memory limit would get mostly irrelevant (mmap
would rely on the OS to page the memory in/out depending on memory
pressure), and so would the UNLOAD/RELOAD commands (because each backend
would do its own mmap).
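For reference, the per-backend mmap path being discussed would look roughly
like this (a POSIX-only sketch with error handling omitted, not part of the
patch):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *
map_preprocessed_dict(const char *path, size_t *size)
{
    int         fd = open(path, O_RDONLY);
    struct stat st;
    void       *base;

    fstat(fd, &st);
    base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close() */

    *size = st.st_size;
    return base;
}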

In any case, I suggest polishing the dsm-based patch, and seeing if we can
get that one into PG11.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:
Hi,

On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote:
> I do agree with that. We have a working well-understood dsm-based
> solution, addressing the goals initially explained in this thread.

Well, it's also awkward and manual to use. I do think that's something
we've to pay attention to.


> I wonder how much of this patch would be affected by the switch from dsm
> to mmap? I guess the memory limit would get mostly irrelevant (mmap
> would rely on the OS to page the memory in/out depending on memory
> pressure), and so would the UNLOAD/RELOAD commands (because each backend
> would do its own mmap).

Those seem fairly major.

Greetings,

Andres Freund


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Arthur Zakirov wrote:
> I've planned only to improve the documentation a little. Also it seems I
> should change 0004 part, I found that extension upgrade scripts may be made
> in wrong way.

I've attached a new version of the patch. In this version I removed
0004-Update-tmplinit-arguments-v6.patch. In my opinion it handled
extension upgrades in the wrong way. If I'm not mistaken, there is
currently no way to upgrade a template's init function signature, and I
didn't find a way to change init_method(internal) to
init_method(internal, internal) within an extension's upgrade script.

Therefore I added 0002-Change-tmplinit-argument-v7.patch. Now a
DictInitData struct is passed to a template's init method. It contains
the necessary data: dictoptions and dictid. And there is no need to
change the method's signature.

Other parts of the patch are the same, except that they use the
DictInitData structure now.

On Mon, Mar 19, 2018 at 01:52:41AM +0100, Tomas Vondra wrote:
> I wonder how much of this patch would be affected by the switch from dsm
> to mmap? I guess the memory limit would get mostly irrelevant (mmap
> would rely on the OS to page the memory in/out depending on memory
> pressure), and so would the UNLOAD/RELOAD commands (because each backend
> would do its own mmap).

I believe mmap would require completely rewriting the 0003 part of the
patch and small changes in 0005.

> In any case, I suggest polishing the dsm-based patch, and seeing if we can
> get that one into PG11.

Yes, we have more time in future commitfests if the dsm-based patch
isn't approved.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Ildus Kurbangaliev
Date:
On Mon, 19 Mar 2018 14:06:50 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

> 
> I believe mmap would require completely rewriting the 0003 part of the
> patch and small changes in 0005.
> 
> > In any case, I suggest polishing the dsm-based patch, and seeing if we
> > can get that one into PG11.
> 
> Yes, we have more time in future commitfests if the dsm-based patch
> isn't approved.
> 

Hi, I'm not sure about the mmap approach; it would just bring other
problems. I like the dsm approach because it doesn't invent any new
files in the database, whereas the mmap approach will possibly require a
new folder in the data directory and management of a bunch of new files,
with additional issues related to pg_upgrade and so on. Also, with the
dsm approach, if someone needs to update dictionaries then he (or his
package manager) can just replace the files and be done with it.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 03/19/2018 02:34 AM, Andres Freund wrote:
> Hi,
> 
> On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote:
>> I do agree with that. We have a working well-understood dsm-based
>> solution, addressing the goals initially explained in this thread.
> 
> Well, it's also awkward and manual to use. I do think that's
> something we've to pay attention to.
> 

Awkward in what sense?

I don't think the manual aspect is an issue. Currently we have no way to
reload the dictionary, except for restarting all the backends. I don't
see that as a particularly convenient solution. Also, this is pretty
much how the shared_ispell extension works, although you might argue
that was more due to the limitation of how shared memory could be used
in extensions before DSM was introduced. In any case, I've never heard
complaints about this aspect of the extension.

There are two things that might be automated - reloading of
dictionaries and evicting them when hitting the memory limit. I have
tried to implement that in the shared_ispell extension but it's a bit
more complicated than it looks.

For example, it seems obvious to reload the dictionary when the file
timestamp changes. But in fact there are three files - dict, affixes,
stopwords. So will you reload when a single file changes? All of them?
Keep in mind that the new version of the dictionary may use different
affixes, so a reload at the wrong moment may produce broken results.
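If we ever did automate it, one conservative sketch (the helper shape is
assumed, this is not patch code) is to capture the mtimes of all three files
before reading them, and verify nothing moved afterwards, retrying otherwise:

#include <stdbool.h>
#include <sys/stat.h>

static bool
dict_files_stable(const char *paths[3], const struct stat before[3])
{
    int         i;

    for (i = 0; i < 3; i++)     /* dict, affix, stopword files */
    {
        struct stat now;

        if (stat(paths[i], &now) != 0 ||
            now.st_mtime != before[i].st_mtime)
            return false;       /* caught a mid-update change; retry */
    }
    return true;
}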

> 
>> I wonder how much of this patch would be affected by the switch 
>> from dsm to mmap? I guess the memory limit would get mostly 
>> irrelevant (mmap would rely on the OS to page the memory in/out 
>> depending on memory pressure), and so would the UNLOAD/RELOAD 
>> commands (because each backend would do its own mmap).
> 
> Those seem fairly major.
> 

I'm not sure I'd say those are major. And you might also see the lack of
these capabilities as negative points for the mmap approach.

So, I'm not at all convinced the mmap approach is actually better than
the dsm one. And if we come up with a good way to automate some of the
tasks, I don't see why that would be possible with mmap and not with dsm.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:
On 2018-03-19 14:52:34 +0100, Tomas Vondra wrote:
> On 03/19/2018 02:34 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote:
> >> I do agree with that. We have a working well-understood dsm-based
> >> solution, addressing the goals initially explained in this thread.
> > 
> > Well, it's also awkward and manual to use. I do think that's
> > something we've to pay attention to.
> > 
> 
> Awkward in what sense?

You've to manually configure a setting that can only be set at server
start.  You can't set it as big as necessary because it might use up
memory better used for other things.  It needs the full space for
dictionaries even if the majority of it never will be needed.  All of
those aren't needed in an mmap world.


> So, I'm not at all convinced the mmap approach is actually better than
> the dsm one. And I believe that if we come up with a good way to
> automate some of the tasks, I don't see why would that be possible in
> the mmap and not dsm.

To me it seems we'll end up needing a heck of a lot more code that the
OS already implements if we do it ourselves.

Greetings,

Andres Freund


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 03/19/2018 07:07 PM, Andres Freund wrote:
> On 2018-03-19 14:52:34 +0100, Tomas Vondra wrote:
>> On 03/19/2018 02:34 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-03-19 01:52:41 +0100, Tomas Vondra wrote:
>>>> I do agree with that. We have a working well-understood dsm-based
>>>> solution, addressing the goals initially explained in this thread.
>>>
>>> Well, it's also awkward and manual to use. I do think that's
>>> something we've to pay attention to.
>>>
>>
>> Awkward in what sense?
> 
> You've to manually configure a setting that can only be set at server
> start.  You can't set it as big as necessary because it might use up
> memory better used for other things.  It needs the full space for
> dictionaries even if the majority of it never will be needed.  All of
> those aren't needed in an mmap world.
> 

Which is not quite true, because that's not what the patch does.

Each dictionary is loaded into a separate dsm segment when needed, which
is then stored in a dshash table. So most of what you wrote is not really
true - the patch does not pre-allocate the space, and the setting could
be changed even after server start (it's not defined like that currently,
but that should be trivial to change).
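For context, the loading path is roughly the following (a sketch;
TsearchDictEntry is the patch's hash entry, but the "handle" field name is
an assumption):

#include "postgres.h"
#include "storage/dsm.h"

static void *
load_dict_segment(TsearchDictEntry *entry, const void *dict, Size size)
{
    dsm_segment *seg = dsm_create(size, 0);

    memcpy(dsm_segment_address(seg), dict, size);
    dsm_pin_segment(seg);       /* survive until postmaster shutdown */
    dsm_pin_mapping(seg);       /* stay mapped in this backend */
    entry->handle = dsm_segment_handle(seg);

    return dsm_segment_address(seg);
}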

> 
>> So, I'm not at all convinced the mmap approach is actually better
>> than the dsm one. And I believe that if we come up with a good way
>> to automate some of the tasks, I don't see why would that be
>> possible in the mmap and not dsm.
> 
> To me it seems we'll end up needing a heck of a lot more code that
> the OS already implements if we do it ourselves.
> 

Like what? Which features do you expect to need much more code?

The automated reloading will need a fairly small amount of code - the
main issue is deciding when to reload, and as I mentioned before that's
more complicated than you seem to believe. In fact, it may not even be
possible - there's no way to decide if all files are already updated.
Currently we kinda ignore that, on the assumption that dictionaries
change only rarely. We may do the same thing and reload the dict if at
least one file changes. In any case, the amount of code is trivial.

In fact, it may be more complicated in the mmap case - how do you update
a dictionary that is already mapped to multiple processes?

The eviction is harder - I'll give you that. But then again, I'm not
sure the mmap approach is really what we want here - it seems better to
evict the whole dictionary than some random pages from many of them.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Mon, Mar 19, 2018 at 07:40:54PM +0100, Tomas Vondra wrote:
> 
> 
> On 03/19/2018 07:07 PM, Andres Freund wrote:
> > You've to manually configure a setting that can only be set at server
> > start.  You can't set it as big as necessary because it might use up
> > memory better used for other things.  It needs the full space for
> > dictionaries even if the majority of it never will be needed.  All of
> > those aren't needed in an mmap world.
> > 
> 
> Which is not quite true, because that's not what the patch does.
> 
> Each dictionary is loaded into a separate dsm segment when needed, which
> is then stored in a dshash table. So most of what you wrote is not really
> true - the patch does not pre-allocate the space, and the setting might
> be set even after server start (it's not defined like that currently,
> but that should be trivial to change).

Oh, that's true. I had plans to fix it but somehow forgot to allow
changing the max_shared_dictionaries_size GUC via pg_reload_conf(). I'll
fix it and send a new version of the patch.

> > To me it seems we'll end up needing a heck of a lot more code that
> > the OS already implements if we do it ourselves.
> > 
> 
> Like what? Which features do you expect to need much more code?
> 
> The automated reloading will need a fairly small amount of code - the
> main issue is deciding when to reload, and as I mentioned before that's
> more complicated than you seem to believe. In fact, it may not even be
> possible - there's no way to decide if all files are already updated.
> Currently we kinda ignore that, on the assumption that dictionaries
> change only rarely. We may do the same thing and reload the dict if at
> least one file changes. In any case, the amount of code is trivial.
> 
> In fact, it may be more complicated in the mmap case - how do you update
> a dictionary that is already mapped to multiple processes?
> 
> The eviction is harder - I'll give you that. But then again, I'm not
> sure the mmap approach is really what we want here - it seems better to
> evict the whole dictionary, than some random pages from many of them.

Agreed. The mmap approach requires the same code plus code to handle the
cache files which will be mapped into memory. With the mmap approach we
need to solve the same issues we face now, and more. Also, we somehow
need to automatically reload dictionaries in both cases.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
Hi Arthur,

I went through the patch - just skimming through the diffs, will do more
testing tomorrow. Here are a few initial comments.

1) max_shared_dictionaries_size / PGC_POSTMASTER

I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it
can't be changed after server start. That seems like a fairly useful
thing to do (e.g. increase the limit while the server is running), and
after looking at the code I think it shouldn't be difficult to change.

The other thing I'd suggest is handling "-1" as "no limit".


2) max_shared_dictionaries_size / size of number

Some of the comments dealing with the GUC treat it as a number of
dictionaries (instead of a size). I suppose that's due to how the
original patch was implemented.


3) Assert(max_shared_dictionaries_size);

I'd say that assert is not very clear - it should be

    Assert(max_shared_dictionaries_size > 0);

or something along the lines. It's also a good idea to add a comment
explaining the assert, say

    /* we can only get here when shared dictionaries are enabled */
    Assert(max_shared_dictionaries_size > 0);

4) I took the liberty of rewording some of the docs/comments. See the
attached diffs, that should apply on top of 0003 and 0004 patches.
Please, treat those as mere suggestions.
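To make 1) and 2) concrete, the entry in guc.c's ConfigureNamesInt[] might
end up looking roughly like this (a sketch only; the default value and the
"-1 means no limit" wording are assumptions):

{
    {"max_shared_dictionaries_size", PGC_SIGHUP, RESOURCES_MEM,
        gettext_noop("Sets the maximum size of text search dictionaries "
                     "kept in shared memory."),
        gettext_noop("-1 means no limit."),
        GUC_UNIT_KB
    },
    &max_shared_dictionaries_size,
    100 * 1024, -1, MAX_KILOBYTES,
    NULL, NULL, NULL
},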


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

On Mon, Mar 19, 2018 at 08:50:46PM +0100, Tomas Vondra wrote:
> Hi Arthur,
> 
> I went through the patch - just skimming through the diffs, will do more
> testing tomorrow. Here are a few initial comments.

Thank you for the review!

> 1) max_shared_dictionaries_size / PGC_POSTMASTER
> 
> I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it
> can't be changed after server start. That seems like a fairly useful
> thing to do (e.g. increase the limit while the server is running), and
> after looking at the code I think it shouldn't be difficult to change.

max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a
check of the new value to disallow setting it to zero if there are
loaded dictionaries, and to disallow decreasing it below the currently
loaded size.

> The other thing I'd suggest is handling "-1" as "no limit".

I added the ability to set '-1', and fixed some comments and the
documentation.

> 2) max_shared_dictionaries_size / size of number
> 
> Some of the comments dealing with the GUC treat it as a number of
> dictionaries (instead of a size). I suppose that's due to how the
> original patch was implemented.

Fixed. Should be good now.

> 3) Assert(max_shared_dictionaries_size);
> 
> I'd say that assert is not very clear - it should be
> 
>     Assert(max_shared_dictionaries_size > 0);
> 
> or something along the lines. It's also a good idea to add a comment
> explaining the assert, say
> 
>     /* we can only get here when shared dictionaries are enabled */
>     Assert(max_shared_dictionaries_size > 0);

Fixed the assert and added the comment. I extended the assert so it also
takes the -1 value into account.

> 4) I took the liberty of rewording some of the docs/comments. See the
> attached diffs, that should apply on top of 0003 and 0004 patches.
> Please, treat those as mere suggestions.

I applied your diffs and added changes to max_shared_dictionaries_size.

Please find the attached new version of the patch.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 03/20/2018 02:11 PM, Arthur Zakirov wrote:
> Hello,
> 
> On Mon, Mar 19, 2018 at 08:50:46PM +0100, Tomas Vondra wrote:
>> Hi Arthur,
>>
>> I went through the patch - just skimming through the diffs, will do more
>> testing tomorrow. Here are a few initial comments.
> 
> Thank you for the review!
> 
>> 1) max_shared_dictionaries_size / PGC_POSTMASTER
>>
>> I'm not quite sure why the GUC is defined as PGC_POSTMASTER, i.e. why it
>> can't be changed after server start. That seems like a fairly useful
>> thing to do (e.g. increase the limit while the server is running), and
>> after looking at the code I think it shouldn't be difficult to change.
> 
> max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a
> check of the new value to disallow setting it to zero if there are
> loaded dictionaries, and to disallow decreasing it below the currently
> loaded size.
> 

I wonder if these restrictions are needed? I mean, why not allow setting
max_shared_dictionaries_size below the size of the loaded dictionaries?

Of course, on the one hand those restrictions seem sensible. On the other
hand, perhaps in some cases it would be useful to allow violating them?

I mean, why not simply disable loading of new dictionaries when

    (max_shared_dictionaries_size < loaded_size)

Maybe I'm over-thinking this though. It's probably safer and less
surprising to enforce the restrictions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Tue, Mar 20, 2018 at 09:30:15PM +0100, Tomas Vondra wrote:
> On 03/20/2018 02:11 PM, Arthur Zakirov wrote:
> > max_shared_dictionaries_size is defined as PGC_SIGHUP now. I added a
> > check of the new value to disallow setting it to zero if there are
> > loaded dictionaries, and to disallow decreasing it below the currently
> > loaded size.
> > 
> 
> I wonder if these restrictions are needed? I mean, why not allow setting
> max_shared_dictionaries_size below the size of the loaded dictionaries?
> 
> Of course, on the one hand those restrictions seem sensible. On the other
> hand, perhaps in some cases it would be useful to allow violating them?
> 
> I mean, why not simply disable loading of new dictionaries when
> 
>     (max_shared_dictionaries_size < loaded_size)
> 
> Maybe I'm over-thinking this though. It's probably safer and less
> surprising to enforce the restrictions.

Hm, yes, in some cases this check may be over-engineering. I thought it
was reasonable and safer in the v7 patch. But there are similar GUCs,
wal_keep_segments and max_wal_size, which don't do additional checks,
and people are fine with them. So I removed that check from the variable.

Please find the attached new version of the patch.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, Mar 21, 2018 at 12:00:52PM +0300, Arthur Zakirov wrote:
> On Tue, Mar 20, 2018 at 09:30:15PM +0100, Tomas Vondra wrote:
> > I wonder if these restrictions are needed? I mean, why not allow setting
> > max_shared_dictionaries_size below the size of the loaded dictionaries?
> > 
> > Of course, on the one hand those restrictions seem sensible. On the other
> > hand, perhaps in some cases it would be useful to allow violating them?
> > 
> > I mean, why not simply disable loading of new dictionaries when
> > 
> >     (max_shared_dictionaries_size < loaded_size)
> > 
> > Maybe I'm over-thinking this though. It's probably safer and less
> > surprising to enforce the restrictions.
> 
> Hm, yes, in some cases this check may be over-engineering. I thought it
> was reasonable and safer in the v7 patch. But there are similar GUCs,
> wal_keep_segments and max_wal_size, which don't do additional checks,
> and people are fine with them. So I removed that check from the variable.
> 
> Please find the attached new version of the patch.

I forgot to fix the regression tests for max_shared_dictionaries_size.
Also, I'm not confident about using pg_reload_conf() in regression
tests; I haven't found anywhere that pg_reload_conf() is used in tests.
So I removed the max_shared_dictionaries_size tests for now.

Sorry for the noise.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> [ v10 patch versions ]

I took a quick look through this.  I agree with the comments about
mmap-ability not being something we should insist on now, and maybe
not ever.  However, in order to keep our options open, it seems like
we should minimize the amount of API we expose that's based on the
current implementation.  That leads me to the following thoughts:

* I cannot imagine a use-case for setting max_shared_dictionaries_size
to anything except "unlimited".  If it's not that, and you exceed it,
then subsequent backends load private copies of the dictionary, making
your memory situation rapidly worse not better.  I think we should lose
that GUC altogether and just load dictionaries automatically.

* Similarly, I see no point in a "sharable" option on individual
dictionaries, especially when there's only one allowed setting for
most dictionary types.  Let's lose that too.

* And that leads us to not particularly need a view telling which
dictionaries are loaded, either.  It's just an implementation detail
that users don't need to worry about.

This does beg the question of whether we need a way to flush dictionary
contents that's short of restarting the server (or short of dropping and
recreating the dictionary).  I'm not sure, but even if we do, none of
the above is necessary for that.

I do think it's required that changing the dictionary's options with
ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's
happening with this patch, I don't see where.  (It might work to use
the combination of dictionary OID and TID of the dictionary's pg_ts_dict
tuple as the lookup key for shared dictionaries.  Oh, and have you
thought about the possibility of conflicting OIDs in different DBs?
Probably the database OID has to be part of the key, as well.)

Also, the scheme for releasing the dictionary DSM during
RemoveTSDictionaryById is uncertain and full of race conditions:
the DROP might roll back later, or someone might come along and
start using the dictionary (causing a fresh DSM load) before the
DROP commits and makes the dictionary invisible to other sessions.
I don't think that either of those are necessarily fatal objections,
but there needs to be some commentary there explaining what happens.

BTW, I was going to complain that this patch alters the API for
dictionary template init functions without any documentation updates;
but then I realized that there isn't any documentation to update.
That pretty well sucks, but I suppose it's not the job of this patch
to improve that situation.  Still, you could spend a bit more effort on
the commentary in ts_public.h in 0002, because that commentary is as
close to an API spec as we've got.

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 3/24/18 9:56 PM, Tom Lane wrote:
> Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
>> [ v10 patch versions ]
> 
> I took a quick look through this.  I agree with the comments about
> mmap-ability not being something we should insist on now, and maybe
> not ever.  However, in order to keep our options open, it seems like
> we should minimize the amount of API we expose that's based on the
> current implementation.  That leads me to the following thoughts:
> 
> * I cannot imagine a use-case for setting max_shared_dictionaries_size
> to anything except "unlimited".  If it's not that, and you exceed it,
> then subsequent backends load private copies of the dictionary, making
> your memory situation rapidly worse not better.  I think we should lose
> that GUC altogether and just load dictionaries automatically.
> 

Introduction of that limit is likely my fault. It came from an
extension I wrote a long time ago, but back then it was a necessity
because we did not have DSM. So in retrospect I agree with you - it's
not particularly useful and we should ditch it.

Arthur, let this be a lesson for you! You have to start fight against
bogus feature requests from other people ;-)

> * Similarly, I see no point in a "sharable" option on individual
> dictionaries, especially when there's only one allowed setting for
> most dictionary types.  Let's lose that too.
> 

I'm not so sure. Imagine you have a small number of dictionaries that
are used frequently, and then many that are used only once in a while.

A good example is a system handling documents in various languages,
serving "local" customers in a few local languages most of the time, but
then once in a while there's a request in another language.

In that case it makes sense to keep the frequently used ones in shared
memory all the time, and the rest load as needed (and throw it away when
the backend disconnects). Which is exactly what the 'shareable' option
is about ...

> * And that leads us to not particularly need a view telling which
> dictionaries are loaded, either.  It's just an implementation detail
> that users don't need to worry about.
> 

Not so sure about this either. Of course, if we remove the memory limit,
it will be predictable which dictionaries are loaded in shared memory
and which are in the backends.

> This does beg the question of whether we need a way to flush dictionary
> contents that's short of restarting the server (or short of dropping and
> recreating the dictionary).  I'm not sure, but even if we do, none of
> the above is necessary for that.
> 

Ummm, I don't follow. Why would flushing a dictionary be similar to
restarting the server? It's certainly simpler to do a RELOAD on an
existing dictionary than having to do DROP+CREATE. I would not be
surprised if DROP+CREATE was significantly harder process-wise for many
admins (I mean, more approvals).

>
> I do think it's required that changing the dictionary's options with
> ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's
> happening with this patch, I don't see where.  (It might work to use
> the combination of dictionary OID and TID of the dictionary's pg_ts_dict
> tuple as the lookup key for shared dictionaries.  Oh, and have you
> thought about the possibility of conflicting OIDs in different DBs?
> Probably the database OID has to be part of the key, as well.)
> 

Not sure.

> Also, the scheme for releasing the dictionary DSM during
> RemoveTSDictionaryById is uncertain and full of race conditions:
> the DROP might roll back later, or someone might come along and
> start using the dictionary (causing a fresh DSM load) before the
> DROP commits and makes the dictionary invisible to other sessions.
> I don't think that either of those are necessarily fatal objections,
> but there needs to be some commentary there explaining what happens.
> 

Actually, I think that's an issue - such a race condition might easily
leak the shared memory forever (because the new dictionary will get a
different OID etc.). It probably doesn't happen very often, because
dictionaries are not dropped very often. But it needs fixing I think.

> BTW, I was going to complain that this patch alters the API for
> dictionary template init functions without any documentation updates;
> but then I realized that there isn't any documentation to update.
> That pretty well sucks, but I suppose it's not the job of this patch
> to improve that situation.  Still, you could spend a bit more effort on
> the commentary in ts_public.h in 0002, because that commentary is as
> close to an API spec as we've got.
> 

Yeah :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 3/24/18 9:56 PM, Tom Lane wrote:
>> Also, the scheme for releasing the dictionary DSM during
>> RemoveTSDictionaryById is uncertain and full of race conditions:
>> the DROP might roll back later, or someone might come along and
>> start using the dictionary (causing a fresh DSM load) before the
>> DROP commits and makes the dictionary invisible to other sessions.
>> I don't think that either of those are necessarily fatal objections,
>> but there needs to be some commentary there explaining what happens.

> Actually, I think that's an issue - such a race condition might easily
> leak the shared memory forever (because the new dictionary will get a
> different OID etc.). It probably doesn't happen very often, because
> dictionaries are not dropped very often. But it needs fixing I think.

My thought was (a) the ROLLBACK case is ok, because the next use of
the dictionary will reload it, and (b) the reload-concurrently-with-
DROP case is annoying, because indeed it leaks, but the window is small
and it probably won't be an issue in practice.  We would need to be
sure that the DSM segment goes away at postmaster restart, but given
that I think it'd be tolerable.  Of course it'd be better not to have
the race, but I see no easy way to prevent it -- do you?

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 03/25/2018 06:18 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 3/24/18 9:56 PM, Tom Lane wrote:
>>> Also, the scheme for releasing the dictionary DSM during
>>> RemoveTSDictionaryById is uncertain and full of race conditions:
>>> the DROP might roll back later, or someone might come along and
>>> start using the dictionary (causing a fresh DSM load) before the
>>> DROP commits and makes the dictionary invisible to other sessions.
>>> I don't think that either of those are necessarily fatal objections,
>>> but there needs to be some commentary there explaining what happens.
> 
>> Actually, I think that's an issue - such a race condition might easily
>> leak the shared memory forever (because the new dictionary will get a
>> different OID etc.). It probably doesn't happen very often, because
>> dictionaries are not dropped very often. But it needs fixing I think.
> 
> My thought was (a) the ROLLBACK case is ok, because the next use of
> the dictionary will reload it, and (b) the reload-concurrently-with-
> DROP case is annoying, because indeed it leaks, but the window is small
> and it probably won't be an issue in practice.  We would need to be
> sure that the DSM segment goes away at postmaster restart, but given
> that I think it'd be tolerable.  Of course it'd be better not to have
> the race, but I see no easy way to prevent it -- do you?
> 

Unfortunately no :( For a moment I thought that perhaps we could make it
the responsibility of the last user of the dictionary - set a flag in the
shared memory, and the last user would remove it. But the trouble is how
to decide who's the last one? Even with some simple reference counting,
we probably don't know if that's really the last one.

FWIW this is where the view listing dictionaries loaded into shared
memory would be helpful - you'd at least know there's a dictionary,
wasting memory.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> FWIW this is where the view listing dictionaries loaded into shared
> memory would be helpful - you'd at least know there's a dictionary,
> wasting memory.

Well, that's only because we failed to make the implementation transparent
:-(.  But it's not unlikely that an mmap-based implementation would be
simply incapable of supporting such a view: the knowledge of whether a
particular file is mapped in would be pretty much process-local, I think.
So I'd really rather we don't add that.

Also, while these dictionaries are indeed kind of large relative to our
traditional view of shared memory, if they're in DSM segments that the
kernel can swap out then I really suspect that nobody would much care
if a few such segments had been leaked.  I find it hard to imagine a
use-case where DROP race conditions would lead us to leak so many that
it becomes a serious problem.  Maybe I lack imagination.

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote:
> Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> > [ v10 patch versions ]

Thank you for the review, Tom!

Tomas Vondra wrote:
> Tom Lane wrote:
>> * I cannot imagine a use-case for setting max_shared_dictionaries_size
>> to anything except "unlimited".  If it's not that, and you exceed it,
>> then subsequent backends load private copies of the dictionary, making
>> your memory situation rapidly worse not better.  I think we should lose
>> that GUC altogether and just load dictionaries automatically.
>
> Introduction of that limit is likely my fault. It came from an
> extension I wrote a long time ago, but back then it was a necessity
> because we did not have DSM. So in retrospect I agree with you - it's
> not particularly useful and we should ditch it.
> 
> Arthur, let this be a lesson for you! You have to start fight against
> bogus feature requests from other people ;-)

Yeah, in this sense max_shared_dictionaries_size is pointless. I'll
remove it then :).

> * Similarly, I see no point in a "sharable" option on individual
> dictionaries, especially when there's only one allowed setting for
> most dictionary types.  Let's lose that too.

I think "Shareable" option could be useful if a shared dictionary
building time was much longer than a non-shared dictionary building
time. It is slightly longer because of additional memcpy(), but isn't
noticable I think. So it is worth to remove it.

> * And that leads us to not particularly need a view telling which
> dictionaries are loaded, either.  It's just an implementation detail
> that users don't need to worry about.

If all dictionaries are shareable then this view could be removed.
Unfortunately I don't think it can help with leaked segments; I didn't
find a way to iterate over dshash entries. That's why
pg_ts_shared_dictionaries() scans the pg_ts_dict table instead of the
dshash table.

> I do think it's required that changing the dictionary's options with
> ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's
> happening with this patch, I don't see where.  (It might work to use
> the combination of dictionary OID and TID of the dictionary's pg_ts_dict
> tuple as the lookup key for shared dictionaries.  Oh, and have you
> thought about the possibility of conflicting OIDs in different DBs?
> Probably the database OID has to be part of the key, as well.)

Yes, unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a
dictionary. TID can help here. I thought about using XID too when I
started to work on the RELOAD command, but I'm not sure it is a good
idea; anyway, XID isn't needed in the current version.

> Also, the scheme for releasing the dictionary DSM during
> RemoveTSDictionaryById is uncertain and full of race conditions:
> the DROP might roll back later, or someone might come along and
> start using the dictionary (causing a fresh DSM load) before the
> DROP commits and makes the dictionary invisible to other sessions.
> I don't think that either of those are necessarily fatal objections,
> but there needs to be some commentary there explaining what happens.

I missed this case. As you wrote below, the ROLLBACK case is OK. But I
don't have a solution for the second case for now. If I can't solve it
I'll add additional comments in RemoveTSDictionaryById() and maybe in
the documentation if it's appropriate.

> BTW, I was going to complain that this patch alters the API for
> dictionary template init functions without any documentation updates;
> but then I realized that there isn't any documentation to update.
> That pretty well sucks, but I suppose it's not the job of this patch
> to improve that situation.  Still, you could spend a bit more effort on
> the commentary in ts_public.h in 0002, because that commentary is as
> close to an API spec as we've got.

I'll fix the comments.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sun, Mar 25, 2018 at 06:45:08AM +0200, Tomas Vondra wrote:
> FWIW this is where the view listing dictionaries loaded into shared
> memory would be helpful - you'd at least know there's a dictionary,
> wasting memory.

Unfortunately, it seems that this view can't help in listing leaked
segments. I didn't find a way to list dshash entries. Now
pg_ts_shared_dictionaries() scans the pg_ts_dict table and gets a dshash
item using dictId. In the case of leaked dictionaries we don't know
their identifiers.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote:
>> * And that leads us to not particularly need a view telling which
>> dictionaries are loaded, either.  It's just an implementation detail
>> that users don't need to worry about.

> If all dictionaries are shareable then this view could be removed.
> Unfortunately I don't think it can help with leaked segments; I didn't
> find a way to iterate over dshash entries. That's why
> pg_ts_shared_dictionaries() scans the pg_ts_dict table instead of the
> dshash table.

If you're scanning pg_ts_dict, what happens with dictionaries belonging
to other databases?  They won't be visible in your local copy of
pg_ts_dict.  Between that and the inability to find leaked segments,
I'm not seeing that this has much use-case.

>> (It might work to use
>> the combination of dictionary OID and TID of the dictionary's pg_ts_dict
>> tuple as the lookup key for shared dictionaries.  Oh, and have you
>> thought about the possibility of conflicting OIDs in different DBs?
>> Probably the database OID has to be part of the key, as well.)

> Yes, unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a
> dictionary. TID can help here. I thought about using XID too when I
> started to work on the RELOAD command, but I'm not sure it is a good
> idea; anyway, XID isn't needed in the current version.

Actually, existing practice is to check both xmin and tid; see for example
where plpgsql checks if a cached function data structure still matches the
pg_proc row, pl_comp.c around line 175 in HEAD.  The other PLs do it
similarly I think.  I'm not sure offhand just how much that changes the
risks of a false match compared to testing only one of these fields, but
I'd recommend conforming to the way it's done elsewhere.
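Transposed to this patch, the check might look like this (a sketch; "entry"
stands for the shared-dictionary hash entry and its field names are assumed):

HeapTuple   tup = SearchSysCache1(TSDICTOID, ObjectIdGetDatum(dictId));
bool        uptodate;

if (!HeapTupleIsValid(tup))
    elog(ERROR, "cache lookup failed for text search dictionary %u", dictId);

uptodate = entry->xmin == HeapTupleHeaderGetRawXmin(tup->t_data) &&
           ItemPointerEquals(&entry->tid, &tup->t_self);
ReleaseSysCache(tup);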

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sun, Mar 25, 2018 at 12:18:10AM -0400, Tom Lane wrote:
> My thought was (a) the ROLLBACK case is ok, because the next use of
> the dictionary will reload it, and (b) the reload-concurrently-with-
> DROP case is annoying, because indeed it leaks, but the window is small
> and it probably won't be an issue in practice.  We would need to be
> sure that the DSM segment goes away at postmaster restart, but given
> that I think it'd be tolerable.  Of course it'd be better not to have
> the race, but I see no easy way to prevent it -- do you?

I'm not sure that I understood the second case correctly. Can cache
invalidation help in this case? I don't have confident knowledge of cache
invalidation. It seems to me that InvalidateTSCacheCallBack() should
release the segment after commit.

But the cache isn't invalidated if a backend is terminated after a
dictionary reload. on_shmem_exit() could help, but we would need a list
of leaked dictionaries for that.

P.S. I think it isn't right to release all dictionary segments in
InvalidateTSCacheCallBack(); otherwise any DROP could release all
segments. It would be better to release only the specific dictionary.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Sun, Mar 25, 2018 at 02:28:29PM -0400, Tom Lane wrote:
> Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> > If all dictionaries are shareable then this view could be removed.
> > Unfortunately I don't think it can help with leaked segments; I didn't
> > find a way to iterate over dshash entries. That's why
> > pg_ts_shared_dictionaries() scans the pg_ts_dict table instead of the
> > dshash table.
> 
> If you're scanning pg_ts_dict, what happens with dictionaries belonging
> to other databases?  They won't be visible in your local copy of
> pg_ts_dict.  Between that and the inability to find leaked segments,
> I'm not seeing that this has much use-case.

Indeed, scanning pg_ts_dict is the wrong way here, and
pg_ts_shared_dictionaries() is definitely broken.

> > Yes, unfortunately ALTER TEXT SEARCH DICTIONARY doesn't reload a
> > dictionary. TID can help here. I thought about using XID too when I
> > started to work on the RELOAD command, but I'm not sure it is a good
> > idea; anyway, XID isn't needed in the current version.
> 
> Actually, existing practice is to check both xmin and tid; see for example
> where plpgsql checks if a cached function data structure still matches the
> pg_proc row, pl_comp.c around line 175 in HEAD.  The other PLs do it
> similarly I think.  I'm not sure offhand just how much that changes the
> risks of a false match compared to testing only one of these fields, but
> I'd recommend conforming to the way it's done elsewhere.

Thank you for pointing that out! I think it shouldn't be hard to use
both xmin and tid.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> On Sun, Mar 25, 2018 at 12:18:10AM -0400, Tom Lane wrote:
>> My thought was (a) the ROLLBACK case is ok, because the next use of
>> the dictionary will reload it, and (b) the reload-concurrently-with-
>> DROP case is annoying, because indeed it leaks, but the window is small
>> and it probably won't be an issue in practice.  We would need to be
>> sure that the DSM segment goes away at postmaster restart, but given
>> that I think it'd be tolerable.  Of course it'd be better not to have
>> the race, but I see no easy way to prevent it -- do you?

> I'm not sure that I understood the second case correctly. Can cache
> invalidation help in this case? I don't have confident knowledge of cache
> invalidation. It seems to me that InvalidateTSCacheCallBack() should
> release the segment after commit.

"Release after commit" sounds like a pretty dangerous design to me,
because a release necessarily implies some kernel calls, which could
fail.  We can't afford to inject steps that might fail into post-commit
cleanup (because it's too late to recover by failing the transaction).
It'd be better to do cleanup while searching for a dictionary to use.

I assume the DSM infrastructure already has some solution for getting
rid of DSM segments when the last interested process disconnects,
so maybe you could piggyback on that somehow.

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Mon, Mar 26, 2018 at 11:27:48AM -0400, Tom Lane wrote:
> Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> > I'm not sure that I understood the second case correctly. Can cache
> > invalidation help in this case? I don't have confident knowledge of cache
> > invalidation. It seems to me that InvalidateTSCacheCallBack() should
> > release the segment after commit.
> 
> "Release after commit" sounds like a pretty dangerous design to me,
> because a release necessarily implies some kernel calls, which could
> fail.  We can't afford to inject steps that might fail into post-commit
> cleanup (because it's too late to recover by failing the transaction).
> It'd be better to do cleanup while searching for a dictionary to use.
> 
> I assume the DSM infrastructure already has some solution for getting
> rid of DSM segments when the last interested process disconnects,
> so maybe you could piggyback on that somehow.

Yes, there is dsm_pin_mapping() for this. But it is also necessary to
keep a segment alive even if there are no attached processes. From 0003:

+    /* Remain attached until end of postmaster */
+    dsm_pin_segment(seg);
+    /* Remain attached until end of session */
+    dsm_pin_mapping(seg);

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Please find the attached new version of the patch.

I got rid of the 0005 and 0006 parts. There is no
max_shared_dictionaries_size variable, Shareable option, or
pg_ts_shared_dictionaries view anymore.

On Sat, Mar 24, 2018 at 04:56:36PM -0400, Tom Lane wrote:
> I do think it's required that changing the dictionary's options with
> ALTER TEXT SEARCH DICTIONARY automatically cause a reload; but if that's
> happening with this patch, I don't see where.  (It might work to use
> the combination of dictionary OID and TID of the dictionary's pg_ts_dict
> tuple as the lookup key for shared dictionaries.  Oh, and have you
> thought about the possibility of conflicting OIDs in different DBs?
> Probably the database OID has to be part of the key, as well.)

The database OID, the dictionary OID, TID and XMIN are now used as the
lookup key.
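That is, the key looks roughly like this (the exact field names in this
sketch are assumed):

typedef struct DictPointerData
{
    Oid             db_id;      /* database OID */
    Oid             dict_id;    /* text search dictionary OID */
    ItemPointerData tid;        /* TID of the dictionary's pg_ts_dict tuple */
    TransactionId   xmin;       /* xmin of the pg_ts_dict tuple */
} DictPointerData;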

> Also, the scheme for releasing the dictionary DSM during
> RemoveTSDictionaryById is uncertain and full of race conditions:
> the DROP might roll back later, or someone might come along and
> start using the dictionary (causing a fresh DSM load) before the
> DROP commits and makes the dictionary invisible to other sessions.
> I don't think that either of those are necessarily fatal objections,
> but there needs to be some commentary there explaining what happens.

The dictionary's DSM segment now stays alive until the postmaster
terminates. But when the dictionary is dropped or altered, the previous
(now invalid) segment is unpinned. The segment itself is released when
all backends unpin the mapping in lookup_ts_parser_cache() or disconnect.

The problem arises when some process used the dictionary before it was
dropped or altered, doesn't use it afterwards, and lives for a very long
time. In this situation the mapping isn't unpinned and the segment isn't
released. The other problem is that the TsearchDictEntry isn't removed
if ts_dict_shmem_release() was never called, which may happen after
dropping the dictionary.

> BTW, I was going to complain that this patch alters the API for
> dictionary template init functions without any documentation updates;
> but then I realized that there isn't any documentation to update.
> That pretty well sucks, but I suppose it's not the job of this patch
> to improve that situation.  Still, you could spend a bit more effort on
> the commentary in ts_public.h in 0002, because that commentary is as
> close to an API spec as we've got.

I improved the commentary in ts_public.h a little bit.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Here is the new version of the patch.

Now RemoveTSDictionaryById() and AlterTSDictionary() unpin the
dictionary's DSM segment. So once all attached backends disconnect, the
allocated DSM segments will be released.

lookup_ts_dictionary_cache() may unpin the DSM mappings of all invalid
dictionary cache entries.

I added xmax to DictPointerData; it is now part of the lookup key too. It
helps reload a dictionary after a DROP command is rolled back.

There was a bug in ts_dict_shmem_location(); I fixed it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello all,

I'd like to add a new optional function named fini to the text search template, in addition to init() and lexize(). It will be called by RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will call ts_dict_shmem_release().

It doesn't change the segment-leaking situation, but I think it makes the text search API more transparent.
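
For example, a rough sketch of dispell_fini() (the exact signature is an open question; I assume here, purely for illustration, that it receives the dictionary's DictPointerData as its "internal" argument):

Datum
dispell_fini(PG_FUNCTION_ARGS)
{
    DictPointerData *dict = (DictPointerData *) PG_GETARG_POINTER(0);

    /* unpin the shared mapping/segment acquired by dispell_init() */
    ts_dict_shmem_release(dict);

    PG_RETURN_VOID();
}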

I'll update the existing documentation. And I think I can add text search API documentation in the 2018-09 commitfest, as Tom noticed that it doesn't exist.

Any thoughts?


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 03/31/2018 12:42 PM, Arthur Zakirov wrote:
> Hello all,
> 
> I'd like to add a new optional function named fini to the text search
> template, in addition to init() and lexize(). It will be called by
> RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will
> call ts_dict_shmem_release().
> 
> It doesn't change the segment-leaking situation, but I think it makes
> the text search API more transparent.
> 

If it doesn't actually solve the problem, why add it? I don't see a
point in adding functions for the sake of transparency, when it does not
in fact serve any use cases.

Can't we handle the segment-leaking by adding some sort of tombstone?

For example, imagine that instead of removing the hash table entry we
mark it as 'dropped'. And after that, after the lookup we would know the
dictionary was removed, and the backends would load the dictionary into
their private memory.

Of course, this could mean we end up having many tombstones in the hash
table. But those tombstones would be tiny, making it less painful than
potentially leaking much more memory for the dictionaries.
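
Roughly something like this (a sketch; the entry layout is only
illustrative):

typedef struct TsearchDictEntry
{
    TsearchDictKey key;        /* hash key */
    dsm_handle     handle;     /* DSM segment with the dictionary */
    bool           dropped;    /* tombstone: the dictionary was dropped */
} TsearchDictEntry;

/* on lookup */
if (entry->dropped)
    load_dictionary_into_private_memory();   /* hypothetical fallback */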

Also, I wonder if we might actually remove the dictionaries after a
while, e.g. based on XID. Imagine that we note the XID of the
transaction removing the dictionary, or perhaps XID of the most recent
running transaction. Then we could use this to decide if all running
transactions actually see the DROP, and we could remove the tombstone.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Tomas Vondra wrote:
> On 03/31/2018 12:42 PM, Arthur Zakirov wrote:
> > Hello all,
> >
> > I'd like to add a new optional function named fini to the text search
> > template, in addition to init() and lexize(). It will be called by
> > RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will
> > call ts_dict_shmem_release().
> >
> > It doesn't change the segment-leaking situation, but I think it makes
> > the text search API more transparent.
> >

> If it doesn't actually solve the problem, why add it? I don't see a
> point in adding functions for the sake of transparency, when it does not
> in fact serve any use cases.

It doesn't solve the problem. But it brings more clarity: if a dictionary requested a shared location, then it should release/unpin it. There is no such scenario yet, but someone might want to release not only the shared segment but also other private data.

> Can't we handle the segment-leaking by adding some sort of tombstone?

It is interesting that such tombstones already exist, even without the patch. TSDictionaryCacheEntry entries aren't deleted after DROP; they are just marked isvalid = false.
 
> For example, imagine that instead of removing the hash table entry we
> mark it as 'dropped'. And after that, after the lookup we would know the
> dictionary was removed, and the backends would load the dictionary into
> their private memory.
>
> Of course, this could mean we end up having many tombstones in the hash
> table. But those tombstones would be tiny, making it less painful than
> potentially leaking much more memory for the dictionaries.

Actually, it isn't guaranteed that the hash table entry will be removed, even if refcnt is 0. So I think I should remove refcnt, and entries won't be removed.

There are no big leaking problems now. Memory may leak only if a dictionary was dropped or altered, there is no text search workload anymore, and the backend stays alive. The next use of a text search function will unpin the segments previously used for invalid dictionaries (isvalid == false), and a segment is also unpinned when the backend terminates. The segment is destroyed when all interested processes unpin it (as Tom noticed), and the hash table entry becomes a tombstone.

I hope I described it clearly.
 
> Also, I wonder if we might actually remove the dictionaries after a
> while, e.g. based on XID. Imagine that we note the XID of the
> transaction removing the dictionary, or perhaps XID of the most recent
> running transaction. Then we could use this to decide if all running
> transactions actually see the DROP, and we could remove the tombstone.

Maybe autovacuum should work here too :) That's a joke, of course. I'm not very familiar with the removal of dead tuples, but I think this is a similar case.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, Mar 29, 2018 at 02:03:07AM +0300, Arthur Zakirov wrote:
> Here is the new version of the patch.

Please find the attached new version of the patch.

I removed refcnt because it is useless: it doesn't guarantee that a hash
table entry will be removed.

I fixed a bug: dsm_unpin_segment() could be called twice if the
transaction which called it was aborted and another transaction then
calls ts_dict_shmem_release(). I added segment_ispinned to fix it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Tue, Mar 27, 2018 at 8:19 AM, Arthur Zakirov
<a.zakirov@postgrespro.ru> wrote:
>> I assume the DSM infrastructure already has some solution for getting
>> rid of DSM segments when the last interested process disconnects,
>> so maybe you could piggyback on that somehow.
>
> Yes, there is dsm_pin_mapping() for this. But it is necessary to keep a
> segment even if there are no attached processes. From 0003:
>
> +       /* Remain attached until end of postmaster */
> +       dsm_pin_segment(seg);
> +       /* Remain attached until end of session */
> +       dsm_pin_mapping(seg);

I don't quite understand the problem you're trying to solve here, but:

1. Unless dsm_pin_segment() is called, a DSM segment will
automatically be removed when there are no remaining mappings.

2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped
when the currently-in-scope resource owner is cleaned up, like at the
end of the query.  If it is called, then the mapping will stick around
until the backend exits.

If you pin the mapping or the segment and later no longer want it
pinned, there are dsm_unpin_mapping() and dsm_unpin_segment()
functions available, too.  So it seems like what you might want to do
is pin the segment when it's created, and then unpin it if it's
stale/obsolete.  The latter won't remove it immediately, but will once
all the mappings are gone.
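
In code, the pattern would be something like this (error handling
omitted):

dsm_segment *seg = dsm_create(segsize, 0);

dsm_pin_segment(seg);       /* survives even with no mappings */
...
/* later, when the dictionary becomes stale/obsolete */
dsm_unpin_segment(dsm_segment_handle(seg));
/* the segment is destroyed once the last mapping goes away */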

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

On Tue, May 15, 2018 at 05:02:43PM -0400, Robert Haas wrote:
> On Tue, Mar 27, 2018 at 8:19 AM, Arthur Zakirov
> <a.zakirov@postgrespro.ru> wrote:
> > Yes, there is dsm_pin_mapping() for this. But it is necessary to keep a
> > segment even if there are no attached processes. From 0003:
> >
> > +       /* Remain attached until end of postmaster */
> > +       dsm_pin_segment(seg);
> > +       /* Remain attached until end of session */
> > +       dsm_pin_mapping(seg);
> 
> I don't quite understand the problem you're trying to solve here, but:
> 
> 1. Unless dsm_pin_segment() is called, a DSM segment will
> automatically be removed when there are no remaining mappings.
> 
> 2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped
> when the currently-in-scope resource owner is cleaned up, like at the
> end of the query.  If it is called, then the mapping will stick around
> until the backend exits.

I tried to solve the case where a DSM segment remains mapped even though
the dictionary was dropped. It may happen in the following situation:

Backend 1:

=# select ts_lexize('english_shared', 'test');
-- The dictionary is loaded into DSM, the segment and the mapping are
pinned
...
-- Call ts_lexize() from backend 2 below
=# drop text search dictionary english_shared;
-- The segment and the mapping are unpinned, see ts_dict_shmem_release()

Backend 2:

=# select ts_lexize('english_shared', 'test');
-- The dictionary is fetched from DSM, the mapping is pinned
...
-- The dictionary was dropped by backend 1, but the mapping is still
pinned

As you can see, the DSM is still pinned by backend 2. Later I fixed it by
checking whether we need to unpin segments. In the current version of the
patch, do_ts_dict_shmem_release() is called in
lookup_ts_dictionary_cache(). It unpins segments if the text search cache
was invalidated. It unpins all segments, but I think that is OK since
text search changes should be infrequent.
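
Roughly, the cleanup is like this (a simplified sketch, not the exact
patch code):

HASH_SEQ_STATUS status;
TSDictionaryCacheEntry *entry;

hash_seq_init(&status, TSDictionaryCacheHash);
while ((entry = (TSDictionaryCacheEntry *) hash_seq_search(&status)) != NULL)
{
    if (!entry->isvalid)
        do_ts_dict_shmem_release(entry);   /* unpin mapping and segment */
}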

> If you pin the mapping or the segment and later no longer want it
> pinned, there are dsm_unpin_mapping() and dsm_unpin_segment()
> functions available, too.  So it seems like what you might want to do
> is pin the segment when it's created, and then unpin it if it's
> stale/obsolete.  The latter won't remove it immediately, but will once
> all the mappings are gone.

Yes, dsm_unpin_mapping() and dsm_unpin_segment() will be called when the
dictionary is dropped or altered in the current version of the patch. I
described the approach above.


In sum, I think the problem is mostly solved. Backend 2 unpins the
segment in the next ts_lexize() call. But if backend 2 never calls
ts_lexize() (or another TS function) again, the segment will remain mapped.
That is the only problem I see for now.

I hope the description is clear. I attached the rebased patch. 

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Wed, May 16, 2018 at 7:36 AM, Arthur Zakirov
<a.zakirov@postgrespro.ru> wrote:
>> I don't quite understand the problem you're trying to solve here, but:
>>
>> 1. Unless dsm_pin_segment() is called, a DSM segment will
>> automatically be removed when there are no remaining mappings.
>>
>> 2. Unless dsm_pin_mapping() is called, a DSM segment will be unmapped
>> when the currently-in-scope resource owner is cleaned up, like at the
>> end of the query.  If it is called, then the mapping will stick around
>> until the backend exits.
>
> I tried to solve the case where a DSM segment remains mapped even though
> the dictionary was dropped. It may happen in the following situation:
>
> Backend 1:
>
> =# select ts_lexize('english_shared', 'test');
> -- The dictionary is loaded into DSM, the segment and the mapping are
> pinned
> ...
> -- Call ts_lexize() from backend 2 below
> =# drop text search dictionary english_shared;
> -- The segment and the mapping are unpinned, see ts_dict_shmem_release()
>
> Backend 2:
>
> =# select ts_lexize('english_shared', 'test');
> -- The dictionary is fetched from DSM, the mapping is pinned
> ...
> -- The dictionary was dropped by backend 1, but the mapping is still
> pinned

Yeah, there's really nothing we can do about that (except switch from
processes to threads).  There's no way for one process to force
another process to unmap something.  As you've observed, you can get
it to be dropped eventually, but not immediately.

> In sum, I think the problem is mostly solved. Backend 2 unpins the
> segment in the next ts_lexize() call. But if backend 2 never calls
> ts_lexize() (or another TS function) again, the segment will remain mapped.
> That is the only problem I see for now.

Maybe you could use CacheRegisterSyscacheCallback to get a callback
when the backend notices that a DROP has occurred.
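
That is, something like:

/* called whenever a pg_ts_dict syscache invalidation arrives */
static void
dict_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
{
    /* mark cached mappings stale here; do the actual unmap later,
     * at a safe point */
}

/* during backend initialization */
CacheRegisterSyscacheCallback(TSDICTOID, dict_inval_callback, (Datum) 0);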

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, May 16, 2018 at 09:33:46AM -0400, Robert Haas wrote:
> > In sum, I think the problem is mostly solved. Backend 2 unpins the
> > segment in the next ts_lexize() call. But if backend 2 never calls
> > ts_lexize() (or another TS function) again, the segment will remain mapped.
> > That is the only problem I see for now.
> 
> Maybe you could use CacheRegisterSyscacheCallback to get a callback
> when the backend notices that a DROP has occurred.

Yes, that was the first approach. In that approach DSM segments were
unpinned in InvalidateTSCacheCallBack(), which is registered using
CacheRegisterSyscacheCallback().

I don't have deep knowledge of the guts of invalidation callbacks. It
seems there is a problem there. Tom pointed it out above:

> > I'm not sure that I understood the second case correctly. Can cache
> > invalidation help in this case? I don't have confident knowledge of cache
> > invalidation. It seems to me that InvalidateTSCacheCallBack() should
> > release segment after commit.
>
> "Release after commit" sounds like a pretty dangerous design to me,
> because a release necessarily implies some kernel calls, which could
> fail. We can't afford to inject steps that might fail into post-commit
> cleanup (because it's too late to recover by failing the transaction).
> It'd be better to do cleanup while searching for a dictionary to use.

But it is possible that I misunderstood his note.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Wed, May 16, 2018 at 4:42 PM, Arthur Zakirov
<a.zakirov@postgrespro.ru> wrote:
> I don't have deep knowledge of the guts of invalidation callbacks. It
> seems there is a problem there. Tom pointed it out above:
>
>> > I'm not sure that I understood the second case correctly. Can cache
>> > invalidation help in this case? I don't have confident knowledge of cache
>> > invalidation. It seems to me that InvalidateTSCacheCallBack() should
>> > release segment after commit.
>>
>> "Release after commit" sounds like a pretty dangerous design to me,
>> because a release necessarily implies some kernel calls, which could
>> fail. We can't afford to inject steps that might fail into post-commit
>> cleanup (because it's too late to recover by failing the transaction).
>> It'd be better to do cleanup while searching for a dictionary to use.
>
> But it is possible that I misunderstood his note.

I think you and Tom have misunderstood each other somehow.  If you
look at CommitTransaction(), you will see a comment that says:

         * This is all post-commit cleanup.  Note that if an error is raised here,
         * it's too late to abort the transaction.  This should be just
         * noncritical resource releasing.

Between that point and the end of that function, we shouldn't do
anything that throws an error, because the transaction is already
committed and it's too late to change our mind.  But if session A
drops an object, session B is not going to get a callback to
InvalidateTSCacheCallBack at that point.  It's going to happen
sometime in the middle of the transaction, like when it next tries to
lock a relation or something.  So Tom's complaint is irrelevant in
that scenario.

Also, there is no absolute prohibition on kernel calls in post-commit
cleanup, or in no-fail code in general.  For example, the
RESOURCE_RELEASE_AFTER_LOCKS phase of resowner cleanup calls
FileClose().  That's actually completely alarming when you really
think about it, because one of the documented return values for
close() is EIO, which certainly represents a very dangerous kind of
failure -- see nearby threads about fsync-safety.  Transaction abort
acquires and releases numerous LWLocks, which can result in kernel
calls that could fail.  We're OK with that because, in practice, it
never happens.  Unmapping a DSM segment is probably about as safe as
acquiring and releasing an LWLock, maybe safer.  On my MacBook, the
only documented return value for munmap is EINVAL, and any such error
would indicate a PostgreSQL bug (or a kernel bug, or a cosmic ray
hit).  I checked a Linux system; things there are less clear, because
mmap and munmap share a single man page, and mmap can fail for all
kinds of reasons.  But very few of the listed error codes look like
things that could legitimately happen during munmap.  Also, if munmap
did fail (or shmdt/shmctl if using System V shared memory), it would
be reported as a WARNING, not an ERROR, so we'd still be sorta OK.

I think the only real question here is whether it's safe, at a high
level, to drop the object at time T0 and have various backends drop
the mapping at unpredictable later times T1, T2, ... all greater than
T0.  Generally, one wants to remove all references to an object before
the object itself, which in this case we can't.  Assuming that we can
convince ourselves that that much is OK, I don't see why using a
syscache callback to help ensure that the mappings are blown away in
an at-least-somewhat-timely fashion is worse than any other approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... Assuming that we can
> convince ourselves that that much is OK, I don't see why using a
> syscache callback to help ensure that the mappings are blown away in
> an at-least-somewhat-timely fashion is worse than any other approach.

I think the point you've not addressed is that "syscache callback
occurred" does not equate to "object was dropped".  Can the code
survive having this occur at any invalidation point?
(CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.)

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, May 17, 2018 at 09:57:59AM -0400, Robert Haas wrote:
> I think you and Tom have misunderstood each other somehow.  If you
> look at CommitTransaction(), you will see a comment that says:

Oh, now I understand. You are right.

> Also, there is no absolute prohibition on kernel calls in post-commit
> cleanup, or in no-fail code in general.

Thank you for the explanation!

The current approach depends on syscache callbacks anyway. Backend 2
(from the example above) learns whether it is necessary to unpin segments
after a syscache callback was called. Tom pointed out below that
callbacks occur on various events. So I think I should check the current
approach using CLOBBER_CACHE_ALWAYS too. It could show some problems in
the current patch.

Then, if everything is OK, I think I'll check the other approach (unmapping
in the TS syscache callback) using CLOBBER_CACHE_ALWAYS.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... Assuming that we can
>> convince ourselves that that much is OK, I don't see why using a
>> syscache callback to help ensure that the mappings are blown away in
>> an at-least-somewhat-timely fashion is worse than any other approach.
>
> I think the point you've not addressed is that "syscache callback
> occurred" does not equate to "object was dropped".  Can the code
> survive having this occur at any invalidation point?
> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.)

Well, I'm not advocating for a lack of testing, and
CLOBBER_CACHE_ALWAYS testing is a good idea.  However, I suspect that
calling dsm_detach() from a syscache callback should be fine.
Obviously there will be trouble if the surrounding code is still using
that mapping, but that would be a bug at some higher level, like using
an object without locking it.  And there will be trouble if you
register an on_dsm_detach callback that does something strange, but
the ones that the core code installs (when you use shm_mq, for
example) should be safe.  And there will be trouble if you're not
careful about memory contexts, because someplace you probably need to
remember that you detached from that DSM so you don't try to do it
again, and you'd better be sure you have the right context selected
when updating your data structures.  But it all seems pretty solvable.
I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the point you've not addressed is that "syscache callback
>> occurred" does not equate to "object was dropped".  Can the code
>> survive having this occur at any invalidation point?
>> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.)

> Well, I'm not advocating for a lack of testing, and
> CLOBBER_CACHE_ALWAYS testing is a good idea.  However, I suspect that
> calling dsm_detach() from a syscache callback should be fine.
> Obviously there will be trouble if the surrounding code is still using
> that mapping, but that would be a bug at some higher level, like using
> an object without locking it.

No, you're clearly not getting the point.  You could have an absolutely
airtight exclusive lock of any description whatsoever, and that would
provide no guarantee at all that you don't get a cache flush callback.
It's only a cache, not a catalog, and it can get flushed for any reason
or no reason.  (That's why we have pin counts on catcache and relcache
entries, rather than assuming that locking the corresponding object is
enough.)  So I think it's highly likely that unmapping in a syscache
callback is going to lead quickly to SIGSEGV.  The only way it would not
is if we keep the shared dictionary mapped only in short straight-line
code segments that never do any other catalog accesses ... which seems
awkward, inefficient, and error-prone.

Do we actually need to worry about unmapping promptly on DROP TEXT
DICTIONARY?  It seems like the only downside of not doing that is that
we'd leak some address space until process exit.  If you were thrashing
dictionaries at some unreasonable rate on a 32-bit host, you might
eventually run some sessions out of address space; but that doesn't seem
like a situation that's so common that we need fragile coding to avoid it.

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Thu, May 17, 2018 at 1:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, May 17, 2018 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think the point you've not addressed is that "syscache callback
>>> occurred" does not equate to "object was dropped".  Can the code
>>> survive having this occur at any invalidation point?
>>> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.)
>
>> Well, I'm not advocating for a lack of testing, and
>> CLOBBER_CACHE_ALWAYS testing is a good idea.  However, I suspect that
>> calling dsm_detach() from a syscache callback should be fine.
>> Obviously there will be trouble if the surrounding code is still using
>> that mapping, but that would be a bug at some higher level, like using
>> an object without locking it.
>
> No, you're clearly not getting the point.  You could have an absolutely
> airtight exclusive lock of any description whatsoever, and that would
> provide no guarantee at all that you don't get a cache flush callback.
> It's only a cache, not a catalog, and it can get flushed for any reason
> or no reason.  (That's why we have pin counts on catcache and relcache
> entries, rather than assuming that locking the corresponding object is
> enough.)  So I think it's highly likely that unmapping in a syscache
> callback is going to lead quickly to SIGSEGV.  The only way it would not
> is if we keep the shared dictionary mapped only in short straight-line
> code segments that never do any other catalog accesses ... which seems
> awkward, inefficient, and error-prone.

Yeah, that's true, but again, you can work around that problem.  A DSM
mapping is fundamentally not that different from a backend-private
memory allocation.  If you can avoid freeing memory while you're
referencing it -- as the catcache and the syscache clearly do -- you
can avoid it here, too.

> Do we actually need to worry about unmapping promptly on DROP TEXT
> DICTIONARY?  It seems like the only downside of not doing that is that
> we'd leak some address space until process exit.  If you were thrashing
> dictionaries at some unreasonable rate on a 32-bit host, you might
> eventually run some sessions out of address space; but that doesn't seem
> like a situation that's so common that we need fragile coding to avoid it.

I'm not sure what the situation is here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, May 17, 2018 at 10:18:56AM -0400, Tom Lane wrote:
> I think the point you've not addressed is that "syscache callback
> occurred" does not equate to "object was dropped".  Can the code
> survive having this occur at any invalidation point?
> (CLOBBER_CACHE_ALWAYS testing would soon expose any fallacy there.)

Thank you for the idea of testing with CLOBBER_CACHE_ALWAYS. I built
postgres with it and ran the regression tests. I tested both approaches;
at first glance, they both pass the tests.

There are no concurrency tests for the text search feature with two or
more connections. Maybe it would be useful to add such tests. I did the
testing manually, but it would be better to have a script.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, May 17, 2018 at 02:14:07PM -0400, Robert Haas wrote:
> On Thu, May 17, 2018 at 1:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do we actually need to worry about unmapping promptly on DROP TEXT
> > DICTIONARY?  It seems like the only downside of not doing that is that
> > we'd leak some address space until process exit.  If you were thrashing
> > dictionaries at some unreasonable rate on a 32-bit host, you might
> > eventually run some sessions out of address space; but that doesn't seem
> > like a situation that's so common that we need fragile coding to avoid it.
> 
> I'm not sure what the situation is here.

I think this case may arise when you continuously create and drop a lot
of dictionaries, different connections work with them concurrently, and
some connection stops using text search at some point; its pinned
segments therefore won't be unpinned.

But I'm not sure this is a realistic case. Text search configuration
changes should be very infrequent (as noted in the
InvalidateTSCacheCallBack commentary).

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Wed, May 16, 2018 at 02:36:33PM +0300, Arthur Zakirov wrote:
> ... I attached the rebased patch.

I attached a new version of the patch.

I found a bug that occurred when the CompoundAffix, SuffixNodes,
PrefixNodes, or DictNodes arrays of the IspellDictData structure are
empty. Now they have a terminating entry and therefore always contain at
least one node entry.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Thu, Jun 14, 2018 at 11:40:17AM +0300, Arthur Zakirov wrote:
> I attached a new version of the patch.

The patch still applies to HEAD. I moved it to the next commitfest.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 01.10.2018 12:22, Arthur Zakirov wrote:
> On Thu, Jun 14, 2018 at 11:40:17AM +0300, Arthur Zakirov wrote:
>> I attached a new version of the patch.
> 
> The patch still applies to HEAD. I moved it to the next commitfest.
> 

Here is the rebased patch. I also updated the copyrights in ts_shared.h 
and ts_shared.c.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
Hello Arthur,

I've looked at the patch today, and in general it seems quite solid to
me. I do have a couple of minor points:

1) I think the comments need more work. Instead of describing all the
individual changes here, I've outlined those improvements in attached
patches (see the attached "tweaks" patches). Some of it is formatting,
minor rewording or larger changes. Some comments are rather redundant
(e.g. the one before calls to release the DSM segment).

2) It's not quite clear to me why we need DictInitData, which simply
combines DictPointerData and the list of options. It seems as if the only
point is to pass a single parameter to the init function, but is it
worth it? Why not get rid of DictInitData entirely and pass two
parameters instead?

3) I find it a bit cumbersome that before each ts_dict_shmem_release
call we construct a dummy DictPointerData value. Why not pass
individual parameters and construct the struct in the function?

4) The reference to max_shared_dictionaries_size is obsolete, because
there's no such limit anymore.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello Tomas,

On 16.01.2019 03:23, Tomas Vondra wrote:
> I've looked at the patch today, and in general it seems quite solid to
> me. I do have a couple of minor points:
> 
> 1) I think the comments need more work. Instead of describing all the
> individual changes here, I've outlined those improvements in attached
> patches (see the attached "tweaks" patches). Some of it is formatting,
> minor rewording or larger changes. Some comments are rather redundant
> (e.g. the one before calls to release the DSM segment).

Thank you!

> 2) It's not quite clear to me why we need DictInitData, which simply
> combines DictPointerData and the list of options. It seems as if the only
> point is to pass a single parameter to the init function, but is it
> worth it? Why not get rid of DictInitData entirely and pass two
> parameters instead?

Initially the init method had two parameters. But in the v7 patch I
added the DictInitData struct instead of the two parameters (the list of
options and DictPointerData):
https://www.postgresql.org/message-id/20180319110648.GA32319%40zakirov.localdomain

I don't have a way to change a template's init method from
init_method(internal) to init_method(internal,internal) in an extension's
upgrade script. If I'm not mistaken, we need new syntax here, like ALTER
TEXT SEARCH TEMPLATE. Thoughts?

> 3) I find it a bit cumbersome that before each ts_dict_shmem_release
> call we construct a dummy DictPointerData value. Why not pass
> individual parameters and construct the struct in the function?

Agreed, it may look too verbose. I'll change it.

> 4) The reference to max_shared_dictionaries_size is obsolete, because
> there's no such limit anymore.

Yeah, I'll fix it.

 > /* XXX not really a pointer, so the name is misleading */

I think we don't need the DictPointerData struct anymore, because only
the ts_dict_shmem_release function needs it (see comments above), and we
only need it for the hash search. I'll move all fields of DictPointerData
into the TsearchDictKey struct.

 > XXX "supported" is not the same as "all ispell dicts behave like that".

I'll reword the sentence.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
I attached the files of the new version of the patch; I applied your tweaks.

 > XXX All dictionaries, but only when there's invalid dictionary?

I've made a little optimization: I introduced hashvalue into
TSDictionaryCacheEntry. Now only the DSM segments of altered or dropped
dictionaries are released.

 >  > /* XXX not really a pointer, so the name is misleading */
 >
 > I think we don't need DictPointerData struct anymore, because only
 > ts_dict_shmem_release function needs it (see comments above) and we only
 > need it to hash search. I'll move all fields of DictPointerData to
 > TsearchDictKey struct.

I was wrong: DictInitData also needs DictPointerData. So I didn't remove
DictPointerData; I renamed it to DictEntryData. I hope that it is a more
appropriate name.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 1/17/19 3:15 PM, Arthur Zakirov wrote:
> I attached the files of the new version of the patch; I applied your tweaks.
> 
>> XXX All dictionaries, but only when there's invalid dictionary?
> 
> I've made a little optimization: I introduced hashvalue into
> TSDictionaryCacheEntry. Now only the DSM segments of altered or dropped
> dictionaries are released.
> 
>>  > /* XXX not really a pointer, so the name is misleading */
>>
>> I think we don't need the DictPointerData struct anymore, because only
>> the ts_dict_shmem_release function needs it (see comments above), and we
>> only need it for the hash search. I'll move all fields of DictPointerData
>> into the TsearchDictKey struct.
> 
> I was wrong: DictInitData also needs DictPointerData. So I didn't remove
> DictPointerData; I renamed it to DictEntryData. I hope that it is a more
> appropriate name.
> 

Thanks. I've reviewed v17 today and I haven't discovered any new issues
so far. If everything goes fine and no one protests, I plan to get it
committed over the next week or so.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:
On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote:
> On 1/17/19 3:15 PM, Arthur Zakirov wrote:
> > I attached the files of the new version of the patch; I applied your tweaks.
> > 
> >> XXX All dictionaries, but only when there's invalid dictionary?
> > 
> > I've made a little optimization: I introduced hashvalue into
> > TSDictionaryCacheEntry. Now only the DSM segments of altered or dropped
> > dictionaries are released.
> > 
> >>  > /* XXX not really a pointer, so the name is misleading */
> >>
> >> I think we don't need the DictPointerData struct anymore, because only
> >> the ts_dict_shmem_release function needs it (see comments above), and we
> >> only need it for the hash search. I'll move all fields of DictPointerData
> >> into the TsearchDictKey struct.
> > 
> > I was wrong: DictInitData also needs DictPointerData. So I didn't remove
> > DictPointerData; I renamed it to DictEntryData. I hope that it is a more
> > appropriate name.
> > 
> 
> Thanks. I've reviewed v17 today and I haven't discovered any new issues
> so far. If everything goes fine and no one protests, I plan to get it
> committed over the next week or so.

There doesn't seem to be any docs about what's needed to be able to take
advantage of shared dicts, and how to prevent them from permanently
taking up a significant share of memory.

Greetings,

Andres Freund


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 1/20/19 11:21 PM, Andres Freund wrote:
> On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote:
>> On 1/17/19 3:15 PM, Arthur Zakirov wrote:
>>> I attached the files of the new version of the patch; I applied your tweaks.
>>>
>>>> XXX All dictionaries, but only when there's invalid dictionary?
>>>
>>> I've made a little optimization: I introduced hashvalue into
>>> TSDictionaryCacheEntry. Now only the DSM segments of altered or dropped
>>> dictionaries are released.
>>>
>>>>   > /* XXX not really a pointer, so the name is misleading */
>>>>
>>>> I think we don't need the DictPointerData struct anymore, because only
>>>> the ts_dict_shmem_release function needs it (see comments above), and we
>>>> only need it for the hash search. I'll move all fields of DictPointerData
>>>> into the TsearchDictKey struct.
>>>
>>> I was wrong: DictInitData also needs DictPointerData. So I didn't remove
>>> DictPointerData; I renamed it to DictEntryData. I hope that it is a more
>>> appropriate name.
>>>
>>
>> Thanks. I've reviewed v17 today and I haven't discovered any new issues
>> so far. If everything goes fine and no one protests, I plan to get it
>> committed over the next week or so.
> 
> There doesn't seem to be any docs about what's needed to be able to take
> advantage of shared dicts, and how to prevent them from permanently
> taking up a significant share of memory.
> 

Yeah, those are good points. I agree the comments might be clearer, but
essentially ispell dictionaries are shared and everything else is not.

As for the memory consumption / unloading dicts - I agree that's
something we need to address. There used to be a way to specify memory
limit and ability to unload dictionaries explicitly, but both features
have been ditched. The assumption was that UNLOAD would be introduced
later, but that does not seem to have happened.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 21.01.2019 02:43, Tomas Vondra wrote:
> On 1/20/19 11:21 PM, Andres Freund wrote:
>> On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote:
>>> Thanks. I've reviewed v17 today and I haven't discovered any new issues
>>> so far. If everything goes fine and no one protests, I plan to get it
>>> committed over the next week or so.
>>
>> There doesn't seem to be any docs about what's needed to be able to take
>> advantage of shared dicts, and how to prevent them from permanently
>> taking up a significant share of memory.
>>
> 
> Yeah, those are good points. I agree the comments might be clearer, but
> essentially ispell dictionaries are shared and everything else is not.
> 
> As for the memory consumption / unloading dicts - I agree that's
> something we need to address. There used to be a way to specify memory
> limit and ability to unload dictionaries explicitly, but both features
> have been ditched. The assumption was that UNLOAD would be introduced
> later, but that does not seem to have happened.

I'll try to implement the syntax you suggested earlier:

ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD

The main point here is that UNLOAD/RELOAD can't release the memory
immediately, because some other backend may still pin the DSM.

The second point we should consider (I think) is how we know which
dictionary should be unloaded. There was such a function earlier, but it
was removed. What about adding information to the output of psql's "\dFd"
command? It could be a column which shows whether a dictionary is loaded.
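
A rough sketch of what UNLOAD could do internally (the lookup helper and
entry fields below are hypothetical; only dsm_unpin_segment() is the real
API):

void
UnloadTSDictionary(Oid dictId)
{
    TsearchDictEntry *entry = find_shared_dict_entry(dictId); /* hypothetical */

    if (entry != NULL && entry->segment_ispinned)
    {
        /* only unpin; the memory is freed once the last mapping is gone */
        dsm_unpin_segment(entry->handle);
        entry->segment_ispinned = false;
    }
}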

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:
On 1/21/19 12:51 PM, Arthur Zakirov wrote:
> On 21.01.2019 02:43, Tomas Vondra wrote:
>> On 1/20/19 11:21 PM, Andres Freund wrote:
>>> On 2019-01-20 23:15:35 +0100, Tomas Vondra wrote:
>>>> Thanks. I've reviewed v17 today and I haven't discovered any new issues
>>>> so far. If everything goes fine and no one protests, I plan to get it
>>>> committed over the next week or so.
>>>
>>> There doesn't seem to be any docs about what's needed to be able to take
>>> advantage of shared dicts, and how to prevent them from permanently
>>> taking up a significant share of memory.
>>>
>>
>> Yeah, those are good points. I agree the comments might be clearer, but
>> essentially ispell dictionaries are shared and everything else is not.
>>
>> As for the memory consumption / unloading dicts - I agree that's
>> something we need to address. There used to be a way to specify memory
>> limit and ability to unload dictionaries explicitly, but both features
>> have been ditched. The assumption was that UNLOAD would be introduced
>> later, but that does not seem to have happened.
> 
> I'll try to implement the syntax you suggested earlier:
> 
> ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD
> 
> The main point here is that UNLOAD/RELOAD can't release the memory
> immediately, because some other backend may still pin the DSM.
> 
> The second point we should consider (I think) is how we know which
> dictionary should be unloaded. There was such a function earlier, but it
> was removed. What about adding information to the output of psql's "\dFd"
> command? It could be a column which shows whether a dictionary is loaded.
> 

The UNLOAD capability is probably a good start, but it's entirely manual
and I wonder if it's putting too much burden on the user. I mean, the
user has to realize the dictionaries are using a lot of shared memory,
has to decide which to unload, and then has to do UNLOAD on it. That's
not quite straightforward, especially if there's no way to determine
which dictionaries are currently loaded and how much memory they use :-(

Of course, the problem is not exactly new - we don't show dictionaries
already loaded into private memory. The only thing we have is "unload"
capability by closing the connection. OTOH the memory consumption should
be much lower thanks to using shared memory. So I think the patch is an
improvement even in this regard.

I wonder if we could devise some simple cache eviction policy. We don't
have any memory limit GUC anymore, but maybe we could use unload
dictionaries that were unused for sufficient amount of time (a couple of
minutes or so). Of course, the question is when exactly would it happen
(it seems far too expensive to invoke on each dict access, and it should
happen even when the dicts are not accessed at all).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 21.01.2019 17:56, Tomas Vondra wrote:
> On 1/21/19 12:51 PM, Arthur Zakirov wrote:
>> I'll try to implement the syntax you suggested earlier:
>>
>> ALTER TEXT SEARCH DICTIONARY x UNLOAD/RELOAD
>>
>> The main point here is that UNLOAD/RELOAD can't release the memory
>> immediately, because some other backend may still pin the DSM.
>>
>> The second point we should consider (I think) is how we know which
>> dictionary should be unloaded. There was such a function earlier, but it
>> was removed. What about adding information to the output of psql's "\dFd"
>> command? It could be a column which shows whether a dictionary is loaded.
>>
> ...The only thing we have is "unload" capability by closing the
> connection...
BTW, even if the connection is closed and there are no other 
connections, a dictionary still remains "loaded". That is because 
dsm_pin_segment() is called when the dictionary is loaded into DSM.

> ...
> I wonder if we could devise some simple cache eviction policy. We don't
> have any memory limit GUC anymore, but maybe we could use unload
> dictionaries that were unused for sufficient amount of time (a couple of
> minutes or so). Of course, the question is when exactly would it happen
> (it seems far too expensive to invoke on each dict access, and it should
> happen even when the dicts are not accessed at all).

Yes, I thought about such a feature too. Agreed, it could be expensive 
since we need to scan the pg_ts_dict table to get the list of 
dictionaries (we can't scan a dshash_table).

I don't have a good solution yet. I just had the thought of bringing 
back max_shared_dictionaries_size. Then we could unload dictionaries 
(and scan the pg_ts_dict table) that were accessed a long time ago once 
we reach the size limit.
We can't set an exact size limit since we can't release the memory 
immediately. So max_shared_dictionaries_size could be renamed to 
shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has 
unlimited space for dictionaries.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Mon, Jan 21, 2019 at 7:42 PM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>
> On 21.01.2019 17:56, Tomas Vondra wrote:
> > I wonder if we could devise some simple cache eviction policy. We don't
> > have any memory limit GUC anymore, but maybe we could use unload
> > dictionaries that were unused for sufficient amount of time (a couple of
> > minutes or so). Of course, the question is when exactly would it happen
> > (it seems far too expensive to invoke on each dict access, and it should
> > happen even when the dicts are not accessed at all).
>
> Yes, I thought about such a feature too. Agreed, it could be expensive
> since we need to scan the pg_ts_dict table to get the list of
> dictionaries (we can't scan a dshash_table).
>
> I don't have a good solution yet. I just had the thought of bringing
> back max_shared_dictionaries_size. Then we could unload dictionaries
> (and scan the pg_ts_dict table) that were accessed a long time ago once
> we reach the size limit.
> We can't set an exact size limit since we can't release the memory
> immediately. So max_shared_dictionaries_size could be renamed to
> shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has
> unlimited space for dictionaries.

I want to propose cleaning up segments during vacuum/autovacuum. I'm not
aware of the policy on cleaning up objects other than relations during
vacuum/autovacuum. Could it be a good idea?

Vacuum might unload dictionaries when the total size of loaded
dictionaries exceeds a threshold. When that happens, vacuum scans the
loaded dictionaries and unloads (unpins the segments of and removes the
hash table entries for) those dictionaries which aren't mapped into any
backend process anymore (the segments survive without mappings because
dsm_pin_segment() was called).

max_shared_dictionaries_size can be renamed to
shared_dictionaries_cleanup_threshold.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tomas Vondra
Date:

On 1/22/19 7:36 PM, Arthur Zakirov wrote:
> On Mon, Jan 21, 2019 at 7:42 PM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>>
>> On 21.01.2019 17:56, Tomas Vondra wrote:
>>> I wonder if we could devise some simple cache eviction policy. We don't
>>> have any memory limit GUC anymore, but maybe we could use unload
>>> dictionaries that were unused for sufficient amount of time (a couple of
>>> minutes or so). Of course, the question is when exactly would it happen
>>> (it seems far too expensive to invoke on each dict access, and it should
>>> happen even when the dicts are not accessed at all).
>>
>> Yes, I thought about such feature too. Agree, it could be expensive
>> since we need to scan pg_ts_dict table to get list of dictionaries (we
>> can't scan dshash_table).
>>
>> I haven't a good solution yet. I just had a thought to return
>> max_shared_dictionaries_size. Then we can unload dictionaries (and scan
>> the pg_ts_dict table) that were accessed a lot time ago if we reached
>> the size limit.
>> We can't set exact size limit since we can't release the memory
>> immediately. So max_shared_dictionaries_size can be renamed to
>> shared_dictionaries_threshold. If it is equal to "0" then PostgreSQL has
>> unlimited space for dictionaries.
> 
> I want to propose cleaning up segments during vacuum/autovacuum. I'm not
> aware of the policy on cleaning up objects other than relations during
> vacuum/autovacuum. Could it be a good idea?
> 

I doubt that's a good idea, for a couple of reasons. For example, would
it be bound to autovacuum on a particular object or would it happen as
part of each vacuum run? If the dict cleanup happens only when vacuuming
a particular object, then which one? If it happens on each autovacuum
run, then that may easily be far too frequent (it essentially makes the
cases with too frequent autovacuum runs even worse).

But also, what happens when there is only minimal write activity and thus
no regular autovacuum runs? Surely we should still do the dict cleanup.

> Vacuum might unload dictionaries when the total size of loaded
> dictionaries exceeds a threshold. When that happens, vacuum scans the
> loaded dictionaries and unloads (unpins the segments of and removes the
> hash table entries for) those dictionaries which aren't mapped into any
> backend process anymore (the segments survive without mappings because
> dsm_pin_segment() was called).
> 

Then why tie that to autovacuum at all? Why not just make it part
of loading the dictionary?

> max_shared_dictionaries_size can be renamed to
> shared_dictionaries_cleanup_threshold.
> 

That really depends on what exactly the threshold does. If it only
triggers cleanup but does not enforce maximum amount of memory used by
dictionaries, then this name seems OK. If it ensures max amount of
memory, the max_..._size name would be better.

I think there are essentially two ways:

(a) Define max amount of memory available for shared dictionaries, and
come up with an eviction algorithm. This will be tricky, because when
the frequently-used dictionaries need a bit more memory than the limit,
this will result in thrashing (evict+load over and over).

(b) Define what "unused" means for dictionaries, and unload dictionaries
that become unused. For example, we could track timestamp of the last
time each dict was used, and decide that dictionaries unused for 5 or
more minutes are unused. And evict those.

The advantage of (b) is that it adopts automatically, more or less. When
you have a bunch of frequently used dictionaries, the amount of shared
memory increases. If you stop using them, it decreases after a while.
And rarely used dicts won't force eviction of the frequently used ones.
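
In code, (b) could be as simple as this (a sketch; the entry fields and
the unload helper are hypothetical):

/* on every access to the dictionary */
entry->last_used = GetCurrentTimestamp();

/* somewhere in a periodic check */
if (TimestampDifferenceExceeds(entry->last_used,
                               GetCurrentTimestamp(),
                               5 * 60 * 1000))       /* 5 minutes, in ms */
    unload_dictionary(entry);    /* hypothetical: unpin + drop the entry */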

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 22.01.2019 22:17, Tomas Vondra wrote:
> On 1/22/19 7:36 PM, Arthur Zakirov wrote:
>> max_shared_dictionaries_size can be renamed to
>> shared_dictionaries_cleanup_threshold.
> 
> That really depends on what exactly the threshold does. If it only
> triggers cleanup but does not enforce maximum amount of memory used by
> dictionaries, then this name seems OK. If it ensures max amount of
> memory, the max_..._size name would be better.

Yep, I thought about the first approach.

> I think there are essentially two ways:
> 
> (a) Define max amount of memory available for shared dictionaries, and
> come up with an eviction algorithm. This will be tricky, because when
> the frequently-used dictionaries need a bit more memory than the limit,
> this will result in thrashing (evict+load over and over).
> 
> (b) Define what "unused" means for dictionaries, and unload dictionaries
> that become unused. For example, we could track timestamp of the last
> time each dict was used, and decide that dictionaries unused for 5 or
> more minutes are unused. And evict those.
> 
> The advantage of (b) is that it adopts automatically, more or less. When
> you have a bunch of frequently used dictionaries, the amount of shared
> memory increases. If you stop using them, it decreases after a while.
> And rarely used dicts won't force eviction of the frequently used ones.

Thanks for sharing your ideas, Tomas. Unfortunately I won't manage to 
develop a new version of the patch before the end of the commitfest due 
to lack of time. I'll think about the second approach. Tracking the 
timestamp of the last time a dict was used may be difficult, though, and 
may slow down FTS...

I move the patch to the next commitfest.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 01.02.2019 12:09, Arthur Zakirov wrote:
> Thanks for sharing your ideas, Tomas. Unfortunately I won't manage to 
> develop a new version of the patch before the end of the commitfest due 
> to lack of time. I'll think about the second approach. Tracking the 
> timestamp of the last time a dict was used may be difficult, though, 
> and may slow down FTS...
> 
> I move the patch to the next commitfest.

Oh, it seems it can't be moved to the next commitfest from the status 
"Waiting on Author".

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Tue, Jan 22, 2019 at 2:17 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think there are essentially two ways:
>
> (a) Define max amount of memory available for shared dictionaries, and
> come up with an eviction algorithm. This will be tricky, because when
> the frequently-used dictionaries need a bit more memory than the limit,
> this will result in thrashing (evict+load over and over).
>
> (b) Define what "unused" means for dictionaries, and unload dictionaries
> that become unused. For example, we could track timestamp of the last
> time each dict was used, and decide that dictionaries unused for 5 or
> more minutes are unused. And evict those.
>
> The advantage of (b) is that it adopts automatically, more or less. When
> you have a bunch of frequently used dictionaries, the amount of shared
> memory increases. If you stop using them, it decreases after a while.
> And rarely used dicts won't force eviction of the frequently used ones.

+1 for (b).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:
Hi,

On 2019-02-01 09:40:44 -0500, Robert Haas wrote:
> On Tue, Jan 22, 2019 at 2:17 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I think there are essentially two ways:
> >
> > (a) Define max amount of memory available for shared dictionaries, and
> > come up with an eviction algorithm. This will be tricky, because when
> > the frequently-used dictionaries need a bit more memory than the limit,
> > this will result in thrashing (evict+load over and over).
> >
> > (b) Define what "unused" means for dictionaries, and unload dictionaries
> > that become unused. For example, we could track timestamp of the last
> > time each dict was used, and decide that dictionaries unused for 5 or
> > more minutes are unused. And evict those.
> >
> > The advantage of (b) is that it adopts automatically, more or less. When
> > you have a bunch of frequently used dictionaries, the amount of shared
> > memory increases. If you stop using them, it decreases after a while.
> > And rarely used dicts won't force eviction of the frequently used ones.
> 
> +1 for (b).

This patch has been waiting on its author for two weeks, the commitfest
has ended, and there's substantial work needed. Therefore I'm marking the
patch as returned with feedback. Please resubmit a new version once the
feedback has been addressed.

Greetings,

Andres Freund


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello,

I've created a new commitfest entry since the previous entry was 
closed with the status "Returned with feedback":

https://commitfest.postgresql.org/22/2007/

I've attached a new version of the patch. There are changes only in 
0003-Retrieve-shared-location-for-dict-v18.patch.

I added a reference counter to the shared hash table's dictionary 
entries. It is needed to avoid memory bloat: shared hash table entries 
must be deleted when there are many ALTER and DROP TEXT SEARCH 
DICTIONARY calls.

The previous version of the patch released unused DSM segments but left 
the shared hash table entries untouched.

There was a refcnt in an earlier version:

https://www.postgresql.org/message-id/20180403115720.GA7450%40zakirov.localdomain

But I didn't fully understand how on_dsm_detach() works.
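
As far as I understand it now, the usual pattern is to register a 
callback at attach time that fires when the backend detaches. A minimal 
sketch (the SharedDictHeader layout is hypothetical, not from the patch):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/dsm.h"

/* Hypothetical header at the start of a shared dictionary segment. */
typedef struct SharedDictHeader
{
    pg_atomic_uint32 refcnt;
} SharedDictHeader;

/* Runs automatically whenever this backend detaches from the segment. */
static void
shared_dict_on_detach(dsm_segment *seg, Datum arg)
{
    SharedDictHeader *hdr = (SharedDictHeader *) dsm_segment_address(seg);

    /* The last detacher could also remove the hash table entry here. */
    pg_atomic_fetch_sub_u32(&hdr->refcnt, 1);
}

/* Attach, bump the count, and arrange for the matching decrement. */
static dsm_segment *
shared_dict_attach(dsm_handle handle)
{
    dsm_segment *seg = dsm_attach(handle);
    SharedDictHeader *hdr = (SharedDictHeader *) dsm_segment_address(seg);

    pg_atomic_fetch_add_u32(&hdr->refcnt, 1);
    on_dsm_detach(seg, shared_dict_on_detach, (Datum) 0);
    return seg;
}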

On 22.01.2019 22:17, Tomas Vondra wrote:
> I think there are essentially two ways:
> 
> (a) Define a max amount of memory available for shared dictionaries, and
> come up with an eviction algorithm. This will be tricky, because when
> the frequently-used dictionaries need a bit more memory than the limit,
> this will result in thrashing (evict+load over and over).
> 
> (b) Define what "unused" means for dictionaries, and unload dictionaries
> that become unused. For example, we could track the timestamp of the last
> time each dict was used, and decide that dictionaries unused for 5 or
> more minutes are unused. And evict those.
> 
> The advantage of (b) is that it adapts automatically, more or less. When
> you have a bunch of frequently used dictionaries, the amount of shared
> memory increases. If you stop using them, it decreases after a while.
> And rarely used dicts won't force eviction of the frequently used ones.

I'm working on the (b) approach. I thought about a priority queue 
structure. There is no such ready-made structure in the PostgreSQL 
sources except binaryheap.c, and it isn't designed for concurrent use.
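
To make the idea concrete, a minimal sketch of the shared-side 
bookkeeping for approach (b); TsearchDictEntry and the helper are 
hypothetical names, and a plain linear scan may well be enough since the 
number of dictionaries is usually small:

#include "postgres.h"
#include "storage/dsm.h"
#include "utils/timestamp.h"

/* Hypothetical shared hash table entry; field names are illustrative. */
typedef struct TsearchDictEntry
{
    Oid         dictid;      /* hash key: dictionary OID */
    dsm_handle  seg_handle;  /* segment holding the compiled dict */
    TimestampTz last_used;   /* updated on every lexize call */
} TsearchDictEntry;

/* Unpin entries that have been idle for longer than idle_ms milliseconds. */
static void
evict_idle_dicts(TsearchDictEntry *entries, int nentries, int idle_ms)
{
    TimestampTz now = GetCurrentTimestamp();
    int         i;

    for (i = 0; i < nentries; i++)
    {
        if (TimestampDifferenceExceeds(entries[i].last_used, now, idle_ms))
            dsm_unpin_segment(entries[i].seg_handle);
    }
}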

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Wed, Feb 20, 2019 at 9:33 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> I'm working on the (b) approach. I thought about a priority queue
> structure. There is no such ready-made structure in the PostgreSQL
> sources except binaryheap.c, and it isn't designed for concurrent use.

I don't see why you need a priority queue or, really, any other fancy
data structure.  It seems like all you need to do is somehow set it up
so that a backend which doesn't use a dictionary for a while will
dsm_detach() the segment.  Eventually an unused dictionary will have
no remaining references and will go away.
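
Roughly like this, assuming each backend keeps a small local table of 
the segments it has attached (all names below are made up for 
illustration):

#include "postgres.h"
#include "storage/dsm.h"
#include "utils/timestamp.h"

/* Hypothetical per-backend cache of attached dictionary segments. */
typedef struct LocalDictMapping
{
    Oid          dictid;
    dsm_segment *seg;        /* our attachment, or NULL */
    TimestampTz  last_used;  /* bumped on every use */
} LocalDictMapping;

/*
 * Run occasionally (say, at transaction end): detach from segments this
 * backend hasn't used lately.  Once every backend has detached, the
 * segment's reference count reaches zero and the DSM machinery destroys
 * it -- no central eviction policy needed.
 */
static void
release_idle_dict_mappings(LocalDictMapping *map, int n, int idle_ms)
{
    TimestampTz now = GetCurrentTimestamp();
    int         i;

    for (i = 0; i < n; i++)
    {
        if (map[i].seg != NULL &&
            TimestampDifferenceExceeds(map[i].last_used, now, idle_ms))
        {
            dsm_detach(map[i].seg);
            map[i].seg = NULL;
        }
    }
}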

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 21.02.2019 15:45, Robert Haas wrote:
> On Wed, Feb 20, 2019 at 9:33 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>> I'm working on the (b) approach. I thought about a priority queue
>> structure. There is no such ready-made structure in the PostgreSQL
>> sources except binaryheap.c, and it isn't designed for concurrent use.
> 
> I don't see why you need a priority queue or, really, any other fancy
> data structure.  It seems like all you need to do is somehow set it up
> so that a backend which doesn't use a dictionary for a while will
> dsm_detach() the segment.  Eventually an unused dictionary will have
> no remaining references and will go away.

Hm, I didn't think of it this way. Agreed that introducing a new data 
structure would be overengineering.

In the current patch all DSM segments are pinned (i.e. dsm_pin_segment() 
is called), so a dictionary lives in shared memory even if nobody has a 
reference to it.

I thought about periodically scanning the shared hash table and 
unpinning old and unused dictionaries. But this approach needs a 
sequential scan facility for dshash. Happily, there is a patch from 
Kyotaro-san (the v16-0001-sequential-scan-for-dshash.patch part):

https://www.postgresql.org/message-id/20190221.160555.191280262.horiguchi.kyotaro@lab.ntt.co.jp

Your approach looks simpler. It is only necessary to periodically scan 
the dictionaries' cache hash table and to stop calling dsm_pin_segment() 
when a DSM segment is initialized. It also means that a dictionary stays 
in DSM only while some backend is attached to the dictionary's segment.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Robert Haas
Date:
On Thu, Feb 21, 2019 at 8:28 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> Your approach looks simpler. It is only necessary to periodically scan
> the dictionaries' cache hash table and to stop calling dsm_pin_segment()
> when a DSM segment is initialized. It also means that a dictionary stays
> in DSM only while some backend is attached to the dictionary's segment.

Right.  I think that having a central facility that tries to decide
whether or not a dictionary should be kept in shared memory or not,
e.g. based on a cache size parameter, isn't likely to work well.  The
problem is that if we make a decision that a dictionary should be
evicted because it's causing us to exceed the cache size threshold,
then we have no way to implement that decision.  We can't force other
backends to remove the mapping immediately, nor can we really bound
the time before they respond to a request to unmap it.  They might be
in the middle of using it.

So I think it's better to have each backend locally make a decision
about when that particular backend no longer needs the dictionary, and
then let the system automatically clean up the ones that are needed by
nobody.

Perhaps a better approach still would be to do what Andres proposed
back in March:

#> Is there any chance we can instead convert dictionaries into a form
#> we can just mmap() into memory?  That'd scale a lot higher and more
#> dynamically?

The current approach inherently involves double-buffering: you've got
the filesystem cache containing the data read from disk, and then the
DSM containing the converted form of the data.  Having something that
you could just mmap() would avoid that, plus it would become a lot
less critical to keep the mappings around.  You could probably just
have individual queries mmap() it for as long as they need it and then
tear out the mapping when they finish executing; keeping the mappings
across queries likely wouldn't be too important in this case.

The downside is that you'd probably need to teach resowner.c about
mappings created via mmap() so that you don't leak mappings on an
abort, but that's probably not a crazy difficult problem.
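
The mapping side would be just a plain read-only mmap() of the compiled 
file -- something like this sketch (error handling trimmed):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a compiled dictionary file; returns NULL on failure. */
static void *
map_dict_file(const char *path, size_t *len)
{
    int         fd;
    struct stat st;
    void       *addr;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }

    /* Read-only shared mapping: every backend shares the page cache pages. */
    addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */

    if (addr == MAP_FAILED)
        return NULL;
    *len = (size_t) st.st_size;
    return addr;
}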

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Perhaps a better approach still would be to do what Andres proposed
> back in March:

> #> Is there any chance we can instead convert dictionaries into a form
> #> we can just mmap() into memory?  That'd scale a lot higher and more
> #> dynamically?

That seems awfully attractive.  I was about to question whether we could
assume that mmap() works everywhere, but it's required by SUSv2 ... and
if anybody has anything sufficiently lame for it not to work, we could
fall back on malloc-a-hunk-of-memory-and-read-in-the-file.

We'd need a bunch of work to design a position-independent binary
representation for dictionaries, and then some tool to produce disk files
containing that, so this isn't exactly a quick route to a solution.
On the other hand, it isn't sounding like the current patch is getting
close to committable either.

(Actually, I guess you need a PI representation of a dictionary to
put it in a DSM anyway, so presumably that part of the work is
done already; although we might also wish for architecture independence
of the disk files, which we probably don't have right now.)
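
(To illustrate what position independence demands: every pointer becomes 
an offset from the mapping base. A toy sketch, not the actual IspellDict 
node layout:)

#include "postgres.h"

/* Today: pointer-based node, valid only in the backend that built it. */
typedef struct PtrNode
{
    struct PtrNode *child;
    uint32      val;
} PtrNode;

/* Position-independent equivalent: offsets from the mapping base. */
typedef struct PINode
{
    uint32      child_off;  /* byte offset; 0 means "no child" */
    uint32      val;
} PINode;

/* Resolve an offset against wherever the file got mapped this time. */
static inline PINode *
pi_child(char *map_base, const PINode *node)
{
    if (node->child_off == 0)
        return NULL;
    return (PINode *) (map_base + node->child_off);
}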

            regards, tom lane


Re: [PROPOSAL] Shared Ispell dictionaries

From
Andres Freund
Date:

On February 21, 2019 10:08:00 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Robert Haas <robertmhaas@gmail.com> writes:
>> Perhaps a better approach still would be to do what Andres proposed
>> back in March:
>
>> #> Is there any chance we can instead convert dictionaries into a form
>> #> we can just mmap() into memory?  That'd scale a lot higher and more
>> #> dynamically?
>
>That seems awfully attractive.  I was about to question whether we could
>assume that mmap() works everywhere, but it's required by SUSv2 ... and
>if anybody has anything sufficiently lame for it not to work, we could
>fall back on malloc-a-hunk-of-memory-and-read-in-the-file.
>
>We'd need a bunch of work to design a position-independent binary
>representation for dictionaries, and then some tool to produce disk files
>containing that, so this isn't exactly a quick route to a solution.
>On the other hand, it isn't sounding like the current patch is getting
>close to committable either.
>
>(Actually, I guess you need a PI representation of a dictionary to
>put it in a DSM anyway, so presumably that part of the work is
>done already; although we might also wish for architecture independence
>of the disk files, which we probably don't have right now.)

That's what I was pushing for ages ago...
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 21.02.2019 19:13, Robert Haas wrote:
> So I think it's better to have each backend locally make a decision
> about when that particular backend no longer needs the dictionary, and
> then let the system automatically clean up the ones that are needed by
> nobody.

Yep, it wouldn't be hard to implement.

> Perhaps a better approach still would be to do what Andres proposed
> back in March:
> 
> #> Is there any chance we can instead convert dictionaries into a form
> #> we can just mmap() into memory?  That'd scale a lot higher and more
> #> dynamically?
> 
> The current approach inherently involves double-buffering: you've got
> the filesystem cache containing the data read from disk, and then the
> DSM containing the converted form of the data.  Having something that
> you could just mmap() would avoid that, plus it would become a lot
> less critical to keep the mappings around.  You could probably just
> have individual queries mmap() it for as long as they need it and then
> tear out the mapping when they finish executing; keeping the mappings
> across queries likely wouldn't be too important in this case.
> 
> The downside is that you'd probably need to teach resowner.c about
> mappings created via mmap() so that you don't leak mappings on an
> abort, but that's probably not a crazy difficult problem.

It seems to me Tom and Andres also vote for the mmap() approach. I think 
I need to look more closely at mmap().

I've labeled the patch as 'v13'.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On 25.02.2019 14:33, Arthur Zakirov wrote:
> It seems to me Tom and Andres also vote for the mmap() approach. I think 
> I need to look more closely at mmap().
> 
> I've labeled the patch as 'v13'.
Unfortunately I haven't come up with a new patch yet, so I've marked the 
entry as "Returned with feedback" for now.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
Hello hackers,

On 25.02.2019 14:33, Arthur Zakirov wrote:
>> The current approach inherently involves double-buffering: you've got
>> the filesystem cache containing the data read from disk, and then the
>> DSM containing the converted form of the data.  Having something that
>> you could just mmap() would avoid that, plus it would become a lot
>> less critical to keep the mappings around.  You could probably just
>> have individual queries mmap() it for as long as they need it and then
>> tear out the mapping when they finish executing; keeping the mappings
>> across queries likely wouldn't be too important in this case.
>>
>> The downside is that you'd probably need to teach resowner.c about
>> mappings created via mmap() so that you don't leak mappings on an
>> abort, but that's probably not a crazy difficult problem.
> 
> It seems to me Tom and Andres also vote for the mmap() approach. I think 
> I need to look more closely at mmap().
> 
> I've labeled the patch as 'v13'.

I've attached a new version of the patch. Note that it is in a WIP state 
for now and there are unresolved issues, which are listed at the end of 
the email.

The patch implements a simple approach using mmap(). I also want to be 
sure that I'm going in the right direction, so feel free to send feedback.

On every dispell_init() call Postgres checks whether there is a shared 
dictionary file in the pg_shdict directory; if there is, it calls mmap(). 
If there is no such file, it compiles the dictionary, writes it to the 
file, and then calls mmap(). (A sketch of this flow is below.)

dispell_lexize() works with the already mmap'ed dictionary, so it doesn't 
mmap() for each individual query as Robert proposed above. That approach 
halves performance (I tested with ts_lexize() calls via pgbench).
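
To sketch the flow in code (simplified; compile_and_write_dict() and 
map_dict_file() are placeholders, not the patch's actual helpers):

#include "postgres.h"

/* Placeholder declarations; see the mmap() sketch upthread. */
extern void *map_dict_file(const char *path, size_t *len);
extern void compile_and_write_dict(Oid dictid, const char *path);

static void *
load_shared_dict(Oid dictid, size_t *len)
{
    char        path[MAXPGPATH];
    void       *addr;

    snprintf(path, sizeof(path), "pg_shdict/%u", dictid);

    /* Fast path: another backend already compiled this dictionary. */
    addr = map_dict_file(path, len);
    if (addr != NULL)
        return addr;

    /* Slow path: compile from the Ispell files, persist, then map. */
    compile_and_write_dict(dictid, path);
    return map_dict_file(path, len);
}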

Tests
-----

As in:
https://www.postgresql.org/message-id/20180124172039.GA11210%40zakirov.localdomain

I performed the same tests. There are no big differences in the numbers, 
except that files are now created in the pg_shdict directory:

czech_hunspell - 9.2 MB file
english_hunspell - 1.9 MB file
french_hunspell - 4.6 MB file

TODO
----

- Improve the documentation and comments.
- Eliminate shared dictionary files after DROP/ALTER calls. It is 
necessary to come up with some fancy file name. For now it is just the 
OID of a dictionary, so it is possible to add the database OID, xmin or 
xmax into the file name (see the strawman sketch below).
- We can't remove the file right away after DROP/ALTER. Is it a good 
idea to use autovacuum here?
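
As a strawman for the file name question, something like the following 
would disambiguate across databases and dictionary versions (entirely 
hypothetical; dict_xmin handling is hand-waved):

#include "postgres.h"
#include "miscadmin.h"      /* for MyDatabaseId */

/* dictid and dict_xmin are hypothetical inputs describing the dictionary. */
static void
build_shared_dict_path(char *path, size_t pathlen, Oid dictid,
                       TransactionId dict_xmin)
{
    /* e.g. "pg_shdict/16384_16410_573": database OID, dict OID, xmin */
    snprintf(path, pathlen, "pg_shdict/%u_%u_%u",
             MyDatabaseId, dictid, dict_xmin);
}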

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [PROPOSAL] Shared Ispell dictionaries

From
Alvaro Herrera
Date:
Is 0001 a bugfix?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PROPOSAL] Shared Ispell dictionaries

From
Arthur Zakirov
Date:
On Fri, Apr 5, 2019 at 8:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Is 0001 a bugfix?

Yep, it is indeed a bugfix and can be applied independently.

The fix allocates the temporary strings in the temporary context 
Conf->buildCxt.
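
That is, roughly this pattern ('line' stands for a hypothetical scratch 
string, not the patch's actual variable):

/* Before: the scratch string was allocated in the current (long-lived)
 * context and lingered after dictionary building finished: */
char       *scratch_before = pstrdup(line);

/* After: allocate it in the build context instead, so that releasing
 * Conf->buildCxt frees all such temporary strings at once: */
char       *scratch_after = MemoryContextStrdup(Conf->buildCxt, line);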

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company