[HACKERS] Remove 1MB size limit in tsvector

From
Ildus Kurbangaliev
Date:
Hello, hackers!

Historically, the tsvector type can't hold more than 1MB of data.
I want to propose a patch that removes that limit.

That limit comes from the 'pos' field of WordEntry, which has only
20 bits of storage.

In the proposed patch I removed this field; instead, I keep offsets
only at every Nth item in the WordEntry array. For now N is 4, because
that gave the best results in my benchmarks. It can be increased in
the future without affecting data already stored in the database.
Removing the field also improves compression of tsvectors.

I simplified the code by creating functions that can be used to
build tsvectors, since there were duplicated code fragments in the
places where tsvectors were built.

The new patch also frees some space in WordEntry that can be used to
store additional information about stored words.

---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Remove 1MB size limit in tsvector

From
Robert Haas
Date:
On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> Historically tsvector type can't hold more than 1MB data.
> I want to propose a patch that removes that limit.
>
> That limit is created by 'pos' field from WordEntry, which have only
> 20 bits for storage.
>
> In the proposed patch I removed this field and instead of it I keep
> offsets only at each Nth item in WordEntry's array.

So this would break pg_upgrade for tsvector columns?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus K
Date:
On Tue, 1 Aug 2017 14:56:54 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
> > Historically tsvector type can't hold more than 1MB data.
> > I want to propose a patch that removes that limit.
> >
> > That limit is created by 'pos' field from WordEntry, which have only
> > 20 bits for storage.
> >
> > In the proposed patch I removed this field and instead of it I keep
> > offsets only at each Nth item in WordEntry's array.  
> 
> So this would break pg_upgrade for tsvector columns?
> 

I added a function that will convert old tsvectors on the fly. It's
the approach that was previously used in hstore.

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Robert Haas
Date:
On Tue, Aug 1, 2017 at 3:10 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
>> So this would break pg_upgrade for tsvector columns?
>
> I added a function that will convert old tsvectors on the fly. It's the
> approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus K
Date:
On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
> <i.kurbangaliev@postgrespro.ru> wrote:
> >> So this would break pg_upgrade for tsvector columns?  
> >
> > I added a function that will convert old tsvectors on the fly. It's
> > the approach used in hstore before.  
> 
> Does that mean the answer to the question that I asked is "yes, but I
> have a workaround" or does it mean that the answer is "no"?
> 

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade a tsvector on the fly if it
has the old format.

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Torsten Zuehlsdorff
Date:

On 01.08.2017 22:00, Ildus K wrote:
> On Tue, 1 Aug 2017 15:33:08 -0400
> Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
>> <i.kurbangaliev@postgrespro.ru> wrote:
>>>> So this would break pg_upgrade for tsvector columns?
>>>
>>> I added a function that will convert old tsvectors on the fly. It's
>>> the approach used in hstore before.
>>
>> Does that mean the answer to the question that I asked is "yes, but I
>> have a workaround" or does it mean that the answer is "no"?
>>
> 
> It's a workaround. DatumGetTSVector and
> DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> has old format.

I'm not familiar with pg_upgrade, but I want to ask: should this
workaround be part of pg_upgrade?

Greetings,
Torsten



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus Kurbangaliev
Date:
On Wed, 9 Aug 2017 09:01:44 +0200
Torsten Zuehlsdorff <mailinglists@toco-domains.de> wrote:

> On 01.08.2017 22:00, Ildus K wrote:
> > On Tue, 1 Aug 2017 15:33:08 -0400
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >   
> >> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
> >> <i.kurbangaliev@postgrespro.ru> wrote:  
> >>>> So this would break pg_upgrade for tsvector columns?  
> >>>
> >>> I added a function that will convert old tsvectors on the fly.
> >>> It's the approach used in hstore before.  
> >>
> >> Does that mean the answer to the question that I asked is "yes,
> >> but I have a workaround" or does it mean that the answer is "no"?
> >>  
> > 
> > It's a workaround. DatumGetTSVector and
> > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > has old format.  
> 
> I'm not familiar with pg_upgrade, but want to ask: should this 
> workaround be part of pg_upgrade?
> 
> Greetings,
> Torsten

I chose the approach where the data remains unchanged until the user
decides to update it. I'm not that familiar with pg_upgrade myself and
don't currently see how the data would be converted with it, but it
would in any case increase downtime, and the shorter the downtime the
better.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Robert Haas
Date:
On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
> It's a workaround. DatumGetTSVector and
> DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> has old format.

Hmm, that seems like a real fix, not just a workaround.  If you can
transparently read the old format, there's no problem.  Not sure about
performance, though.

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Michael Paquier
Date:
On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch doesn't really conform to our coding standards, though, so
> you need to clean it up (or, if you're not sure what you need to do,
> you need to have someone who knows how PostgreSQL code needs to look
> review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html
-- 
Michael



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Alexander Korotkov
Date:
On Thu, Aug 10, 2017 at 7:37 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch doesn't really conform to our coding standards, though, so
> you need to clean it up (or, if you're not sure what you need to do,
> you need to have someone who knows how PostgreSQL code needs to look
> review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html

+1

Ildus, at first glance I see at least the following violations of the PostgreSQL coding standards in your code.

+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
This comment will be reflowed by pgindent.  Also, we don't use '@' for parameter descriptions in comments.

+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+ int       i,
+           dataoff = 0,
+   datalen = 0,
+   totallen;
+ TSVector   in,
+   out;

You have a random mix of tabs and spaces here.

+ {
+ stroff = SHORTALIGN(stroff); \
+ entry->hasoff = 0;
+ entry->len = lexeme_len;
+ entry->npos = npos;
+ }

What is this backslash doing here?
There are other similar (and probably different) violations of the coding standard throughout the code.  Ildus, please check your patches carefully before publishing.
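For reference, this is roughly how pgindent would lay out the declarations quoted above (tab indentation, variable names aligned to a tab stop). This is only a sketch of the expected style, not code from the patch:

```c
TSVector
tsvector_upgrade(Datum orig, bool copy)
{
	int			i,
				dataoff = 0,
				datalen = 0,
				totallen;
	TSVector	in,
				out;

	/* ... */
}
```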

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Remove 1MB size limit in tsvector

From
Alexander Korotkov
Date:
On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
> It's a workaround. DatumGetTSVector and
> DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> has old format.

Hmm, that seems like a real fix, not just a workaround.  If you can
transparently read the old format, there's no problem.  Not sure about
performance, though.

+1
Ildus, I think we need to benchmark reading of the old format.  There is a tradeoff between the performance of old-format reading and the amount of extra code needed.  Once we have benchmarks, we can consider whether this is a tradeoff we would like to accept.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Remove 1MB size limit in tsvector

From
Tom Lane
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> ...
> You have random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting.  It should
be pretty easy to set that up nowadays, see src/tools/pgindent/README.
(If you find any portability problems while trying to install pgindent,
please let me know.)
        regards, tom lane



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus Kurbangaliev
Date:
On Thu, 10 Aug 2017 11:46:55 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > ...
> > You have random mix of tabs and spaces here.  
> 
> It's worth running pgindent over your code before submitting.  It
> should be pretty easy to set that up nowadays, see
> src/tools/pgindent/README. (If you find any portability problems
> while trying to install pgindent, please let me know.)

Attached is a new version of the patch. It mostly contains cosmetic
changes. I rebased it onto current master, ran pgindent, and fixed
formatting errors.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Attachment

Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus Kurbangaliev
Date:
On Thu, 10 Aug 2017 18:06:17 +0300
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com>
> wrote:
> 
> > On Tue, Aug 1, 2017 at 4:00 PM, Ildus K
> > <i.kurbangaliev@postgrespro.ru> wrote:  
> > > It's a workaround. DatumGetTSVector and
> > > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > > has old format.  
> >
> > Hmm, that seems like a real fix, not just a workaround.  If you can
> > transparently read the old format, there's no problem.  Not sure
> > about performance, though.
> >  
> 
> +1
> Ildus, I think we need to benchmark reading of the old format.  There
> would be tradeoff between performance of old format reading and
> amount of extra code needed.  Once we will have benchmarks we can
> consider whether this is the solution we would like to buy.

In my benchmarks, when the database fits into buffers (so it's a
measurement of the time required for the tsvector conversion), I get
these results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections:  4
2017/08/17 12:04:44 Database:  test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections:  4
2017/08/17 12:14:31 Database:  test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and these results are stable on my
machine. So in these specific tests the performance regression is
about 15%.

At the same time, I think this could be the worst case, because
usually the data is on disk and the conversion will not affect
performance as much.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Tomas Vondra
Date:
Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> In my benchmarks when database fits into buffers (so it's measurement of
> the time required for the tsvectors conversion) it gives me these
> results:
> 
> Without conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:04:44 Number of connections:  4
> 2017/08/17 12:04:44 Database:  test1
> 2017/08/17 12:09:44 Processed: 51419
> 
> With conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:14:31 Number of connections:  4
> 2017/08/17 12:14:31 Database:  test1
> 2017/08/17 12:19:31 Processed: 43607
> 
> I ran a bunch of these tests, and these results are stable on my
> machine. So in these specific tests performance regression about 15%.
> 
> Same time I think this could be the worst case, because usually data
> is on disk and conversion will not affect so much to performance.
> 

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter, plenty of
databases have 99% cache hit ratio (particularly when considering not
just shared buffers, but also page cache).

Can you share the benchmarks, so that others can retry running them?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Ildus Kurbangaliev
Date:
On Thu, 7 Sep 2017 23:08:14 +0200
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> Hi,
> 
> On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> > In my benchmarks when database fits into buffers (so it's
> > measurement of the time required for the tsvectors conversion) it
> > gives me these results:
> > 
> > Without conversion:
> > 
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:04:44 Number of connections:  4
> > 2017/08/17 12:04:44 Database:  test1
> > 2017/08/17 12:09:44 Processed: 51419
> > 
> > With conversion:
> > 
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:14:31 Number of connections:  4
> > 2017/08/17 12:14:31 Database:  test1
> > 2017/08/17 12:19:31 Processed: 43607
> > 
> > I ran a bunch of these tests, and these results are stable on my
> > machine. So in these specific tests performance regression about
> > 15%.
> > 
> > Same time I think this could be the worst case, because usually data
> > is on disk and conversion will not affect so much to performance.
> >   
> 
> That seems like a fairly significant regression, TBH. I don't quite
> agree we can simply assume in-memory workloads don't matter, plenty of
> databases have 99% cache hit ratio (particularly when considering not
> just shared buffers, but also page cache).

I think part of this regression is caused by the better compression of
the new format. I can't give an exact percentage here; I need to check
with perf.

If you care about performance, you create indexes, which means the
tsvector will no longer be used for text search (except for ORDER BY
rank). The index machinery only peeks into the tsquery. Moreover, a
RUM index stores positions + lexemes, so it doesn't need tsvectors for
ranked search. As a result, tsvector becomes a storage format for
building indexes (an indexable type), not something that should be
used at runtime. And the change of format doesn't affect index
creation time.

> 
> Can you share the benchmarks, so that others can retry running them?

The benchmarks are published on GitHub:
https://github.com/ildus/tsbench . I'm not sure they are easy to use.

Best regards,
Ildus Kurbangaliev



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Robert Haas
Date:
On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> Moreover, RUM index
> stores positions + lexemes, so it doesn't need tsvectors for ranked
> search. As a result, tsvector becomes a storage for
> building indexes (indexable type), not something that should be used at
> runtime. And the change of the format doesn't affect index creation
> time.

RUM indexes, though, are not in core.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Tomas Vondra
Date:
On 09/11/2017 01:54 PM, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>> Moreover, RUM index
>> stores positions + lexemes, so it doesn't need tsvectors for ranked
>> search. As a result, tsvector becomes a storage for
>> building indexes (indexable type), not something that should be used at
>> runtime. And the change of the format doesn't affect index creation
>> time.
> 
> RUM indexes, though, are not in core.
> 

Yeah, but I think Ildus has a point that this should not really matter
for indexed tsvectors. So the question is how realistic that benchmark
actually is. How likely are we to run queries on fts directly, not
through a GIN/GiST index? Particularly in performance-sensitive cases?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Remove 1MB size limit in tsvector

From
Michael Paquier
Date:
On Mon, Sep 11, 2017 at 9:51 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/11/2017 01:54 PM, Robert Haas wrote:
>> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
>> <i.kurbangaliev@postgrespro.ru> wrote:
>>> Moreover, RUM index
>>> stores positions + lexemes, so it doesn't need tsvectors for ranked
>>> search. As a result, tsvector becomes a storage for
>>> building indexes (indexable type), not something that should be used at
>>> runtime. And the change of the format doesn't affect index creation
>>> time.
>>
>> RUM indexes, though, are not in core.
>>
>
> Yeah, but I think Ildus has a point that this should not really matter
> on indexed tsvectors. So the question is how realistic that benchmark
> actually is. How likely are we to do queries on fts directly, not
> through a GIN/GiST index? Particularly in performance-sensitive cases?

So many questions unanswered... I am marking the patch as returned
with feedback, as the thread has stalled for two months now.
-- 
Michael