Thread: [HACKERS] Remove 1MB size limit in tsvector
Hello, hackers!

Historically the tsvector type can't hold more than 1MB of data. I want to
propose a patch that removes that limit.

The limit comes from the 'pos' field of WordEntry, which has only 20 bits
of storage. In the proposed patch I removed this field; instead, offsets
are kept only at every Nth item in the WordEntry array. For now N is set
to 4, because that gave the best results in my benchmarks. It can be
increased in the future without affecting data already saved in the
database. Removing the field also improves the compression of tsvectors.

I simplified the code by creating functions that can be used to build
tsvectors; there were duplicated code fragments in the places where a
tsvector was built. The patch also frees some space in WordEntry that can
be used to store additional information about the saved words.

---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
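For readers not familiar with the tsvector internals, here is the pre-patch
WordEntry layout (from src/include/tsearch/ts_type.h), whose 20-bit 'pos'
field is what caps a tsvector at 2^20 bytes = 1MB, together with a sketch of
the every-Nth-offset scheme described above. EntrySketch, STRIDE, and
entry_offset are illustrative names, not the patch's actual code, and lexeme
alignment (the patch SHORTALIGNs each lexeme) is ignored for brevity:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t uint32;

/* Old on-disk layout (src/include/tsearch/ts_type.h): the 20-bit 'pos'
 * field limits offsets into the lexeme data to 2^20 bytes, i.e. 1MB. */
typedef struct
{
	uint32	haspos:1,
			len:11,		/* MAX 2KB */
			pos:20;		/* MAX 1MB */
} WordEntry;

/* Sketch of the proposed scheme: only every STRIDE-th entry stores an
 * absolute offset; the others are reconstructed by summing lexeme
 * lengths from the nearest preceding "anchor" entry. */
#define STRIDE 4

typedef struct
{
	bool	hasoff;		/* true for anchors that carry an offset */
	uint32	offset;		/* valid only when hasoff is true */
	uint32	len;		/* lexeme length, always present */
} EntrySketch;

static uint32
entry_offset(const EntrySketch *entries, int i)
{
	int		anchor = i - (i % STRIDE);	/* nearest preceding anchor */
	uint32	off;

	assert(entries[anchor].hasoff);
	off = entries[anchor].offset;
	for (int j = anchor; j < i; j++)
		off += entries[j].len;			/* walk forward by lengths */
	return off;
}

With STRIDE = 4, three out of four entries shed the offset field entirely,
which is where both the space savings and the improved compressibility
come from; looking up entry i costs at most STRIDE - 1 additions.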
On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> Historically the tsvector type can't hold more than 1MB of data.
> I want to propose a patch that removes that limit.
>
> The limit comes from the 'pos' field of WordEntry, which has only
> 20 bits of storage.
>
> In the proposed patch I removed this field; instead, offsets are kept
> only at every Nth item in the WordEntry array.

So this would break pg_upgrade for tsvector columns?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 1 Aug 2017 14:56:54 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
> > Historically the tsvector type can't hold more than 1MB of data.
> > I want to propose a patch that removes that limit.
> >
> > The limit comes from the 'pos' field of WordEntry, which has only
> > 20 bits of storage.
> >
> > In the proposed patch I removed this field; instead, offsets are
> > kept only at every Nth item in the WordEntry array.
>
> So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly. It's the
approach used in hstore before.

Regards,
Ildus Kurbangaliev
On Tue, Aug 1, 2017 at 3:10 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
>> So this would break pg_upgrade for tsvector columns?
>
> I added a function that will convert old tsvectors on the fly. It's the
> approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
> <i.kurbangaliev@postgrespro.ru> wrote:
> >> So this would break pg_upgrade for tsvector columns?
> >
> > I added a function that will convert old tsvectors on the fly. It's
> > the approach used in hstore before.
>
> Does that mean the answer to the question that I asked is "yes, but I
> have a workaround" or does it mean that the answer is "no"?

It's a workaround. DatumGetTSVector and DatumGetTSVectorCopy will
upgrade a tsvector on the fly if it has the old format.

Regards,
Ildus Kurbangaliev
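To make that mechanism concrete, an on-the-fly upgrade path of the shape
described above might look roughly like this. This is a hypothetical sketch
assuming PostgreSQL's fmgr.h and ts_type.h: TS_IS_OLD_FORMAT stands in for
whatever marker the patch actually uses to distinguish the old on-disk
layout, while tsvector_upgrade(Datum, bool) is the conversion function the
patch adds:

/* Hypothetical sketch, not the patch's actual code: detoast the datum,
 * check a format marker, and convert only when the old layout is seen. */
static TSVector
DatumGetTSVectorSketch(Datum d)
{
	TSVector	tsv = (TSVector) PG_DETOAST_DATUM(d);

	if (TS_IS_OLD_FORMAT(tsv))			/* hypothetical format check */
		tsv = tsvector_upgrade(PointerGetDatum(tsv), false);

	return tsv;
}

The point of this shape is that already-converted values pay only the cost
of the format check, while old values are converted lazily at read time
rather than during pg_upgrade.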
On 01.08.2017 22:00, Ildus K wrote:
> On Tue, 1 Aug 2017 15:33:08 -0400
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
>> <i.kurbangaliev@postgrespro.ru> wrote:
>>>> So this would break pg_upgrade for tsvector columns?
>>>
>>> I added a function that will convert old tsvectors on the fly. It's
>>> the approach used in hstore before.
>>
>> Does that mean the answer to the question that I asked is "yes, but I
>> have a workaround" or does it mean that the answer is "no"?
>
> It's a workaround. DatumGetTSVector and DatumGetTSVectorCopy will
> upgrade a tsvector on the fly if it has the old format.

I'm not familiar with pg_upgrade, but want to ask: should this
workaround be part of pg_upgrade?

Greetings,
Torsten
On Wed, 9 Aug 2017 09:01:44 +0200
Torsten Zuehlsdorff <mailinglists@toco-domains.de> wrote:

> On 01.08.2017 22:00, Ildus K wrote:
> > On Tue, 1 Aug 2017 15:33:08 -0400
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >
> >> On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
> >> <i.kurbangaliev@postgrespro.ru> wrote:
> >>>> So this would break pg_upgrade for tsvector columns?
> >>>
> >>> I added a function that will convert old tsvectors on the fly.
> >>> It's the approach used in hstore before.
> >>
> >> Does that mean the answer to the question that I asked is "yes,
> >> but I have a workaround" or does it mean that the answer is "no"?
> >
> > It's a workaround. DatumGetTSVector and DatumGetTSVectorCopy will
> > upgrade a tsvector on the fly if it has the old format.
>
> I'm not familiar with pg_upgrade, but want to ask: should this
> workaround be part of pg_upgrade?
>
> Greetings,
> Torsten

I chose the approach where the data remains the same until the user
decides to update it. I'm not so familiar with pg_upgrade myself, and I
don't currently see how the data would be converted with it, but doing
the conversion there would in any case increase downtime, and the
shorter the downtime the better.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
> It's a workaround. DatumGetTSVector and
> DatumGetTSVectorCopy will upgrade a tsvector on the fly if it
> has the old format.

Hmm, that seems like a real fix, not just a workaround. If you can
transparently read the old format, there's no problem. Not sure about
performance, though.

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch doesn't really conform to our coding standards, though, so
> you need to clean it up (or, if you're not sure what you need to do,
> you need to have someone who knows how PostgreSQL code needs to look
> review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html

--
Michael
On Thu, Aug 10, 2017 at 7:37 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The patch doesn't really conform to our coding standards, though, so
>> you need to clean it up (or, if you're not sure what you need to do,
>> you need to have someone who knows how PostgreSQL code needs to look
>> review it for you).
>
> The documentation has a couple of rules for coding conventions:
> https://www.postgresql.org/docs/9.6/static/source.html
+1
Ildus, at first glance I see at least the following violations of the PostgreSQL coding standards in your code.
+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
This comment will be reflowed by pgindent. Also, we don't use '@' for parameter descriptions in comments.
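For reference, in the usual house style such a function comment would read
roughly like this (an illustration of the convention, not required wording):

/*
 * tsvector_upgrade
 *		Convert a tsvector with the old on-disk structure to the
 *		current format.
 *
 * If 'copy' is true, return a copy even when the input is already in
 * the new format and no conversion is needed.
 */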
+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+ int i,
+ dataoff = 0,
+ datalen = 0,
+ totallen;
+ TSVector in,
+ out;
You have a random mix of tabs and spaces here.
+ {
+ stroff = SHORTALIGN(stroff); \
+ entry->hasoff = 0;
+ entry->len = lexeme_len;
+ entry->npos = npos;
+ }
What is this backslash doing here?
There are other similar (and probably different) violations of the coding standard throughout the code. Ildus, please check your patches carefully before publishing.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:
>> It's a workaround. DatumGetTSVector and
>> DatumGetTSVectorCopy will upgrade a tsvector on the fly if it
>> has the old format.
>
> Hmm, that seems like a real fix, not just a workaround. If you can
> transparently read the old format, there's no problem. Not sure about
> performance, though.
+1
Ildus, I think we need to benchmark reading of the old format. There would be a tradeoff between the performance of reading the old format and the amount of extra code needed. Once we have benchmarks, we can consider whether this is a solution we would like to buy.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> ...
> You have a random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting. It
should be pretty easy to set that up nowadays, see
src/tools/pgindent/README. (If you find any portability problems
while trying to install pgindent, please let me know.)

			regards, tom lane
On Thu, 10 Aug 2017 11:46:55 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > ...
> > You have a random mix of tabs and spaces here.
>
> It's worth running pgindent over your code before submitting. It
> should be pretty easy to set that up nowadays, see
> src/tools/pgindent/README. (If you find any portability problems
> while trying to install pgindent, please let me know.)

Attached is a new version of the patch. It mostly contains cosmetic
changes: I rebased it onto current master, ran pgindent, and fixed
formatting errors.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
On Thu, 10 Aug 2017 18:06:17 +0300
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com>
> wrote:
>
> > On Tue, Aug 1, 2017 at 4:00 PM, Ildus K
> > <i.kurbangaliev@postgrespro.ru> wrote:
> > > It's a workaround. DatumGetTSVector and
> > > DatumGetTSVectorCopy will upgrade a tsvector on the fly if it
> > > has the old format.
> >
> > Hmm, that seems like a real fix, not just a workaround. If you can
> > transparently read the old format, there's no problem. Not sure
> > about performance, though.
>
> +1
> Ildus, I think we need to benchmark reading of the old format. There
> would be a tradeoff between the performance of reading the old format
> and the amount of extra code needed. Once we have benchmarks, we can
> consider whether this is a solution we would like to buy.

In my benchmarks, when the database fits into buffers (so what is being
measured is the time required for the tsvector conversion), I get these
results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections: 4
2017/08/17 12:04:44 Database: test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections: 4
2017/08/17 12:14:31 Database: test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and the results are stable on my machine.
So in these specific tests the performance regression is about 15%
(43607/51419 ≈ 0.85).

At the same time, I think this could be the worst case, because usually
the data is on disk and the conversion will not affect performance as
much.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> In my benchmarks, when the database fits into buffers (so what is
> being measured is the time required for the tsvector conversion), I
> get these results:
>
> Without conversion:
>
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:04:44 Number of connections: 4
> 2017/08/17 12:04:44 Database: test1
> 2017/08/17 12:09:44 Processed: 51419
>
> With conversion:
>
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:14:31 Number of connections: 4
> 2017/08/17 12:14:31 Database: test1
> 2017/08/17 12:19:31 Processed: 43607
>
> I ran a bunch of these tests, and the results are stable on my
> machine. So in these specific tests the performance regression is
> about 15%.
>
> At the same time, I think this could be the worst case, because
> usually the data is on disk and the conversion will not affect
> performance as much.

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter; plenty of
databases have a 99% cache hit ratio (particularly when considering not
just shared buffers but also the page cache).

Can you share the benchmarks, so that others can retry running them?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 7 Sep 2017 23:08:14 +0200
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> Hi,
>
> On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> > In my benchmarks, when the database fits into buffers (so what is
> > being measured is the time required for the tsvector conversion),
> > I get these results:
> >
> > [benchmark results trimmed]
> >
> > I ran a bunch of these tests, and the results are stable on my
> > machine. So in these specific tests the performance regression is
> > about 15%.
> >
> > At the same time, I think this could be the worst case, because
> > usually the data is on disk and the conversion will not affect
> > performance as much.
>
> That seems like a fairly significant regression, TBH. I don't quite
> agree we can simply assume in-memory workloads don't matter; plenty
> of databases have a 99% cache hit ratio (particularly when
> considering not just shared buffers but also the page cache).

I think part of this regression is caused by the better compression of
the new format. I can't give an exact percentage here; I need to check
with perf.

If you care about performance, you create indexes, which means that a
tsvector will no longer be used for text search itself (except for
ORDER BY rank); the index machinery only peeks into the tsquery.
Moreover, a RUM index stores positions + lexemes, so it doesn't need
tsvectors even for ranked search. As a result, tsvector becomes storage
for building indexes (an indexable type), not something that should be
used at runtime. And the change of format doesn't affect index creation
time.

> Can you share the benchmarks, so that others can retry running them?

The benchmarks are published on GitHub: https://github.com/ildus/tsbench.
I'm not sure that they are easy to use.

Best regards,
Ildus Kurbangaliev
On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> Moreover, a RUM index stores positions + lexemes, so it doesn't need
> tsvectors even for ranked search. As a result, tsvector becomes
> storage for building indexes (an indexable type), not something that
> should be used at runtime. And the change of format doesn't affect
> index creation time.

RUM indexes, though, are not in core.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/11/2017 01:54 PM, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>> Moreover, a RUM index stores positions + lexemes, so it doesn't need
>> tsvectors even for ranked search. As a result, tsvector becomes
>> storage for building indexes (an indexable type), not something that
>> should be used at runtime. And the change of format doesn't affect
>> index creation time.
>
> RUM indexes, though, are not in core.

Yeah, but I think Ildus has a point that this should not really matter
for indexed tsvectors. So the question is how realistic that benchmark
actually is. How likely are we to run full-text-search queries on the
tsvector directly, not through a GIN/GiST index? Particularly in
performance-sensitive cases?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 11, 2017 at 9:51 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/11/2017 01:54 PM, Robert Haas wrote:
>> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
>> <i.kurbangaliev@postgrespro.ru> wrote:
>>> Moreover, a RUM index stores positions + lexemes, so it doesn't need
>>> tsvectors even for ranked search. As a result, tsvector becomes
>>> storage for building indexes (an indexable type), not something that
>>> should be used at runtime. And the change of format doesn't affect
>>> index creation time.
>>
>> RUM indexes, though, are not in core.
>
> Yeah, but I think Ildus has a point that this should not really matter
> for indexed tsvectors. So the question is how realistic that benchmark
> actually is. How likely are we to run full-text-search queries on the
> tsvector directly, not through a GIN/GiST index? Particularly in
> performance-sensitive cases?

So many questions unanswered... I am marking the patch as returned with
feedback as the thread has stalled for two months now.

--
Michael