Thread: ADD COLUMN ts tsvector GENERATED too slow

ADD COLUMN ts tsvector GENERATED too slow

From
Florents Tselai
Date:
I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
These are text fields extracted from 4-5 page pdfs each.

I’m adding the following generated col to keep up with tsvectors

ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED

I expect this to be slow, but it’s been running for 18hrs already and I certainly hope I’ve done something wrong and there’s a smarter way.

I thought about incremental updates and/or triggers, but a generated column is a cleaner solution.
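For reference, a trigger-based alternative would look roughly like this (a sketch; the function and trigger names are made up, and existing rows would still need a batched backfill):

ALTER TABLE docs_text ADD COLUMN ts tsvector;  -- plain column, no table rewrite

CREATE FUNCTION docs_text_ts_update() RETURNS trigger AS $$
BEGIN
    -- Recompute the tsvector only for the row being inserted or updated.
    NEW.ts := to_tsvector('simple', left(NEW.text, 1048575));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_text_ts_trg
    BEFORE INSERT OR UPDATE OF text ON docs_text
    FOR EACH ROW EXECUTE FUNCTION docs_text_ts_update();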


Re: ADD COLUMN ts tsvector GENERATED too slow

From
Peter Eisentraut
Date:
On 06.07.22 10:42, Florents Tselai wrote:
> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
> These are text fields extracted from 4-5 page pdfs each.
> 
> I’m adding the following generated col to keep up with tsvectors
> 
> ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED
> 
> I expect this to be slow, but it’s been running for 18hrs already and I certainly hope I’ve done something wrong and there’s a smarter way.

Maybe it's stuck on a lock?  ALTER TABLE / ADD COLUMN requires an 
exclusive lock on the table.



Re: ADD COLUMN ts tsvector GENERATED too slow

From
Florents Tselai
Date:

> On 6 Jul 2022, at 12:38 PM, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 06.07.22 10:42, Florents Tselai wrote:
>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
>> These are text fields extracted from 4-5 page pdfs each.
>> I’m adding the following generated col to keep up with tsvectors
>> ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED
>> I expect this to be slow, but it’s been running for 18hrs already and I certainly hope I’ve done something wrong and there’s a smarter way.
>
> Maybe it's stuck on a lock?  ALTER TABLE / ADD COLUMN requires an exclusive lock on the table.

Looking at pg_locks and pg_stat_activity, it doesn’t seem like it’s stuck behind something like an ACCESS SHARE lock or similar.
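For reference, a check along these lines (a sketch joining pg_locks to pg_stat_activity) would show any backend waiting on an ungranted lock:

SELECT a.pid, a.state, a.wait_event_type, a.wait_event,
       l.locktype, l.mode, l.granted,
       left(a.query, 60) AS query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE NOT l.granted;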




Re: ADD COLUMN ts tsvector GENERATED too slow

From
Florents Tselai
Date:
Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.

> On 6 Jul 2022, at 12:48 PM, Florents Tselai <florents.tselai@gmail.com> wrote:
>
>
>
>> On 6 Jul 2022, at 12:38 PM, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>
>> On 06.07.22 10:42, Florents Tselai wrote:
>>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
>>> These are text fields extracted from 4-5 page pdfs each.
>>> I’m adding the following generated col to keep up with tsvectors
>>> ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED
>>> I expect this to be slow, but it’s been running for 18hrs already and I certainly hope I’ve done something wrong and there’s a smarter way.
>>
>> Maybe it's stuck on a lock?  ALTER TABLE / ADD COLUMN requires an exclusive lock on the table.
>
> Looking at pg_locks and pg_stat_activity, it doesn’t seem like it’s stuck behind something like an ACCESS SHARE lock or similar.
>
>
>




Re: ADD COLUMN ts tsvector GENERATED too slow

From
Francisco Olarte
Date:
On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents.tselai@gmail.com> wrote:
> Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.
...
> >> On 06.07.22 10:42, Florents Tselai wrote:
> >>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
> >>> These are text fields extracted from 4-5 page pdfs each.

How big is your table? From your query it seems you expect values
longer than 1M-1 characters (hence the left(...)), but if you have very
big text columns it may be spending a lot of time fully decompressing /
reading them (I'm not sure whether left(...) on toasted values is
optimized to stop after reading enough). Also, it has to rewrite a lot
of data to insert the column, and it may take some milliseconds per
row: 50M rows * 1 ms/row = 50,000 s ~= 13.9 hours for each millisecond
per row, so at 2 ms (which may be right for reading a big row,
calculating the vector and writing an even bigger row) it would take
more than a day to finish, which I would not rule out given you are
asking for a heavy thing.

If you have stopped it I would try doing a 1000-row sample in a copied
table to get an idea of the speed. Otherwise, with this query, I would
normally monitor disk usage of the data files as an indication of
progress; I'm not sure there is anything else you could look at
without disturbing it.
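Something like this, for example (a sketch; the scratch table name is made up, and \timing assumes psql):

-- Copy a small sample into a scratch table and time the same ALTER there,
-- then extrapolate to the full 50M rows.
CREATE TABLE docs_text_sample AS
    SELECT * FROM docs_text LIMIT 1000;

\timing on
ALTER TABLE docs_text_sample ADD COLUMN ts tsvector
    GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;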

FWIW, I would consider high memory usage normal for this kind of query.
High CPU would depend on what you call high, but it wouldn't surprise me
if it has at least one CPU running flat out detoasting and building
vectors; I do not know whether ALTER TABLE can go parallel.



Francisco Olarte.



Re: ADD COLUMN ts tsvector GENERATED too slow

From
Florents Tselai
Date:

> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>
> On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents.tselai@gmail.com> wrote:
>> Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.
> ...
>>>> On 06.07.22 10:42, Florents Tselai wrote:
>>>>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
>>>>> These are text fields extracted from 4-5 page pdfs each.
>
> How big is your table? From your query it seems you expect values
> longer than 1M-1 characters (hence the left(...)), but if you have very
> big text columns it may be spending a lot of time fully decompressing /
> reading them (I'm not sure whether left(...) on toasted values is
> optimized to stop after reading enough). Also, it has to rewrite a lot
> of data to insert the column, and it may take some milliseconds per
> row: 50M rows * 1 ms/row = 50,000 s ~= 13.9 hours for each millisecond
> per row, so at 2 ms (which may be right for reading a big row,
> calculating the vector and writing an even bigger row) it would take
> more than a day to finish, which I would not rule out given you are
> asking for a heavy thing.

50M+ rows and iirc pg_relation_size was north of 80GB or so.

>
> If you have stopped it I would try doing a 1000-row sample in a copied

Haven’t stopped it, as I’m not convinced there’s an alternative to just waiting
for it to complete :/

> table to get an idea of the speed. Otherwise, with this query, I would
> normally monitor disk usage of the data files as an indication of
> progress; I'm not sure there is anything else you could look at
> without disturbing it.
>
> FWIW, I would consider high memory usage normal for this kind of query.
> High CPU would depend on what you call high, but it wouldn't surprise me
> if it has at least one CPU running flat out detoasting and building
> vectors; I do not know whether ALTER TABLE can go parallel.
>

You’re probably right, a lot of the CPU usage could be detoasting.

> Francisco Olarte.
>
>




Re: ADD COLUMN ts tsvector GENERATED too slow

From
Florents Tselai
Date:


> On 6 Jul 2022, at 1:35 PM, Florents Tselai <florents.tselai@gmail.com> wrote:
>
>> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>>
>> On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents.tselai@gmail.com> wrote:
>>> Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.
>> ...
>>>>> On 06.07.22 10:42, Florents Tselai wrote:
>>>>>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
>>>>>> These are text fields extracted from 4-5 page pdfs each.
>>
>> How big is your table? From your query it seems you expect values
>> longer than 1M-1 characters (hence the left(...)), but if you have very
>> big text columns it may be spending a lot of time fully decompressing /
>> reading them (I'm not sure whether left(...) on toasted values is
>> optimized to stop after reading enough). Also, it has to rewrite a lot
>> of data to insert the column, and it may take some milliseconds per
>> row: 50M rows * 1 ms/row = 50,000 s ~= 13.9 hours for each millisecond
>> per row, so at 2 ms (which may be right for reading a big row,
>> calculating the vector and writing an even bigger row) it would take
>> more than a day to finish, which I would not rule out given you are
>> asking for a heavy thing.
>
> 50M+ rows and iirc pg_relation_size was north of 80GB or so.
>
>> If you have stopped it I would try doing a 1000-row sample in a copied
>
> Haven’t stopped it, as I’m not convinced there’s an alternative to just waiting
> for it to complete :/
>
>> table to get an idea of the speed. Otherwise, with this query, I would
>> normally monitor disk usage of the data files as an indication of

Actually, I monitored my disk usage and it was **definitely** working, as
it had already eaten up an additional 30% of my disk capacity.

Thus, I’ll have to fall back on my initial solution and use GIN expression indexes
to get tsvectors on the fly (rough sketch below).

>> progress; I'm not sure there is anything else you could look at
>> without disturbing it.
>>
>> FWIW, I would consider high memory usage normal for this kind of query.
>> High CPU would depend on what you call high, but it wouldn't surprise me
>> if it has at least one CPU running flat out detoasting and building
>> vectors; I do not know whether ALTER TABLE can go parallel.
>
> You’re probably right, a lot of the CPU usage could be detoasting.
>
>> Francisco Olarte.


Thanks everyone for your comments.
You can consider this solved. 
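For the record, the expression-index fallback would be roughly this (a sketch; the index name is made up, and queries have to repeat the exact same expression to use the index):

CREATE INDEX docs_text_ts_idx ON docs_text
    USING GIN (to_tsvector('simple', left(text, 1048575)));

-- Example query shape that can use the index:
SELECT key
FROM docs_text
WHERE to_tsvector('simple', left(text, 1048575)) @@ plainto_tsquery('simple', 'some search terms');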

Re: ADD COLUMN ts tsvector GENERATED too slow

From
Alvaro Herrera
Date:
On 2022-Jul-06, Florents Tselai wrote:

> Actually, I monitored my disk usage and it was **definitely** working, as
> it had already eaten up an additional 30% of my disk capacity.

Adding a column like this requires creating a second copy of the table,
copying all the contents from the old table (plus the new column) into
the new one, then recreating all indexes from scratch on the new copy of
the table.  If you have a lot of indexes, this can be significant.
Also, AFAIU all data has to be uncompressed on read, then compressed
back on write.

Note: the 80 GB from pg_relation_size() does *not* include the size of
TOAST data.  You're copying a lot of additional data.  See
pg_table_size().
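For example, a quick comparison (a sketch against the docs_text table) would be:

SELECT pg_size_pretty(pg_relation_size('docs_text')) AS heap_only,
       pg_size_pretty(pg_table_size('docs_text'))    AS with_toast_etc;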

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)



Re: ADD COLUMN ts tsvector GENERATED too slow

From
Francisco Olarte
Date:
Hi Florents:

On Wed, 6 Jul 2022 at 12:35, Florents Tselai <florents.tselai@gmail.com> wrote:
> 50M+ rows and iirc pg_relation_size was north of 80GB or so.

Somebody already mentioned pg_table_size, but anyway this gives you about
1.6 kB per row, which I suspect is extremely low given your pdf
content and the 1M-character truncation you preemptively applied.

When working with data of this size it really pays to do some good
ol' "back of the envelope" calculations and/or some tests with small
batches. Bear in mind that if you have, say, 10 kB per row (which I would
call really low for pdf content) you are going to have 500 GB of
data; if you manage to process this at 100 MB per second it will take
more than an hour (and I think I'm estimating really low).
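For example, a rough per-row estimate (a sketch; note the count(*) is itself a full scan on a 50M-row table):

SELECT pg_size_pretty(pg_table_size('docs_text')) AS total_size,
       pg_size_pretty(pg_table_size('docs_text') / count(*)) AS avg_per_row
FROM docs_text;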

FOS