Thread: text fields and performance for ETL
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/9.6/datatype-character.html
Description:

A text field is a field intended for very large texts. Performance within the Postgres database is documented to be the same as for a normal varchar.

But performance in ETL processes involving this data type decreases dramatically, because any process that consumes this kind of data has to calculate its size at the row level and cannot take bigger chunks of data based on a maximum size.

So the conclusion would be to use an appropriate character length that reflects the maximum string size.
On Wed, Nov 3, 2021 at 01:29:19PM +0000, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/9.6/datatype-character.html
> Description:
>
> A text field is a field intended for very large texts.
> Performance within the Postgres database is documented to be the same as
> for a normal varchar.
>
> But performance in ETL processes involving this data type decreases
> dramatically, because any process that consumes this kind of data has to
> calculate its size at the row level and cannot take bigger chunks of data
> based on a maximum size.
>
> So the conclusion would be to use an appropriate character length that
> reflects the maximum string size.

I have no idea what you are saying above. There is no additional
overhead for TEXT vs VARCHAR() in Postgres, so it seems like an overhead
in the applications.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
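For anyone who wants to check this claim directly, a minimal sketch (hypothetical table names; compare timings with psql's \timing):

    -- Two otherwise identical tables, one text and one varchar(n).
    CREATE TABLE t_text    (val text);
    CREATE TABLE t_varchar (val varchar(64));

    -- Load the same generated data into both; both types share the same
    -- variable-length on-disk representation, so the timings come out
    -- essentially equal.
    INSERT INTO t_text    SELECT md5(g::text) FROM generate_series(1, 1000000) g;
    INSERT INTO t_varchar SELECT md5(g::text) FROM generate_series(1, 1000000) g;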
Dear David,
> The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
Then it should say so, because an inexperienced reader takes this limited focus and generalizes it.
It is for the very same reason that programmers don't perceive the need to limit a string to its realistic size that ETL processes
are slowed down.
For example, for a currency code where a 3-character encoding is used, the field should be char(3) and not text.
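To illustrate (a sketch with hypothetical names), the type itself then rejects oversized values:

    CREATE TABLE prices (currency char(3), amount numeric);

    INSERT INTO prices VALUES ('EUR', 9.99);   -- accepted
    INSERT INTO prices VALUES ('EURO', 9.99);  -- rejected:
    -- ERROR:  value too long for type character(3)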
> The documentation assumes that the sizes involved here are reasonable for such behavior
On the contrary. When you say "performance is the same..." you give the wrong impression that it is a universal case. But, in fact, it depends.
When you include such a table in some dataflow from server A to server B, that process will be slowed down. I explained why.
It is important to teach people to update the model to use logical sizes that make sense for such cases.
If you teach instead "ah, it is not important, you may use text", then you actually ignore the part of the IT team that puts this data to use.
With this you don't contribute. You create a problem.
> All of my ETL simply reads
Well, perhaps yours. But there are professional tools (I refer to SSIS in this context) that provide a very fast dataflow when there is a known record size.
In that case you go 100x faster as a rule of thumb. When dealing with billions of records, it makes a big difference.
Best regards
Grega
PS
I work at actual-it.si, and gmail.com mail is forwarded to me, so I took a shortcut here.
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Wednesday, November 3, 2021 3:38 PM
To: grega.jesih@gmail.com; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 6:55 AM PG Doc comments form <noreply@postgresql.org> wrote:
> But performance in ETL processes involving this data type decreases
> dramatically, because any process that consumes this kind of data has to
> calculate its size at the row level and cannot take bigger chunks of data
> based on a maximum size.
All of my ETL simply reads in the entire contents of a text field. There is no chunking. The documentation assumes that the sizes involved here are reasonable for such behavior. If you have a situation where you've chosen to use varchar(n) and can defend that choice, more power to you. Those special circumstances are not of particular interest here. The vast majority of users use varchar(n) because they (or, more likely, their teachers) come from systems where it is required. The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
David J.
Grega Jesih <Grega.Jesih@actual-it.si> writes:
>> The goal in our docs is to point out that using an arbitrary length
>> specification is not required in PostgreSQL.

> Well, perhaps yours. But there are professional tools (I refer to SSIS
> in this context) that provide a very fast dataflow when there is a known
> record size.

That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.

I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.

			regards, tom lane
Dear Tom and David,
> I'll just point out in passing, though, that "fixed record size" for text
> strings is an illusion as soon as you have to deal with non-ASCII data.
> So I'm skeptical that such optimizations are worth much anymore.
It doesn't matter whether it is ASCII or not. A string is a string. A UTF-8 string is also one, just differently encoded.
The optimization itself matters a lot. It means time savings. Plenty of time. So we're talking performance. Not Postgres performance, interface performance.
The new architectures include more and more data exchange among databases.
Now, when you deal with bigger data sizes that go from millions to billions of records, this fixed size versus the undefined size of text becomes very relevant.
But what I am trying to achieve is that you describe both the view local to the database itself and the broader view, integration.
There is also the inner aspect of a database model where, for a currency code of true size 3, you choose char(3) instead of text.
You prevent a dummy insert at the database level, so no text of length > 3 may enter this field.
BR
Grega
Sent: November 3, 2021 6:37:28 PM
To: Grega Jesih
Cc: David G. Johnston; grega.jesih@gmail.com; Pg Docs
Subject: Re: text fields and performance for ETL
>> The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
> Well perhaps yours. But there are pro-tools (I refer to SSIS in this context) that provide a very fast dataflow in case there is a known record size.
That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.
I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.
regards, tom lane
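A quick way to see this point in a UTF-8 database (a sketch; any psql session will do):

    -- Three characters is not three bytes once non-ASCII data appears:
    SELECT length('abc'), octet_length('abc');   -- 3, 3
    SELECT length('åäö'), octet_length('åäö');   -- 3, 6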
Grega Jesih <Grega.Jesih@actual-it.si> writes:
> The optimization itself matters a lot. It means time savings. Plenty of
> time. So we're talking performance. Not Postgres performance, interface
> performance.

One more time: our docs are here to explain Postgres performance. It is
very easy to show that char/varchar are strictly worse than text so far
as Postgres is concerned. Other tools need to document their own
performance considerations in their own docs.

Now, if you have a *semantic* consideration, like "a state abbreviation
should be exactly two characters", those datatypes might help you with
enforcing that. But that's not a performance consideration.

			regards, tom lane
David,
> Can you demonstrate, with actual numbers, using today's implementation, a situation where defining a column as char(3) or varchar(3) instead of text has a significant performance improvement?
Sure I can.
But first, what am I to prove? The performance of a dataflow from server A to server B.
What is that performance change based on? It is based on the fact that the tool that pumps the data can calculate the block size and thus work with several
rows at a time instead of one row at a time.
Is this Postgres performance related? NO. It is model design related.
So why would it then be published? Because inexperienced programmers take your statement that "it's the same performance" from the wrong perspective, so it would be fair to note that the remark "it's the same performance" means "within any operation inside the Postgres database". The moment we want to take this data elsewhere,
the problematic lack of model design comes out.
Now for the demo: I can make a video to show it, but here is a discussion in context that talks about the thing:
https://dba.stackexchange.com/questions/102830/avoiding-row-by-row-fetch-method-when-dealing-with-source-lob-columns
If this contribution from stackexchange isn't enough, let me know.
BR
Grega
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Wednesday, November 3, 2021 7:40 PM
To: Grega Jesih <Grega.Jesih@actual-it.si>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; grega.jesih@gmail.com; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 11:09 AM Grega Jesih <Grega.Jesih@actual-it.si> wrote:
> The new architectures include more and more data exchange among databases.
> Now, when you deal with bigger data sizes that go from millions to billions of records, this fixed size versus the undefined size of text becomes very relevant.
Can you demonstrate, with actual numbers, using today's implementation, a situation where defining a column as char(3) or varchar(3) instead of text has a significant performance improvement? Without a concrete example to examine I'm unable to be convinced to move away from the status quo.
You also need to convince me as to why constraints are an insufficient feature. i.e., why is char(3) better than (check length(val) = 3)?
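For concreteness, a sketch of the two alternatives (hypothetical table names):

    -- Length enforced by the type modifier:
    CREATE TABLE t_char  (currency char(3));

    -- Length enforced by a CHECK constraint on an unconstrained type:
    CREATE TABLE t_check (currency text CHECK (length(currency) = 3));

    -- Note: char(3) blank-pads shorter values (e.g. 'EU' becomes 'EU ')
    -- instead of rejecting them, so the CHECK version is the stricter one.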
Even with all that, I'd probably still not do anything beyond reviewing a proposed patch (i.e., I wouldn't try to write one myself from scratch... I don't have authority to commit regardless).
David J.
Hi David.
> We’re entitled to a bit of self-centeredness here, especially when the broader world is so varied
One can tell ;-)
Current text:

Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
Suggested current text addendum:
But if you consider doing ETL from a Postgres database to some outer target environment, and you seek performance in such interfaces, follow the logic of limited-size (varchar or char) data types in your database model. If you build a dataflow out of known-size types, the interface code can take a big block of data, while for text fields it has to check each record.
Optional additional remark:
Another good aspect of known data sizes is easier understanding of the field content and implicit control of the data length.
Best regards
Grega
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Thursday, November 4, 2021 2:51 PM
To: Grega Jesih <Grega.Jesih@actual-it.si>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Thursday, November 4, 2021, Grega Jesih <Grega.Jesih@actual-it.si> wrote:
> So why would it then be published? Because inexperienced programmers take your statement that "it's the same performance" from the wrong perspective, so it would be fair to note that the remark "it's the same performance" means "within any operation inside the Postgres database". The moment we want to take this data elsewhere,
> the problematic lack of model design comes out.
If you wish to supply an actual patch for consideration, I’d review it. Absent that, the documentation serves the vast majority of readers well as-is. We’re entitled to a bit of self-centeredness here, especially when the broader world is so varied.
David J.
On Fri, 2021-11-05 at 09:51 +0000, Grega Jesih wrote:
> Suggested current text addendum:
>
> But if you consider doing ETL from a Postgres database to some outer
> target environment, and you seek performance in such interfaces, follow
> the logic of limited-size (varchar or char) data types in your database
> model. If you build a dataflow out of known-size types, the interface
> code can take a big block of data, while for text fields it has to check
> each record.

I am opposed to that.

It is not our business to discuss the limitations of a certain third-party
software product. If that were something widespread, perhaps. But I myself
have never seen a problem with "text", as long as the actual size of the
data is moderate.

> Optional additional remark:
>
> Another good aspect of known data sizes is easier understanding of the
> field content and implicit control of the data length.

Something like that makes more sense to me.

Perhaps, right before the tip you quoted, something like this:

  If your use case requires a length limit on character data, or compliance
  with the SQL standard is important, use "character varying".
  Otherwise, you are usually better off with "text".

Yours,
Laurenz Albe
On Fri, Nov 5, 2021 at 03:15:35PM +0100, Laurenz Albe wrote:
> On Fri, 2021-11-05 at 09:51 +0000, Grega Jesih wrote:
> > Suggested current text addendum:
> >
> > But if you consider doing ETL from a Postgres database to some outer
> > target environment, and you seek performance in such interfaces, follow
> > the logic of limited-size (varchar or char) data types in your database
> > model. If you build a dataflow out of known-size types, the interface
> > code can take a big block of data, while for text fields it has to check
> > each record.
>
> I am opposed to that.
>
> It is not our business to discuss the limitations of a certain third-party
> software product. If that were something widespread, perhaps. But I myself
> have never seen a problem with "text", as long as the actual size of the
> data is moderate.

Agreed.

> > Optional additional remark:
> >
> > Another good aspect of known data sizes is easier understanding of the
> > field content and implicit control of the data length.
>
> Something like that makes more sense to me.
>
> Perhaps, right before the tip you quoted, something like this:
>
>   If your use case requires a length limit on character data, or compliance
>   with the SQL standard is important, use "character varying".
>   Otherwise, you are usually better off with "text".

I can support that if others think it is valuable.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Fri, Nov 5, 2021 at 07:32:12AM -0700, David G. Johnston wrote:
> On Friday, November 5, 2021, Bruce Momjian <bruce@momjian.us> wrote:
> > Perhaps, right before the tip you quoted, something like this:
> >
> >   If your use case requires a length limit on character data, or compliance
> >   with the SQL standard is important, use "character varying".
> >   Otherwise, you are usually better off with "text".
> >
> > I can support that if others think it is valuable.
>
> The motivating complaint is that we should be encouraging people to use
> varchar(4000) instead of text so external tools can optimize. If we are
> not going to do that I really don't see the point in changing away from
> our current position of "only use text". True length limit requirements
> for data are rare, and better done in constraints along with all the
> other constraints that may exist for the data. I believe comments with
> respect to the SQL standard are already present and adequate.

Agreed.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Fri, 2021-11-05 at 11:27 -0400, Bruce Momjian wrote:
> On Fri, Nov 5, 2021 at 07:32:12AM -0700, David G. Johnston wrote:
> > On Friday, November 5, 2021, Bruce Momjian <bruce@momjian.us> wrote:
> > > Perhaps, right before the tip you quoted, something like this:
> > >
> > >   If your use case requires a length limit on character data, or
> > >   compliance with the SQL standard is important, use "character varying".
> > >   Otherwise, you are usually better off with "text".
> > >
> > > I can support that if others think it is valuable.
> >
> > The motivating complaint is that we should be encouraging people to use
> > varchar(4000) instead of text so external tools can optimize. If we are
> > not going to do that I really don't see the point in changing away from
> > our current position of "only use text". True length limit requirements
> > for data are rare, and better done in constraints along with all the
> > other constraints that may exist for the data. I believe comments with
> > respect to the SQL standard are already present and adequate.
>
> Agreed.

+1, so let's leave it as it is.

Yours,
Laurenz Albe
Dear Moderators,

With this approach, why don't you eliminate char and varchar then?

Thanks for thinking it over.

BR
Grega
On Thu, 2021-11-11 at 09:44 +0000, Grega Jesih wrote:
> With this approach, why don't you eliminate char and varchar then?
> Thanks for thinking it over.

Two reasons:

- the SQL standard requires these types
- sometimes you want to enforce a length limit, and a "type modifier"
  like varchar(30) is cheaper than a check constraint

Yours,
Laurenz Albe
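A minimal sketch of the two approaches (hypothetical table names):

    -- Length limit via type modifier: enforced inside the type's input
    -- and cast routines, with no separate expression to evaluate.
    CREATE TABLE t_mod (name varchar(30));

    -- Length limit via CHECK constraint: an extra expression evaluated
    -- for every inserted or updated row, but easy to change later with
    -- ALTER TABLE ... DROP CONSTRAINT / ADD CONSTRAINT.
    CREATE TABLE t_chk (name text CHECK (length(name) <= 30));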