Thread: text fields and performance for ETL
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/9.6/datatype-character.html
Description:

A text field is a field intended for very large texts. Performance within the Postgres database is documented to be the same as for a normal varchar.

But performance in ETL processes involving this data type decreases dramatically, because any process that consumes this kind of data has to calculate its size at the row level and cannot take bigger chunks of data based on a maximum size.

So the conclusion would be to use an appropriate character length that reflects the maximum string size.
On Wed, Nov 3, 2021 at 01:29:19PM +0000, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/9.6/datatype-character.html
> Description:
>
> A text field is a field intended for very large texts.
> Performance within the Postgres database is documented to be the same as
> for a normal varchar.
>
> But performance in ETL processes involving this data type decreases
> dramatically, because any process that consumes this kind of data has to
> calculate its size at the row level and cannot take bigger chunks of data
> based on a maximum size.
>
> So the conclusion would be to use an appropriate character length that
> reflects the maximum string size.

I have no idea what you are saying above. There is no additional
overhead for TEXT vs VARCHAR() in Postgres, so it seems like an overhead
in the applications.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
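For anyone who wants to check this claim directly, a minimal sketch (hypothetical table names; compare timings with psql's \timing):

    -- Two otherwise identical tables, one text and one varchar(n).
    CREATE TABLE t_text    (val text);
    CREATE TABLE t_varchar (val varchar(64));

    -- Load the same generated data into both; both types share the same
    -- variable-length on-disk representation, so the timings come out
    -- essentially equal.
    INSERT INTO t_text    SELECT md5(g::text) FROM generate_series(1, 1000000) g;
    INSERT INTO t_varchar SELECT md5(g::text) FROM generate_series(1, 1000000) g;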
Dear David,
> The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
Then it should say so, because an inexperienced reader takes this limited focus and generalizes it.
It is for the very same reason that programmers don't perceive the need to limit a string to its realistic size that ETL processes
are slowed down.
For example, for a currency code where a 3-character encoding is used, the field should be char(3) and not text.
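To illustrate (a sketch with hypothetical names), the type itself then rejects oversized values:

    CREATE TABLE prices (currency char(3), amount numeric);

    INSERT INTO prices VALUES ('EUR', 9.99);   -- accepted
    INSERT INTO prices VALUES ('EURO', 9.99);  -- rejected:
    -- ERROR:  value too long for type character(3)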
> The documentation assumes that the sizes involved here are reasonable for such behavior
On the contrary. When you say "performance is the same..." you give the wrong impression that it is a universal case. But, in fact, it depends.
When you include such a table in some dataflow from server A to server B, that process will be slowed down. I explained why.
It is important to teach people to update the model to use logical sizes that make sense for such cases.
If you teach instead "ah, it is not important, you may use text", then you actually ignore the part of the IT team that puts this data to use.
With this you don't contribute. You create a problem.
> All of my ETL simply reads
Well, perhaps yours. But there are professional tools (I refer to SSIS in this context) that provide a very fast dataflow when there is a known record size.
In that case you go 100x faster as a rule of thumb. When dealing with billions of records, it makes a big difference.
Best regards
Grega
PS
I work at actual-it.si, and gmail.com mail is forwarded to me, so I took a shortcut here.
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Wednesday, November 3, 2021 3:38 PM
To: grega.jesih@gmail.com; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 6:55 AM PG Doc comments form <noreply@postgresql.org> wrote:
> But performance in ETL processes involving this data type decreases
> dramatically, because any process that consumes this kind of data has to
> calculate its size at the row level and cannot take bigger chunks of data
> based on a maximum size.
All of my ETL simply reads in the entire contents of a text field. There is no chunking. The documentation assumes that the sizes involved here are reasonable for such behavior. If you have a situation where you've chosen to use varchar(n) and can defend that choice, more power to you. Those special circumstances are not of particular interest here. The vast majority of users use varchar(n) because they (or, more likely, their teachers) come from systems where it is required. The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
David J.
Grega Jesih <Grega.Jesih@actual-it.si> writes:
>> The goal in our docs is to point out that using an arbitrary length
>> specification is not required in PostgreSQL.

> Well, perhaps yours. But there are professional tools (I refer to SSIS
> in this context) that provide a very fast dataflow when there is a known
> record size.

That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.

I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.

			regards, tom lane
Dear Tom and David,
> I'll just point out in passing, though, that "fixed record size" for text
> strings is an illusion as soon as you have to deal with non-ASCII data.
> So I'm skeptical that such optimizations are worth much anymore.
It doesn't matter whether it is ASCII or not. A string is a string. A UTF-8 string is also one, just differently encoded.
The optimization itself matters a lot. It means time savings. Plenty of time. So we're talking performance. Not Postgres performance, interface performance.
The new architectures include more and more data exchange among databases.
Now, when you deal with bigger data sizes that go from millions to billions of records, this fixed size versus the undefined size of text becomes very relevant.
But what I am trying to achieve is that you describe both the view local to the database itself and the broader view, integration.
There is also the inner aspect of a database model where, for a currency code of true size 3, you choose char(3) instead of text.
You prevent a dummy insert at the database level, so no text of length > 3 may enter this field.
BR
Grega
Sent: November 3, 2021 6:37:28 PM
To: Grega Jesih
Cc: David G. Johnston; grega.jesih@gmail.com; Pg Docs
Subject: Re: text fields and performance for ETL
>> The goal in our docs is to point out that using an arbitrary length specification is not required in PostgreSQL.
> Well perhaps yours. But there are pro-tools (I refer to SSIS in this context) that provide a very fast dataflow in case there is a known record size.
That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.
I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.
regards, tom lane
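A quick way to see this point in a UTF-8 database (a sketch; any psql session will do):

    -- Three characters is not three bytes once non-ASCII data appears:
    SELECT length('abc'), octet_length('abc');   -- 3, 3
    SELECT length('åäö'), octet_length('åäö');   -- 3, 6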
Grega Jesih <Grega.Jesih@actual-it.si> writes:
> The optimization itself matters a lot. It means time savings. Plenty of
> time. So we're talking performance. Not Postgres performance, interface
> performance.

One more time: our docs are here to explain Postgres performance. It is
very easy to show that char/varchar are strictly worse than text so far
as Postgres is concerned. Other tools need to document their own
performance considerations in their own docs.

Now, if you have a *semantic* consideration, like "a state abbreviation
should be exactly two characters", those datatypes might help you with
enforcing that. But that's not a performance consideration.

			regards, tom lane
David,
> Can you demonstrate, with actual numbers, using today's implementation, a situation where defining a column as char(3) or varchar(3) instead of text has a significant performance improvement?
Sure I can.
But first, what am I to prove? The performance of a dataflow from server A to server B.
What is that performance change based on? It is based on the fact that the tool that pumps the data can calculate the block size and thus work with several
rows at a time instead of one row at a time.
Is this Postgres performance related? NO. It is model design related.
So why would it then be published? Because inexperienced programmers take your statement that "it's the same performance" from the wrong perspective, so it would be fair to note that the remark "it's the same performance" means "within any operation inside the Postgres database". The moment we want to take this data elsewhere,
the problematic lack of model design comes out.
Now for the demo: I can make a video to show it, but here is a discussion in context that talks about the thing:
https://dba.stackexchange.com/questions/102830/avoiding-row-by-row-fetch-method-when-dealing-with-source-lob-columns
If this contribution from stackexchange isn't enough, let me know.
BR
Grega
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Wednesday, November 3, 2021 7:40 PM
To: Grega Jesih <Grega.Jesih@actual-it.si>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; grega.jesih@gmail.com; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 11:09 AM Grega Jesih <Grega.Jesih@actual-it.si> wrote:
> The new architectures include more and more data exchange among databases.
> Now, when you deal with bigger data sizes that go from millions to billions of records, this fixed size versus the undefined size of text becomes very relevant.
Can you demonstrate, with actual numbers, using today's implementation, a situation where defining a column as char(3) or varchar(3) instead of text has a significant performance improvement? Without a concrete example to examine I'm unable to be convinced to move away from the status quo.
You also need to convince me as to why constraints are an insufficient feature. i.e., why is char(3) better than (check length(val) = 3)?
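For concreteness, a sketch of the two alternatives (hypothetical table names):

    -- Length enforced by the type modifier:
    CREATE TABLE t_char  (currency char(3));

    -- Length enforced by a CHECK constraint on an unconstrained type:
    CREATE TABLE t_check (currency text CHECK (length(currency) = 3));

    -- Note: char(3) blank-pads shorter values (e.g. 'EU' becomes 'EU ')
    -- instead of rejecting them, so the CHECK version is the stricter one.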
Even with all that, I'd probably still not do anything beyond reviewing a proposed patch (i.e., I wouldn't try to write one myself from scratch... I don't have authority to commit regardless).
David J.
Hi David.
> We’re entitled to a bit of self-centeredness here, especially when the broader world is so varied
One can tell ;-)
Current text:

Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
Suggested current text addendum:
But if you consider doing ETL from a Postgres database to some outer target environment, and you seek performance in such interfaces, follow the logic of limited-size (varchar or char) data types in your database model. If you build a dataflow out of known-size types, the interface code can take a big block of data, while for text fields it has to check each record.
Optional additional remark:
Another good aspect of known data sizes is easier understanding of the field content and implicit control of the data length.
Best regards
Grega
From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Thursday, November 4, 2021 2:51 PM
To: Grega Jesih <Grega.Jesih@actual-it.si>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL
On Thursday, November 4, 2021, Grega Jesih <Grega.Jesih@actual-it.si> wrote:
> So why would it then be published? Because inexperienced programmers take your statement that "it's the same performance" from the wrong perspective, so it would be fair to note that the remark "it's the same performance" means "within any operation inside the Postgres database". The moment we want to take this data elsewhere,
> the problematic lack of model design comes out.
If you wish to supply an actual patch for consideration, I’d review it. Absent that, the documentation serves the vast majority of readers well as-is. We’re entitled to a bit of self-centeredness here, especially when the broader world is so varied.
David J.
On Fri, 2021-11-05 at 09:51 +0000, Grega Jesih wrote:
> Suggested current text addendum:
>
> But if you consider doing ETL from a Postgres database to some outer
> target environment, and you seek performance in such interfaces, follow
> the logic of limited-size (varchar or char) data types in your database
> model. If you build a dataflow out of known-size types, the interface
> code can take a big block of data, while for text fields it has to check
> each record.

I am opposed to that.

It is not our business to discuss the limitations of a certain third-party
software product. If that were something widespread, perhaps. But I myself
have never seen a problem with "text", as long as the actual size of the
data is moderate.

> Optional additional remark:
>
> Another good aspect of known data sizes is easier understanding of the
> field content and implicit control of the data length.

Something like that makes more sense to me.

Perhaps, right before the tip you quoted, something like this:

  If your use case requires a length limit on character data, or compliance
  with the SQL standard is important, use "character varying".
  Otherwise, you are usually better off with "text".

Yours,
Laurenz Albe
On Fri, Nov 5, 2021 at 03:15:35PM +0100, Laurenz Albe wrote:
> On Fri, 2021-11-05 at 09:51 +0000, Grega Jesih wrote:
> > Suggested current text addendum:
> >
> > But if you consider doing ETL from a Postgres database to some outer
> > target environment, and you seek performance in such interfaces, follow
> > the logic of limited-size (varchar or char) data types in your database
> > model. If you build a dataflow out of known-size types, the interface
> > code can take a big block of data, while for text fields it has to check
> > each record.
>
> I am opposed to that.
>
> It is not our business to discuss the limitations of a certain third-party
> software product. If that were something widespread, perhaps. But I myself
> have never seen a problem with "text", as long as the actual size of the
> data is moderate.

Agreed.

> > Optional additional remark:
> >
> > Another good aspect of known data sizes is easier understanding of the
> > field content and implicit control of the data length.
>
> Something like that makes more sense to me.
>
> Perhaps, right before the tip you quoted, something like this:
>
>   If your use case requires a length limit on character data, or compliance
>   with the SQL standard is important, use "character varying".
>   Otherwise, you are usually better off with "text".

I can support that if others think it is valuable.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Fri, Nov 5, 2021 at 07:32:12AM -0700, David G. Johnston wrote:
> On Friday, November 5, 2021, Bruce Momjian <bruce@momjian.us> wrote:
> > Perhaps, right before the tip you quoted, something like this:
> >
> >   If your use case requires a length limit on character data, or compliance
> >   with the SQL standard is important, use "character varying".
> >   Otherwise, you are usually better off with "text".
> >
> > I can support that if others think it is valuable.
>
> The motivating complaint is that we should be encouraging people to use
> varchar(4000) instead of text so external tools can optimize. If we are
> not going to do that I really don't see the point in changing away from
> our current position of "only use text". True length limit requirements
> for data are rare, and better done in constraints along with all the
> other constraints that may exist for the data. I believe comments with
> respect to the SQL standard are already present and adequate.

Agreed.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Fri, 2021-11-05 at 11:27 -0400, Bruce Momjian wrote:
> On Fri, Nov 5, 2021 at 07:32:12AM -0700, David G. Johnston wrote:
> > On Friday, November 5, 2021, Bruce Momjian <bruce@momjian.us> wrote:
> > > Perhaps, right before the tip you quoted, something like this:
> > >
> > >   If your use case requires a length limit on character data, or
> > >   compliance with the SQL standard is important, use "character varying".
> > >   Otherwise, you are usually better off with "text".
> > >
> > > I can support that if others think it is valuable.
> >
> > The motivating complaint is that we should be encouraging people to use
> > varchar(4000) instead of text so external tools can optimize. If we are
> > not going to do that I really don't see the point in changing away from
> > our current position of "only use text". True length limit requirements
> > for data are rare, and better done in constraints along with all the
> > other constraints that may exist for the data. I believe comments with
> > respect to the SQL standard are already present and adequate.
>
> Agreed.

+1, so let's leave it as it is.

Yours,
Laurenz Albe
Dear Moderators,

With this approach, why don't you eliminate char and varchar then?

Thanks for thinking it over.

BR
Grega
On Thu, 2021-11-11 at 09:44 +0000, Grega Jesih wrote:
> With this approach, why don't you eliminate char and varchar then?
> Thanks for thinking it over.

Two reasons:

- the SQL standard requires these types
- sometimes you want to enforce a length limit, and a "type modifier"
  like varchar(30) is cheaper than a check constraint

Yours,
Laurenz Albe
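A minimal sketch of the two approaches (hypothetical table names):

    -- Length limit via type modifier: enforced inside the type's input
    -- and cast routines, with no separate expression to evaluate.
    CREATE TABLE t_mod (name varchar(30));

    -- Length limit via CHECK constraint: an extra expression evaluated
    -- for every inserted or updated row, but easy to change later with
    -- ALTER TABLE ... DROP CONSTRAINT / ADD CONSTRAINT.
    CREATE TABLE t_chk (name text CHECK (length(name) <= 30));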