Thread: TEXT column > 1Gb

TEXT column > 1Gb

From
Joe Carlson
Date:
Hello,

I’ve recently encountered the issue of trying to insert more than 1 Gb into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1 Gb buffer size limitations.

We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn’t, but it’s not my design and, in fairness, this wasn’t anticipated when the schema was designed). I’d like to implement something that is not a major disruption and keep the mods on the server side. My first idea is to have a chunked associated table (in pseudocode)

CREATE TABLE associated(key_id integer references main_table(key_id), chunk integer, text_start integer, text_end integer, text_chunk TEXT);

And define functions for inserting and selecting by dividing into 1Mb chunks

CREATE FUNCTION insertText(INTEGER,TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk INTEGER := 0;
  key_id ALIAS for $1;
  the_text ALIAS for $2;
  text_chunk TEXT;
BEGIN
  LOOP
    -- substr() is 1-based, so start each 1 MB chunk at chunk*1000000 + 1
    text_chunk := substr(the_text,chunk*1000000 + 1,1000000);
    IF length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    INSERT INTO associated(key_id,chunk,text_start,text_end,text_chunk) VALUES
(key_id,chunk,chunk*1000000,(chunk*1000000+length(text_chunk)),text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;

This apparently runs into the same buffer size issue: I get an ‘invalid message length’ in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1 Gb is also a bad thing.

I’d like to continue to keep the modifications on the server side. And I’d like to believe someone else has had this problem before. Any suggestions other than having the client do the chunking? Can I use a different language binding and get around the argument length limitations?

Thanks




Re: TEXT column > 1Gb

From
Rob Sargent
Date:
On 4/11/23 11:41, Joe Carlson wrote:
Hello,

I’ve recently encountered the issue of trying to insert more than 1 Gb into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1Gb buffer size limitations.

We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn’t. But not my design and, in fairness, was not anticipated when the schema was designed.), I’d like to implement something that is not a major disruption and try to keep the mods on the server side. My first idea is to have a chunked associated table (in pseudo code)

CREATE TABLE associated(key_id integer references main_table(key_id), chunk integer, text_start integer, text_end integer, text_chunk TEXT);

And define functions for inserting and selecting by dividing into 1Mb chunks

CREATE FUNCTION insertText(INTEGER,TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk INTEGER := 0;
  key_id ALIAS for $1;
  the_text ALIAS for $2;
  text_chunk TEXT;
BEGIN
  LOOP
    text_chunk := substr(the_text,chunk*1000000,1000000);
    IF length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    INSERT INTO associated(key_id,chunk,text_start,text_end,text_chunk) VALUES (key_id,chunk,chunk*1000000,(chunk*1000000+length(text_chunk)),text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;

This apparently runs into the same issues of buffers size: I get an ‘invalid message length’ in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1Gb is also a bad thing.

I’d like to continue to keep the modifications on the server size. And I’d like to believe someone else has had this problem before. Any suggestions other than have the client do the chunking? Can I use a different language binding and get around the argument length limitations?

Thanks



I've hit this same limitation in Java (writing to the db).  What is your stack in this case?  Not sure my solution applies.


Re: TEXT column > 1Gb

From
Pavel Stehule
Date:
Hi


On Tue, 11 Apr 2023 at 19:42, Joe Carlson <jwcarlson@lbl.gov> wrote:
Hello,

I’ve recently encountered the issue of trying to insert more than 1 Gb into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1Gb buffer size limitations.

I think this is a misunderstanding.


>>>The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB. (The maximum value that will be allowed for n in the data type declaration is less than that. It wouldn't be useful to change this because with multibyte character encodings the number of characters and bytes can be quite different. If you desire to store long strings with no specific upper limit, use text or character varying without a length specifier, rather than making up an arbitrary length limit.)<<<

My note: nothing is unlimited in this world :-)

1GB is a theoretical limit, but depending on usage, it can sometimes be too much - and the safe limit is about 500MB. Long strings can take too much RAM in some cases.

For longer data you can use the large objects (LO) API: https://www.postgresql.org/docs/current/largeobjects.html

It is much better than extra long strings, and the implementation is more or less what you proposed.
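
For concreteness, a minimal sketch of that approach using the server-side large object functions (lo_from_bytea, lo_put, lo_get); the table and column names, the key value 42, and the toy sequences are only illustrative, not from the thread:

CREATE TABLE sequence_lo (
  key_id  integer PRIMARY KEY,   -- would reference main_table(key_id) in the OP's schema
  seq_oid oid NOT NULL           -- handle to the large object holding the sequence
);

-- Create the large object from a first chunk (loid = 0 lets the server pick the OID).
INSERT INTO sequence_lo (key_id, seq_oid)
VALUES (42, lo_from_bytea(0, convert_to('ACGTACGTAC', 'UTF8')));

-- Append further chunks at a known byte offset, so no single protocol message
-- ever has to carry the whole sequence.
SELECT lo_put(seq_oid, 10, convert_to('GGGTTTAAAC', 'UTF8'))
  FROM sequence_lo WHERE key_id = 42;

-- Read back an arbitrary slice (offset, length) without materializing the whole object.
SELECT convert_from(lo_get(seq_oid, 5, 10), 'UTF8')
  FROM sequence_lo WHERE key_id = 42;

Client drivers can also stream into the same object with the lo_open/lo_write API, which avoids building any single oversized message at all.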

Regards

Pavel
 

We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn’t. But not my design and, in fairness, was not anticipated when the schema was designed.), I’d like to implement something that is not a major disruption and try to keep the mods on the server side. My first idea is to have a chunked associated table (in pseudo code)

CREATE TABLE associated(key_id integer references main_table(key_id), chunk integer, text_start integer, text_end integer, text_chunk TEXT);

And define functions for inserting and selecting by dividing into 1Mb chunks

CREATE FUNCTION insertText(INTEGER,TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk INTEGER := 0;
  key_id ALIAS for $1;
  the_text ALIAS for $2;
  text_chunk TEXT;
BEGIN
  LOOP
    text_chunk := substr(the_text,chunk*1000000,1000000);
    IF length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    INSERT INTO associated(key_id,chunk,text_start,text_end,text_chunk) VALUES (key_id,chunk,chunk*1000000,(chunk*1000000+length(text_chunk)),text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;

This apparently runs into the same issues of buffers size: I get an ‘invalid message length’ in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1Gb is also a bad thing.

I’d like to continue to keep the modifications on the server size. And I’d like to believe someone else has had this problem before. Any suggestions other than have the client do the chunking? Can I use a different language binding and get around the argument length limitations?

Thanks



Re: TEXT column > 1Gb

From
Joe Carlson
Date:
I’m curious what you learned. I’ve been tripping over the buffer allocation issue when either splitting input text into chunks or aggregating chunks in selects. I’ve decided that I need to move this to client side.

The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
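
For reference, the select side of a chunked layout can hand back just the slice that is asked for, so the whole string never has to be re-materialized on either side. A minimal sketch (the getText name and parameters are mine, and it assumes text_start is a 0-based offset with text_end exclusive, as in the insertText pseudocode):

CREATE OR REPLACE FUNCTION getText(p_key_id INTEGER, p_start INTEGER, p_len INTEGER)
RETURNS TEXT AS $$
  SELECT string_agg(
           substr(text_chunk,
                  greatest(p_start - text_start, 0) + 1,                              -- 1-based offset within this chunk
                  least(text_end, p_start + p_len) - greatest(text_start, p_start)),  -- characters needed from this chunk
           '' ORDER BY chunk)
    FROM associated
   WHERE key_id = p_key_id
     AND text_start < p_start + p_len   -- chunk begins before the slice ends
     AND text_end   > p_start           -- chunk ends after the slice begins
$$ LANGUAGE sql STABLE;

-- e.g. the 5 kb immediately upstream of a feature starting at offset 1,234,567:
-- SELECT getText(42, 1234567 - 5000, 5000);

Indexing associated(key_id, chunk) or (key_id, text_start) keeps the lookup cheap, since only the chunks overlapping the requested range are touched.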

And we have a menagerie of code to deal with. Legacy Perl for loading. Clients in Perl, Java, node, python. A whole zoo. 

Thanks

On Apr 11, 2023, at 10:51 AM, Rob Sargent <robjsargent@gmail.com> wrote:


On 4/11/23 11:41, Joe Carlson wrote:
Hello,

I’ve recently encountered the issue of trying to insert more than 1 Gb into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1Gb buffer size limitations.

We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn’t. But not my design and, in fairness, was not anticipated when the schema was designed.), I’d like to implement something that is not a major disruption and try to keep the mods on the server side. My first idea is to have a chunked associated table (in pseudo code)

CREATE TABLE associated(key_id integer references main_table(key_id), chunk integer, text_start integer, text_end integer, text_chunk TEXT);

And define functions for inserting and selecting by dividing into 1Mb chunks

CREATE FUNCTION insertText(INTEGER,TEXT) RETURNS INTEGER AS $$
DECLARE
  chunk INTEGER := 0;
  key_id ALIAS for $1;
  the_text ALIAS for $2;
  text_chunk TEXT;
BEGIN
  LOOP
    text_chunk := substr(the_text,chunk*1000000,1000000);
    IF length(text_chunk) = 0 THEN
      EXIT;
    END IF;
    INSERT INTO associated(key_id,chunk,text_start,text_end,text_chunk) VALUES (key_id,chunk,chunk*1000000,(chunk*1000000+length(text_chunk)),text_chunk);
    chunk := chunk + 1;
  END LOOP;
  RETURN chunk;
END;
$$ LANGUAGE plpgsql;

This apparently runs into the same issues of buffers size: I get an ‘invalid message length’ in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1Gb is also a bad thing.

I’d like to continue to keep the modifications on the server size. And I’d like to believe someone else has had this problem before. Any suggestions other than have the client do the chunking? Can I use a different language binding and get around the argument length limitations?

Thanks



I've hit this same limitation in Java (with write to db).  What is your stack in this case?  Not sure my solution applies.


Re: TEXT column > 1Gb

From
Rob Sargent
Date:
On 4/12/23 08:59, Joe Carlson wrote:
> I’m curious what you learned. I’ve been tripping over the buffer 
> allocation issue when either splitting input text into chunks or 
> aggregating chunks in selects. I’ve decided that I need to move this 
> to client side.
>
> The use case is genomics. Extracting substrings is common. So going to 
> chunked storage makes sense.
>
> And we have a menagerie of code to deal with. Legacy Perl for loading. 
> Clients in Perl, Java, node, python. A whole zoo.
>
> Thanks

My first stab was a Java implementation based on the Selector class, which worked OK using a fixed buffer size, but finding the end of the last transmission was hokey.  I moved to an embedded Tomcat implementation and Tomcat does the heavy lifting PERFECTLY.  It slipped easily into both client and backend.

rjs




Re: TEXT column > 1Gb

From
Mark Dilger
Date:

> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson@lbl.gov> wrote:
>
> The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.

Are you storing nucleotide sequences as text strings?  If using the simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using a nucleotide code 16-character alphabet you can still get two bases per byte.  An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so forth.  Such a representation might allow you to store sequences two or four times longer than the limit you currently hit, but then you are still at an impasse.  Would a factor of 2x or 4x be enough for your needs?
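
To make the arithmetic concrete, here is a minimal plpgsql sketch of the 2-bit case (four bases per byte) for a pure A/C/G/T string. The function name and encoding are purely illustrative, and a character-at-a-time loop like this is far too slow for gigabase inputs; it only shows the packing itself:

CREATE OR REPLACE FUNCTION pack_acgt(seq TEXT) RETURNS BYTEA AS $$
DECLARE
  packed BYTEA;
  code   INTEGER;
  b      INTEGER;
  i      INTEGER;
BEGIN
  -- zero-filled buffer, one byte per four bases
  packed := decode(repeat('00', (length(seq) + 3) / 4), 'hex');
  FOR i IN 1 .. length(seq) LOOP
    code := CASE substr(seq, i, 1)
              WHEN 'A' THEN 0 WHEN 'C' THEN 1
              WHEN 'G' THEN 2 WHEN 'T' THEN 3
            END;
    IF code IS NULL THEN
      RAISE EXCEPTION 'unexpected base % at position %', substr(seq, i, 1), i;
    END IF;
    -- drop the 2-bit code into its slot within the byte
    b := get_byte(packed, (i - 1) / 4);
    packed := set_byte(packed, (i - 1) / 4, b | (code << (((i - 1) % 4) * 2)));
  END LOOP;
  RETURN packed;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- 'ACGT' packs into one byte: 0 | (1 << 2) | (2 << 4) | (3 << 6) = 228, i.e. \xe4
-- SELECT pack_acgt('ACGT');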

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: TEXT column > 1Gb

From
Joe Carlson
Date:
I’ve certainly thought about using a different representation. A factor of 2x would be good, for a while anyway. For nucleotide sequence, we’d need to consider a 10-character alphabet (A, C, G, T, N and the lower case forms when representing ’soft masked’ sequence*). So it would be 2 bases/byte. (Proteins are not nearly so long, so straight storage is simpler.) But these would be bigger changes on the client side than storing in chunks, so I think this is the way to go.

We’re working with plant genomes, which, compared to human chromosomes, are HUGE. One chromosome of fava bean is over a gig. And pine tree is another monster. This, together with the fact that sequence data collection and assembly have improved so much in the past couple of years, has forced us to rethink a lot of our data storage assumptions.

* For those curious: especially in plants, much of the sequence consists of repetitive elements that are remnants of ancient viruses, simple repeats and the like. People who want to identify particular functional components in a genome typically do not want to search against this sequence but restrict searching to coding regions. But the repetitive sequence is still important and we need to keep it.

> On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson@lbl.gov> wrote:
>>
>> The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
>
> Are you storing nucleotide sequences as text strings?  If using the simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using a nucleotide code 16-character alphabet you can still get two bases per byte.  An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so forth.  Such a representation might allow you to store sequences two or four times longer than the limit you currently hit, but then you are still at an impasse.  Would a factor or 2x or 4x be enough for your needs?
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>




Re: TEXT column > 1Gb

From
Benedict Holland
Date:
For documents that long I would seriously consider using large objects and referencing them with their OIDs. Text fields get put in a special location within the database. It's similar (possibly identical) to using large objects. Also, you can potentially compress them to save space on write and read. 1 GB of text is a lot of text.


Thanks,
Ben

On Wed, Apr 12, 2023, 1:20 PM Joe Carlson <jwcarlson@lbl.gov> wrote:
I’ve certainly thought about using a different representation. A factor of 2x would be good, for a while anyway. For nucleotide sequence, we’d need to consider a 10 character alphabet (A, C, G, T, N and the lower case forms when representing ’soft masked’ sequence*). So it would be 2 bases/byte. (Proteins are not nearly so long so a straight storage is simpler.) But these would be bigger changes on the client side than storing in chunks so I think this is the way to go.

We’re working with plant genomes, which compared to human chromosomes, are HUGE. One chromosome of fava bean is over a gig. And pine tree is another monster. This, together with the fact that sequence data collection and assembly have improved so much in the past couple years has forced us to rethink a lot of our data storage assumptions.

* for those curious, especially in plants, much of sequence consists of repetitive element that are remnants of ancient viruses, simple repeats and the like. For people who want to identify particular functional components in a genome, they typically do not want to search against this sequence but restrict searching to coding regions. But the repetitive sequence is still important and we need to keep it.

> On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson@lbl.gov> wrote:
>>
>> The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
>
> Are you storing nucleotide sequences as text strings?  If using the simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using a nucleotide code 16-character alphabet you can still get two bases per byte.  An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so forth.  Such a representation might allow you to store sequences two or four times longer than the limit you currently hit, but then you are still at an impasse.  Would a factor or 2x or 4x be enough for your needs?
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>



Re: TEXT column > 1Gb

From
Rob Sargent
Date:
On 4/12/23 11:24, Benedict Holland wrote:
> For documents that long I would seriously consider using large objects 
> and refencing them with their OIDs. Text fields get put in a special 
> location within the database. It's similar (possibly exactly) to using 
> large objects. Also, you can potentially compress them to save space 
> on write and read. 1gb of text is a lot of text.
>

I've gone down the premature-optimization coded/compressed route and am still regretting it. I would either not bother or move to LO as per Benedict's suggestion.  The test may come down to whether you plan to use database features (functions, procedures) to work on the data versus feeding it out to the endless list of sequence/genetic analysis tools.  General compression is robust and fast.  My custom code/comp is likely neither :( .








Re: TEXT column > 1Gb

From
Ron
Date:
Must the genome all be in one big file, or can you store them one line per table row?

On 4/12/23 12:19, Joe Carlson wrote:
I’ve certainly thought about using a different representation. A factor of 2x would be good, for a while anyway. For nucleotide sequence, we’d need to consider a 10 character alphabet (A, C, G, T, N and the lower case forms when representing ’soft masked’ sequence*). So it would be 2 bases/byte. (Proteins are not nearly so long so a straight storage is simpler.) But these would be bigger changes on the client side than storing in chunks so I think this is the way to go.

We’re working with plant genomes, which compared to human chromosomes, are HUGE. One chromosome of fava bean is over a gig. And pine tree is another monster. This, together with the fact that sequence data collection and assembly have improved so much in the past couple years has forced us to rethink a lot of our data storage assumptions. 

* for those curious, especially in plants, much of sequence consists of repetitive element that are remnants of ancient viruses, simple repeats and the like. For people who want to identify particular functional components in a genome, they typically do not want to search against this sequence but restrict searching to coding regions. But the repetitive sequence is still important and we need to keep it.

On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:



On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson@lbl.gov> wrote:

The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
Are you storing nucleotide sequences as text strings?  If using the simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using a nucleotide code 16-character alphabet you can still get two bases per byte.  An amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so forth.  Such a representation might allow you to store sequences two or four times longer than the limit you currently hit, but then you are still at an impasse.  Would a factor or 2x or 4x be enough for your needs? 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






--
Born in Arizona, moved to Babylonia.

Re: TEXT column > 1Gb

From
Rob Sargent
Date:
On 4/12/23 13:02, Ron wrote:
Must the genome all be in one big file, or can you store them one line per table row?

Not sure what the OP is doing with plant genomes (other than some genomics) but the tools all use files and pipelines of sub-tools.  In and out of tuples would be expensive.  Very, very little "editing" done in the usual "update table set val where id" sense.

Lines in a VCF file can have thousands of columns of nasty, cryptic garbage data that only really makes sense to the tools reading it.  Highly denormalized of course.  (Btw, I hate sequencing :) )



Re: TEXT column > 1Gb

From
Benedict Holland
Date:
Yeah. For ease of use, an out-of-the-box solution that will just work: large objects. You might know them as BLOBs in other SQL varieties. If you are dealing with that much data, I'm going to assume that storage isn't really your concern. I wouldn't even waste time compressing. I use them frequently to store all sorts of weird objects like pictures or serialized pickle files. They are really fast and extremely easy to use. They do not play nicely with a lot of 3rd party software, particularly UIs sitting on top of a database, but again, if that isn't a concern or you can use stored procedures for the selects, it should be just fine.

On Wed, Apr 12, 2023, 3:21 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 4/12/23 13:02, Ron wrote:
Must the genome all be in one big file, or can you store them one line per table row?

Not sure what OP is doing with plant genomes (other than some genomics) but the tools all use files and pipeline of sub-tools.  In and out of tuples would be expensive.  Very,very little "editing" done in the usual "update table set val where id" sense.

Lines in a vcf file can have thousands of colums fo nasty, cryptic garbage data that only really makes sense to tools, reader.  Highly denormalized of course.  (Btw, I hate sequencing :) )



Re: TEXT column > 1Gb

From
Ron
Date:
On 4/12/23 14:21, Rob Sargent wrote:
On 4/12/23 13:02, Ron wrote:
Must the genome all be in one big file, or can you store them one line per table row?

Not sure what OP is doing with plant genomes (other than some genomics) but the tools all use files and pipeline of sub-tools.  In and out of tuples would be expensive.  Very,very little "editing" done in the usual "update table set val where id" sense.

Lines in a vcf file can have thousands of colums fo nasty, cryptic garbage data that only really makes sense to tools, reader.  Highly denormalized of course.  (Btw, I hate sequencing :) )

My thinking is that you'd never hit a 1 GiB limit if every line in the vcf file was in its own row in a table.

Performance would drop, though.

--
Born in Arizona, moved to Babylonia.

Re: TEXT column > 1Gb

From
Joe Carlson
Date:


On Apr 12, 2023, at 12:21 PM, Rob Sargent <robjsargent@gmail.com> wrote:

On 4/12/23 13:02, Ron wrote:
Must the genome all be in one big file, or can you store them one line per table row?

The assumption in the schema I’m using is 1 chromosome per record. Chromosomes are typically strings of continuous sequence (A, C, G, or T) separated by gaps (N) of approximately known, or completely unknown size. In the past this has not been a problem since sequenced chromosomes were maybe 100 megabases. But sequencing is better now with the technology improvements and tackling more complex genomes. So gigabase chromosomes are common. 

A typical use case might be someone interested in seeing if they can identify the regulatory elements (the on or off switches) of a gene. The protein coding part of a gene can be predicted pretty reliably, but the upstream untranslated region and regulatory elements are tougher. So they might come to our web site and want to extract the 5 kb bit of sequence before the start of the gene and look for some of the common motifs that signify a protein binding site. Being able to pull out a substring of the genome quickly to drive a web app is something we want to do.

Not sure what OP is doing with plant genomes (other than some genomics) but the tools all use files and pipeline of sub-tools.  In and out of tuples would be expensive.  Very,very little "editing" done in the usual "update table set val where id" sense.

Yeah, it’s basically a warehouse. Stick data in, but then make all the connections between the functional elements, their products and the predictions on the products. It’s definitely more than a document store and we require a relational database.

Lines in a vcf file can have thousands of colums fo nasty, cryptic garbage data that only really makes sense to tools, reader.  Highly denormalized of course.  (Btw, I hate sequencing :) )

Imagine a discipline where some beleaguered grad student has to get something out the door by the end of the term. It gets published and the rest of the community says GREAT! We have a standard! Then the abuse of the standard happens. People who specialize in bioinformatics know just enough computer science, statistics and molecular biology to annoy experts in three different fields.

Re: TEXT column > 1Gb

From
Rob Sargent
Date:
On 4/12/23 15:03, Joe Carlson wrote:


On Apr 12, 2023, at 12:21 PM, Rob Sargent <robjsargent@gmail.com> wrote:

On 4/12/23 13:02, Ron wrote:
Must the genome all be in one big file, or can you store them one line per table row?

The assumption in the schema I’m using is 1 chromosome per record. Chromosomes are typically strings of continuous sequence (A, C, G, or T) separated by gaps (N) of approximately known, or completely unknown size. In the past this has not been a problem since sequenced chromosomes were maybe 100 megabases. But sequencing is better now with the technology improvements and tackling more complex genomes. So gigabase chromosomes are common. 

A typical use case might be from someone interested in seeing if they can identify the regulatory elements (the on or off switches) of a gene. The protein coding part of a gene can be predicted pretty reliably, but the upstream untranslated region and regulatory elements are tougher. So they might come to our web site and want to extract the 5 kb bit of sequence before the start of the gene and look for some of the common motifs that signify a protein binding site. Being able to quickly pull out a substring of the genome to drive a web app is something we want to do quickly. 


Well, if you're actually using the sequence, both text and bytea are inherently substring friendly.  Your problem goes back to transferring large strings, and that's where http/tomcat is your friend.  Sounds like you're web friendly already.  You have to stream from the client/supplier, of course.
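
To illustrate the substring-friendly point: with out-of-line, uncompressed TOAST storage, substr() on a wide text or bytea column only fetches the chunks that cover the requested range instead of detoasting the whole value. A minimal sketch, assuming a seq TEXT column on the OP's main_table (the column name and example key are assumptions):

-- Keep the big column out of line and uncompressed so slices are cheap to fetch
-- (affects newly stored values only).
ALTER TABLE main_table ALTER COLUMN seq SET STORAGE EXTERNAL;

-- Pull 5 kb starting at (1-based) position 1,000,001 without reading the whole chromosome.
SELECT substr(seq, 1000001, 5000)
  FROM main_table
 WHERE key_id = 42;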