Thread: Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

Tue, 26 Jun 2012 17:04:42 -0400 Robert Haas wrote:

 

> I see you posted up a follow-up email asking Tom what he had in mind.
> Personally, I don't think this needs incredibly complicated testing.
> I think you should just test a workload involving inserting and/or
> updating rows with lots of trailing NULL columns, and then another
> workload with a table of similar width that... doesn't.  If we can't
> find a regression - or, better, we find a win in one or both cases -
> then I think we're done here.

 

As per the last discussion for this patch, performance data needs to be provided before this patch's Review can proceed further.

So as per your suggestion and from the discussions about this patch, I have collected the performance data as below:

 

Results were taken with the following configuration.
1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT type.
2. shared_buffers = 10GB
3. All performance results were taken with a single connection.
4. Performance was collected for the INSERT operation (insert into temptable select * from inittable)

Platform details:
    Operating System: Suse-Linux 10.2 x86_64
    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
    RAM : 24GB

Documents Attached:
init.sh                      : Creates the schema
sql_used.sql                 : SQL statements used for taking the results
Trim_Nulls_Perf_Report.html  : Performance data

 

 

Observations from Performance Results
------------------------------------------------

1. There is no performance change when all columns have valid (non-NULL) values.

2. There is a visible performance increase when more than 60~70% of the columns contain NULLs in a table with a large number of columns.

3. There are visible space savings when more than 60~70% of the columns contain NULLs in a table with a large number of columns.


Let me know if more performance data needs to be collected for this patch.

 

With Regards,

Amit Kapila. 

On Saturday, October 13, 2012 1:24 PM Amit kapila wrote:
Tue, 26 Jun 2012 17:04:42 -0400 Robert Haas wrote:

>> I see you posted up a follow-up email asking Tom what he had in mind.
>> Personally, I don't think this needs incredibly complicated testing.
>> I think you should just test a workload involving inserting and/or
>> updating rows with lots of trailing NULL columns, and then another
>> workload with a table of similar width that... doesn't.  If we can't
>> find a regression - or, better, we find a win in one or both cases -
>> then I think we're done here.

>As per the last discussion for this patch, performance data needs to be provided before this patch's Review can
>proceed further.
>So as per your suggestion and from the discussions about this patch, I have collected the performance data as below:

>Results were taken with the following configuration.
>1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT type.
>2. shared_buffers = 10GB
>3. All performance results were taken with a single connection.
>4. Performance was collected for the INSERT operation (insert into temptable select * from inittable)

>Platform details:
>    Operating System: Suse-Linux 10.2 x86_64
>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>    RAM : 24GB

Further to Performance data, I have completed the review of the Patch.

Basic stuff:
------------
        - Rebase of the patch is required,
                as the heap_fill_tuple function prototype has moved to a different file (htup.h to htup_details.h).
        - Compiles cleanly without any errors/warnings
        - Regression tests pass.


Code Review comments:
---------------------
1. There is a possibility of memory growth with toast tables if trailing toasted columns are updated to NULL.
   In toast_insert_or_update, for tuples where the 'need_change' variable is true, numAttrs is reduced to the
   last non-null column, and the de-toasted columns of the old tuple are not freed. If this repeats for a
   large number of tuples, there is a chance of running out of memory.

        if (need_change)
        {
                numAttrs = lastNonNullValOffset + 1;
                ....
        }

        if (need_delold)
                for (i = 0; i < numAttrs; i++)  <== Trailing toasted values wouldn't be freed, as they were
                                                    updated to NULL and numAttrs was reduced to a smaller value.
                        if (toast_delold[i])
                                toast_delete_datum(rel, toast_oldvalues[i]);
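
The effect of the reduced loop bound can be shown with a small stand-alone sketch. It is not the PostgreSQL code: the helper below only counts the entries the cleanup loop would reach, the flag array and column positions are made-up example data, and looping over the original attribute count is shown purely for contrast, not as the fix the patch author adopted.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the cleanup loop in toast_insert_or_update():
 * counts how many flagged old values the loop reaches instead of calling
 * toast_delete_datum() on them.
 */
int
count_old_values_cleaned(const bool *toast_delold, int loop_bound)
{
    int cleaned = 0;

    for (int i = 0; i < loop_bound; i++)
    {
        if (toast_delold[i])
            cleaned++;          /* the real code would delete the old datum here */
    }
    return cleaned;
}

/*
 * Made-up scenario: a 5-column row whose columns 1 and 4 held toasted
 * values. The update sets column 4 to NULL, so the last non-null column
 * is at offset 2 and the loop bound shrinks to 3.
 */
int
cleaned_with_truncated_bound(void)
{
    const bool toast_delold[5] = {false, true, false, false, true};
    int lastNonNullValOffset = 2;
    int numAttrs = lastNonNullValOffset + 1;    /* as in the reviewed snippet */

    return count_old_values_cleaned(toast_delold, numAttrs);
}

/* Same data, but the loop runs over the attribute count before truncation. */
int
cleaned_with_original_bound(void)
{
    const bool toast_delold[5] = {false, true, false, false, true};
    int origNumAttrs = 5;

    return count_old_values_cleaned(toast_delold, origNumAttrs);
}
```

With the truncated bound the loop reaches only the flagged entry at index 1; the flagged entry at index 4 is skipped, which is the situation the review comment describes.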

2. Comments need to be updated in the following functions to describe how trailing null columns are skipped in the header part:
        heap_fill_tuple - function header
        heap_form_tuple, heap_form_minimal_tuple.

3. Why is the following change required in function toast_flatten_tuple_attribute?
        -        numAttrs = tupleDesc->natts;
        +        numAttrs = HeapTupleHeaderGetNatts(olddata);


A detailed performance report for INSERT and UPDATE operations is attached to this mail.

Observations from Performance Results
------------------------------------------------
1. There is no performance change when all columns have valid (non-NULL) values.
2. There is a visible performance increase when more than 60~70% of the columns contain NULLs in a table with a
large number of columns.
3. There are visible space savings when more than 60~70% of the columns contain NULLs in a table with a large
number of columns.


With Regards,
Amit Kapila.
From: pgsql-hackers-owner@postgresql.org [pgsql-hackers-owner@postgresql.org] on behalf of Amit kapila
[amit.kapila@huawei.com]
Sent: Monday, October 15, 2012 7:28 PM
To: robertmhaas@gmail.com; josh@agliodbs.com
Cc: pgsql-hackers@postgresql.org
Subject: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap
[Review]

On Monday, October 15, 2012 7:28 PM Amit kapila wrote:
On Saturday, October 13, 2012 1:24 PM Amit kapila wrote:
Tue, 26 Jun 2012 17:04:42 -0400 Robert Haas wrote:

>> I see you posted up a follow-up email asking Tom what he had in mind.
>> Personally, I don't think this needs incredibly complicated testing.
>> I think you should just test a workload involving inserting and/or
>> updating rows with lots of trailing NULL columns, and then another
>> workload with a table of similar width that... doesn't.  If we can't
>> find a regression - or, better, we find a win in one or both cases -
>> then I think we're done here.

>As per the last discussion for this patch, performance data needs to be provided before this patch's Review can
>proceed further.
>So as per your suggestion and from the discussions about this patch, I have collected the performance data as below:

>Results were taken with the following configuration.
>1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT type.
>2. shared_buffers = 10GB
>3. All performance results were taken with a single connection.
>4. Performance was collected for the INSERT operation (insert into temptable select * from inittable)

>Platform details:
>    Operating System: Suse-Linux 10.2 x86_64
>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>    RAM : 24GB

> Further to Performance data, I have completed the review of the Patch.

Please find attached the patch that addresses the review comments.

IMO, it is now ready for a committer.


With Regards,
Amit Kapila.
On 13 October 2012 08:54, Amit kapila <amit.kapila@huawei.com> wrote:

> As per the last discussion for this patch, performance data needs to be
> provided before this patch's Review can proceed further.
>
> So as per your suggestion and from the discussions about this patch, I have
> collected the performance data as below:
>
>
>
> Results were taken with the following configuration.
> 1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT
> type.
> 2. shared_buffers = 10GB
> 3. All performance results were taken with a single connection.
> 4. Performance was collected for the INSERT operation (insert into temptable
> select * from inittable)
>
> Platform details:
>     Operating System: Suse-Linux 10.2 x86_64
>     Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>     RAM : 24GB
>
> Documents Attached:
> init.sh                      : Creates the schema
> sql_used.sql                 : SQL statements used for taking the results
> Trim_Nulls_Perf_Report.html  : Performance data
>
>
> Observations from Performance Results
> ------------------------------------------------
>
> 1. There is no performance change when all columns have valid
> (non-NULL) values.
>
> 2. There is a visible performance increase when more than 60~70% of the
> columns contain NULLs in a table with a large number of columns.
>
> 3. There are visible space savings when more than 60~70% of the columns
> contain NULLs in a table with a large number of columns.
>
>
> Let me know if more performance data needs to be collected for this
> patch.


I can't make sense of your performance report. Because of that I can't
derive the same conclusions from it you do.

Can you explain the performance results in more detail, so we can see
what they mean? Like which are the patched, which are the unpatched
results? Which results are comparable, what the percentages mean etc..

We might then move quickly towards commit, or at least more tests.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Thursday, December 20, 2012 5:46 PM Simon Riggs wrote:
> On 13 October 2012 08:54, Amit kapila <amit.kapila@huawei.com> wrote:
> 
> > As per the last discussion for this patch, performance data needs to be
> > provided before this patch's Review can proceed further.
> >
> > So as per your suggestion and from the discussions about this patch, I have

> >
> > ------------------------------------------------
> >
> > 1. There is no performance change when all columns have valid
> > (non-NULL) values.
> >
> > 2. There is a visible performance increase when more than 60~70% of the
> > columns contain NULLs in a table with a large number of columns.
> >
> > 3. There are visible space savings when more than 60~70% of the columns
> > contain NULLs in a table with a large number of columns.
> >
> >
> > Let me know if more performance data needs to be collected for this
> > patch.
> 
> 
> I can't make sense of your performance report. Because of that I can't
> derive the same conclusions from it you do.
> 
> Can you explain the performance results in more detail, so we can see
> what they mean? Like which are the patched, which are the unpatched
> results? 
On the extreme left it is mentioned whether the results are for the Original Code or the Trim Trailing Nulls Patch.
In any case, I have framed the results again as below:
1. Table with 800 columns 
A. INSERT tuples with 600 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+---------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+----------------------
1A:  0.2068  250000  | 0.2302  222223      | 10.1% tps, 11.1% space 
1B:  0.0448  500000  | 0.0481  472223      |  6.8% tps,  5.6% space 
1C:  0.0433  750000  | 0.0493  694445      | 12.2% tps,  7.4% space 

2. Table with 800 columns 
A. INSERT tuples with 300 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+---------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+----------------------
2A:  0.0280   666667 | 0.0287   666667     | 2.3% tps, 0% space 
2B:  0.0143  1333334 | 0.0152  1333334     | 5.3% tps, 0% space 
2C:  0.0145  2000000 | 0.0149  2000000     | 2.9% tps, 0% space 

3. Table with 300 columns 
A. INSERT tuples with 150 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+--------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+--------------------
3A:  0.2815  166667  | 0.2899  166667      | 2.9% tps, 0% space 
3B:  0.0851  333334  | 0.0870  333334      | 2.2% tps, 0% space 
3C:  0.0846  500000  | 0.0852  500000      | 0.7% tps, 0% space 

4. Table with 300 columns 
A. INSERT tuples with 250 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+-------------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+-------------------------
4A:  0.5447    66667 | 0.5996   58824      |   9.2% tps, 11.8% space
4B:  0.1251   135633 | 0.1232  127790      |  -1.5% tps,  5.8% space
4C:  0.1223   202299 | 0.1361  186613      |  10.1% tps,  7.5% space
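
The message does not define how the Improvement(%) figures are computed. Most of the published numbers look consistent with the reading sketched below, which is an inference from the tables and not something the author states: the TPS gain appears to be taken relative to the patched value, and the space saving relative to the original page count. The function names are illustrative only, and a few rows (2B, for example) differ by some tenths of a percent, possibly because the TPS values shown are rounded.

```c
/*
 * Inferred convention (an assumption, not stated in the thread):
 * TPS improvement as a percentage of the patched TPS.
 */
double
tps_improvement_pct(double tps_original, double tps_patched)
{
    return (tps_patched - tps_original) / tps_patched * 100.0;
}

/*
 * Inferred convention (an assumption, not stated in the thread):
 * pages saved as a percentage of the original page count.
 */
double
space_saving_pct(long pages_original, long pages_patched)
{
    return (double) (pages_original - pages_patched)
        / (double) pages_original * 100.0;
}
```

For row 1A, `tps_improvement_pct(0.2068, 0.2302)` comes to roughly 10.2 and `space_saving_pct(250000, 222223)` to roughly 11.1, close to the 10.1% and 11.1% in the table.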

Please let me know, if still it is not clear.

With Regards,
Amit Kapila.





On 20 December 2012 14:56, Amit Kapila <amit.kapila@huawei.com> wrote:

>> > 1. There is no performance change when all columns have valid
>> > (non-NULL) values.

I don't see any tests (at all) that measure this.

I'm particularly interested in lower numbers of columns, so we can
show no regression for the common case.


>> > 2. There is a visible performance increase when more than 60~70% of the
>> > columns contain NULLs in a table with a large number of columns.
>> >
>> > 3. There are visible space savings when more than 60~70% of the columns
>> > contain NULLs in a table with a large number of columns.

Agreed.

I would call that quite disappointing though and was expecting better.
Are we sure the patch works and the tests are correct?

The lack of any space saving for lower % values is strange and
somewhat worrying. There should be a 36? byte saving for 300 null
columns in an 800 column table - how does that not show up at all?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Simon Riggs <simon@2ndQuadrant.com> writes:
> The lack of any space saving for lower % values is strange and
> somewhat worrying. There should be a 36? byte saving for 300 null
> columns in an 800 column table - how does that not show up at all?

You could only fit about 4 such rows in an 8K page (assuming the columns
are all int4s).  Unless the savings is enough to allow 5 rows to fit in
a page, the effective savings will be zilch.

This may well mean that the whole thing is a waste of time in most
scenarios --- the more likely it is to save anything, the more likely
that the savings will be lost anyway due to page alignment
considerations, because wider rows inherently pack less efficiently.
        regards, tom lane
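
The page-packing argument can be checked with a rough model. The sketch below assumes the default 8192-byte block, a 24-byte page header, a 4-byte line pointer per row, a 23-byte fixed tuple header followed by a null bitmap of one bit per stored attribute, and 8-byte alignment; it also assumes every row keeps a null bitmap. The function names are illustrative, and fillfactor and other details are ignored.

```c
#include <stddef.h>

#define BLCKSZ              8192    /* default PostgreSQL page size */
#define PAGE_HEADER_SIZE    24      /* SizeOfPageHeaderData */
#define LINE_POINTER_SIZE   4       /* sizeof(ItemIdData) */
#define TUPLE_FIXED_HEADER  23      /* offsetof(HeapTupleHeaderData, t_bits) */
#define BITMAPLEN(natts)    (((natts) + 7) / 8)    /* null bitmap bytes */

/* Round up to a multiple of 8 (assumed MAXALIGN on x86_64). */
size_t
maxalign(size_t len)
{
    return (len + 7) & ~((size_t) 7);
}

/* Bytes one row occupies on a page: aligned header + data, plus line pointer. */
size_t
tuple_footprint(int stored_natts, size_t data_bytes)
{
    size_t hoff = maxalign((size_t) (TUPLE_FIXED_HEADER + BITMAPLEN(stored_natts)));

    return maxalign(hoff + data_bytes) + LINE_POINTER_SIZE;
}

/* Whole rows that fit in one page under this model. */
size_t
rows_per_page(int stored_natts, size_t data_bytes)
{
    return (BLCKSZ - PAGE_HEADER_SIZE) / tuple_footprint(stored_natts, data_bytes);
}
```

Under these assumptions an 800-column row with 500 non-null int4 values (2000 data bytes) fits 3 per page whether the bitmap covers 800 or 500 attributes, so 2,000,000 rows need 666,667 pages either way, as in case 2 of the results. With only 200 non-null values (800 data bytes) the count goes from 8 to 9 rows per page, which agrees with the 250,000 and 222,223 pages reported for case 1A.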



On 23 December 2012 17:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> The lack of any space saving for lower % values is strange and
>> somewhat worrying. There should be a 36? byte saving for 300 null
>> columns in an 800 column table - how does that not show up at all?
>
> You could only fit about 4 such rows in an 8K page (assuming the columns
> are all int4s).  Unless the savings is enough to allow 5 rows to fit in
> a page, the effective savings will be zilch.

If that's the case, the use case is tiny, especially considering how
sensitive the saving is to the exact location of the NULLs.

> This may well mean that the whole thing is a waste of time in most
> scenarios --- the more likely it is to save anything, the more likely
> that the savings will be lost anyway due to page alignment
> considerations, because wider rows inherently pack less efficiently.

ISTM that we'd get a better gain and a wider use case by compressing
the whole block, with some bits masked out to allow updates/deletes.
The string of zeroes in the null bitmap would compress easily, but so
would other aspects also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Sunday, December 23, 2012 8:11 PM Simon Riggs wrote:
> On 20 December 2012 14:56, Amit Kapila <amit.kapila@huawei.com> wrote:
> 
> >> > 1. There is no performance change when all columns have valid
> >> > (non-NULL) values.
> 
> I don't see any tests (at all) that measure this.
> 
> I'm particularly interested in lower numbers of columns, so we can
> show no regression for the common case.

For now I have taken readings for 300 columns; I can take readings for
10~30 columns as well if required.

1. Table with 300 columns (all integer columns) 
A. INSERT tuples without trailing nulls 
B. UPDATE last column value to "null" 
----------------+---------------------+------------------
Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS        |  TPS                | Results
----------------+---------------------+------------------
1A: 0.1348      | 0.1352              |  0.3%   
1B: 0.0495      | 0.0495              |  0.0%

> >> > 2. There is a visible performance increase when more than 60~70% of the
> >> > columns contain NULLs in a table with a large number of columns.
> >> >
> >> > 3. There are visible space savings when more than 60~70% of the columns
> >> > contain NULLs in a table with a large number of columns.
> 
> Agreed.
> 
> I would call that quite disappointing though and was expecting better.
> Are we sure the patch works and the tests are correct?
> 
> The lack of any space saving for lower % values is strange and
> somewhat worrying. There should be a 36? byte saving for 300 null
> columns in an 800 column table - how does that not show up at all?

The 300-NULLs case will save approximately 108 bytes per page, as 3 tuples
are accommodated in a page in that case.
So the total space left in the page will be approximately 1900 bytes
(including the 108 bytes saved by the optimization).
The point is that in the existing test case all rows are the same size
(approx. 2100 bytes), so no space saving is shown; but if the last row is
such that it fits in the space left (remaining space of the page + space
saved by the NULLs optimization), then it can show space savings as well.
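
A rough check of the leftover-space figure is sketched below. It assumes an 8192-byte block, a 24-byte page header, a 4-byte line pointer per row, a 23-byte fixed tuple header plus a null bitmap of one bit per stored attribute, 8-byte alignment, and 500 non-null int4 values (2000 data bytes) per row. The function names are illustrative and not taken from the patch.

```c
#include <stddef.h>

#define BLCKSZ              8192    /* default PostgreSQL page size */
#define PAGE_HEADER_SIZE    24      /* SizeOfPageHeaderData */
#define LINE_POINTER_SIZE   4       /* sizeof(ItemIdData) */
#define TUPLE_FIXED_HEADER  23      /* offsetof(HeapTupleHeaderData, t_bits) */
#define BITMAPLEN(natts)    (((natts) + 7) / 8)    /* null bitmap bytes */

/* Round up to a multiple of 8 (assumed MAXALIGN on x86_64). */
size_t
maxalign(size_t len)
{
    return (len + 7) & ~((size_t) 7);
}

/* Bytes one row occupies on a page: aligned header + data, plus line pointer. */
size_t
tuple_footprint(int stored_natts, size_t data_bytes)
{
    size_t hoff = maxalign((size_t) (TUPLE_FIXED_HEADER + BITMAPLEN(stored_natts)));

    return maxalign(hoff + data_bytes) + LINE_POINTER_SIZE;
}

/* Bytes left on a page after nrows identical rows (caller ensures they fit). */
size_t
free_bytes_after(size_t nrows, int stored_natts, size_t data_bytes)
{
    return (BLCKSZ - PAGE_HEADER_SIZE) - nrows * tuple_footprint(stored_natts, data_bytes);
}
```

Under these assumptions three rows leave 1772 bytes free with an 800-attribute bitmap and 1892 bytes with a 500-attribute bitmap, in line with the "approximately 1900 bytes" above. The model's gain is 120 bytes per page (40 per row after alignment), the same order as the quoted 108, and neither leftover is enough for a fourth row of about 2100 bytes.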

In any case, there is a performance gain for the 300-NULLs case as well.

Apart from the above, the performance data for a smaller number of columns
(where the trailing nulls are such that they cross a word boundary) also
shows similar gains:

Cases 2 and 3 below can benefit, as the patch will save 4 bytes per tuple.

2. Table with 12 columns (first 3 integer followed by 9 Boolean columns) 
A. INSERT tuples with 9 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+---------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+----------------------
2A:  0.8485   12739  |  0.8524   10811     |  0.4%  15.1% 
2B:  0.5847   25478  |  0.5749   23550     | -1.5%   7.5% 
2C:  0.5591   38217  |  0.5545   34361     |  0.8%  10.0% 


3. Table with 12 columns  (first 3 integer followed by 9 Boolean columns) 
A. INSERT tuples with 4 trailing nulls 
B. UPDATE last column value to "non-null" 
C. UPDATE last column value to "null" 
---------------------+---------------------+---------------------
     Original Code   | Trim Trailing NULLs | Improvement (%)
     TPS   space used|  TPS    space used  | Results
           (pages)   |         (pages)     |
---------------------+---------------------+----------------------
3A:  0.8443   14706  |  0.8626   12739     |  2.3%  13.3% 
3B:  0.5307   29412  |  0.5272   27445     | -0.6%   6.7% 
3C:  0.5102   44118  |  0.5218   40184     |  2.2%   8.9%
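
Why a few trailing NULLs matter in a 12-column table comes down to where the tuple header ends after alignment. The sketch below assumes a 23-byte fixed header followed by a null bitmap of one bit per stored attribute, rounded up to the platform alignment; the function names and the alignment parameter are illustrative. It suggests that the 4-byte saving mentioned above corresponds to 4-byte alignment, while 8-byte alignment (typical on x86_64) would give 8 bytes.

```c
#include <stddef.h>

#define TUPLE_FIXED_HEADER  23      /* offsetof(HeapTupleHeaderData, t_bits) */
#define BITMAPLEN(natts)    (((natts) + 7) / 8)    /* null bitmap bytes */

/* Round len up to a multiple of alignment (alignment must be a power of two). */
size_t
align_up(size_t len, size_t alignment)
{
    return (len + alignment - 1) & ~(alignment - 1);
}

/* Header size of a tuple carrying a null bitmap for stored_natts attributes. */
size_t
tuple_header_size(int stored_natts, size_t alignment)
{
    return align_up((size_t) (TUPLE_FIXED_HEADER + BITMAPLEN(stored_natts)), alignment);
}
```

With 8-byte alignment the header is 32 bytes for 12 stored attributes and 24 bytes for 3 or 8 stored attributes; with 4-byte alignment it is 28 versus 24 bytes. Taking case 2A with 8-byte alignment, a 12-byte data area, and a 4-byte line pointer gives 52 versus 44 bytes per row, that is 157 versus 185 rows in the 8168 usable bytes of a page, which agrees with the 12739 and 10811 pages reported.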

In conclusion, I would say that this patch causes no performance
regression in the most commonly used scenarios,
and it gives a benefit in some of the trailing-NULL cases.

With Regards,
Amit Kapila.




On 24 December 2012 13:13, Amit Kapila <amit.kapila@huawei.com> wrote:

> Apart from the above, the performance data for a smaller number of columns
> (where the trailing nulls are such that they cross a word boundary) also
> shows similar gains:
>
> Cases 2 and 3 below can benefit, as the patch will save 4 bytes per tuple.
>
> 2. Table with 12 columns (first 3 integer followed by 9 Boolean columns)
> A. INSERT tuples with 9 trailing nulls
> B. UPDATE last column value to "non-null"
> C. UPDATE last column value to "null"
> ---------------------+---------------------+---------------------
>      Original Code   | Trim Trailing NULLs | Improvement (%)
>      TPS   space used|  TPS    space used  | Results
>            (pages)   |         (pages)     |
> ---------------------+---------------------+----------------------
> 2A:  0.8485   12739  |  0.8524   10811     |  0.4%  15.1%
> 2B:  0.5847   25478  |  0.5749   23550     | -1.5%   7.5%
> 2C:  0.5591   38217  |  0.5545   34361     |  0.8%  10.0%
>
>
> 3. Table with 12 columns  (first 3 integer followed by 9 Boolean columns)
> A. INSERT tuples with 4 trailing nulls
> B. UPDATE last column value to "non-null"
> C. UPDATE last column value to "null"
> ---------------------+---------------------+---------------------
>      Original Code   | Trim Trailing NULLs | Improvement (%)
>      TPS   space used|  TPS    space used  | Results
>            (pages)   |         (pages)     |
> ---------------------+---------------------+----------------------
> 3A:  0.8443   14706  |  0.8626   12739     |  2.3%  13.3%
> 3B:  0.5307   29412  |  0.5272   27445     | -0.6%   6.7%
> 3C:  0.5102   44118  |  0.5218   40184     |  2.2%   8.9%
>
> In conclusion, I would say that this patch causes no performance
> regression in the most commonly used scenarios,
> and it gives a benefit in some of the trailing-NULL cases.

Not really sure about the 100s of columns use case.

But showing gain in useful places in these more common cases wins my vote.

Thanks for testing. Barring objections, will commit.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services