Thread: tuplesort test coverage

tuplesort test coverage

From
Andres Freund
Date:
Hi,

[1] made me look at tuplesorts test coverage at
https://coverage.postgresql.org/src/backend/utils/sort/tuplesort.c.gcov.html
We don't have coverage for a quite a number of things:
- cluster for expression indexes (line 935)
- sorts exceeding INT_MAX / 2 memory (line 1337), but that seems hard to
  test realistically
- aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
- in memory backwards scans (lines 1936, 3042)
- *any* coverage for TSS_SORTEDONTAPE (line 1964)
- disk sort skiptuples (line 2325)
- mergeruns without abbrev key (line 2582)
- disk sorts with more than one run (lines 2707, 2789)
- any disk based tuplesort_begin_heap() (lines 3649, 3676)
- Seems copytup_index currently is essentially dead, because
  tuplesort_putindextuplevalues() doesn't use COPYTUP (line 4142)
- any disk based tuplesort_begin_datum (lines 4282, 4323)

I'm pretty unhappy that tuplesort has been whacked around pretty heavily
in the last few years, while *reducing* effective test coverage
noticeably, rather than increasing it.  There's pretty substantial and
nontrivial areas without any tests - do we have actually have any
confidence that they work?

The largest culprits for that seem to be abbreviated keys, the tape
logic overhaul, and the increase of work mem.

Greetings,

Andres Freund



Re: tuplesort test coverage

From
Peter Geoghegan
Date:
On Sun, Oct 13, 2019 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:
> - cluster for expression indexes (line 935)

We've never had coverage of this, but perhaps that can be added now.

> - sorts exceeding INT_MAX / 2 memory (line 1337), but that seems hard to
>   test realistically

I don't think that that can be tested, realistically.

> - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)

Also hard to test -- there was a bug here when abbreviated keys first
went in -- that was detected by amcheck.

All of the places where we abort are essentially the same, though.

> - in memory backwards scans (lines 1936, 3042)
> - *any* coverage for TSS_SORTEDONTAPE (line 1964)

That used to exist, but it went away when we killed replacement selection sort.

> - disk sort skiptuples (line 2325)

Couldn't hurt.

> - mergeruns without abbrev key (line 2582)

mergeruns() doesn't use abbreviated keys -- this code disables their
use in the standard way.

> - disk sorts with more than one run (lines 2707, 2789)
> - any disk based tuplesort_begin_heap() (lines 3649, 3676)

I had to convince Tom to get the coverage of external sorts we have
now. Apparently there are buildfarm animals that are very sensitive to
that cost, that could have substantially increased test runtimes were
we to do more. Perhaps this could be revisited.

> - Seems copytup_index currently is essentially dead, because
>   tuplesort_putindextuplevalues() doesn't use COPYTUP (line 4142)

Yeah, that looks like dead code -- it should just be a stub with a
"can't happen" error.

> I'm pretty unhappy that tuplesort has been whacked around pretty heavily
> in the last few years, while *reducing* effective test coverage
> noticeably, rather than increasing it.

I don't think that that's true, on balance. There are only 1,000 extra
lines of code in tuplesort.c in master compared to 9.4, even though we
added parallel sorts and abbreviated keys, two huge enhancements. In
many ways, tuplesort is now simpler than ever.

> There's pretty substantial and
> nontrivial areas without any tests - do we have actually have any
> confidence that they work?

Everything that you're talking about has existed since v11 came out a
year ago, and most of it is a year or two older than that. So yeah,
I'm pretty confident that it works.

-- 
Peter Geoghegan



Re: tuplesort test coverage

From
Andres Freund
Date:
Hi,

On 2019-10-15 13:05:32 +0100, Peter Geoghegan wrote:
> > - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
> 
> Also hard to test -- there was a bug here when abbreviated keys first
> went in -- that was detected by amcheck.
> 
> All of the places where we abort are essentially the same, though.

Why is it that hard?  Seems fairly easy to create cases that reliably
abort.

I really don't think it's ok to have as many abbrev abort related paths
without any coverage - the relevant code isn't that trivial. And
something like amcheck really doesn't strike me as sufficient. For one,
it doesn't provide any coverage either. For another, plenty sorts don't
end up in a form that amcheck sees.

Tests aren't just there to verify that the current behaviour isn't
broken, they're also there to allow to develop with some confidence. And
I don't think tuplesort as is really allows that, and e.g. abbreviated
keys made that substantially worse. That happens, but I think it'd be
good if you could help improving the situation.

E.g.
SELECT * FROM (SELECT ('00000000-0000-0000-0000-'||to_char(g.i, '000000000000FM'))::uuid uuid FROM
generate_series(15000,0, -1) g(i)) d ORDER BY uuid
 
reliably triggers abbreviated keys, and it looks to me like that should
be portable.  With a few tweaks it'd be fairly easy to use that to
provide some OK coverage for most the abbrev key cases.


> > - in memory backwards scans (lines 1936, 3042) - *any* coverage for
> > TSS_SORTEDONTAPE (line 1964)
> 
> That used to exist, but it went away when we killed replacement selection sort.

Yes, that's kind of my point? Either that patch reduced coverage, or it
created dead code. Neither is good.



> > - mergeruns without abbrev key (line 2582)
> 
> mergeruns() doesn't use abbreviated keys -- this code disables their
> use in the standard way.

Well, then reformulate the point that we should have coverage for
mergeruns() when initially abbreviated keys were set up.


> > - disk sorts with more than one run (lines 2707, 2789)
> > - any disk based tuplesort_begin_heap() (lines 3649, 3676)
> 
> I had to convince Tom to get the coverage of external sorts we have
> now. Apparently there are buildfarm animals that are very sensitive to
> that cost, that could have substantially increased test runtimes were
> we to do more. Perhaps this could be revisited.

Hm. I'm a bit confused. Isn't all that's required to set a tiny amount
of work_mem? Then it's easy to trigger many passes without a lot of IO?



> > I'm pretty unhappy that tuplesort has been whacked around pretty heavily
> > in the last few years, while *reducing* effective test coverage
> > noticeably, rather than increasing it.
> 
> I don't think that that's true, on balance. There are only 1,000 extra
> lines of code in tuplesort.c in master compared to 9.4, even though we
> added parallel sorts and abbreviated keys, two huge enhancements. In
> many ways, tuplesort is now simpler than ever.

I'm not saying that tuplesort has gotten worse or anything. Just that
there's been too much development without adding tests.


> > There's pretty substantial and
> > nontrivial areas without any tests - do we have actually have any
> > confidence that they work?
> 
> Everything that you're talking about has existed since v11 came out a
> year ago, and most of it is a year or two older than that. So yeah,
> I'm pretty confident that it works.

That's may be true, but there's also basically no way to discover bugs
except manual testing, and users encountering the bugs. That's not good
enough.

Greetings,

Andres Freund



Re: tuplesort test coverage

From
Andres Freund
Date:
Hi,

On 2019-10-24 11:10:34 -0700, Andres Freund wrote:
> On 2019-10-15 13:05:32 +0100, Peter Geoghegan wrote:
> > > - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
> > 
> > Also hard to test -- there was a bug here when abbreviated keys first
> > went in -- that was detected by amcheck.
> > 
> > All of the places where we abort are essentially the same, though.
> 
> Why is it that hard?  Seems fairly easy to create cases that reliably
> abort.
> 
> I really don't think it's ok to have as many abbrev abort related paths
> without any coverage - the relevant code isn't that trivial. And
> something like amcheck really doesn't strike me as sufficient. For one,
> it doesn't provide any coverage either. For another, plenty sorts don't
> end up in a form that amcheck sees.
> 
> Tests aren't just there to verify that the current behaviour isn't
> broken, they're also there to allow to develop with some confidence. And
> I don't think tuplesort as is really allows that, and e.g. abbreviated
> keys made that substantially worse. That happens, but I think it'd be
> good if you could help improving the situation.
> 
> E.g.
> SELECT * FROM (SELECT ('00000000-0000-0000-0000-'||to_char(g.i, '000000000000FM'))::uuid uuid FROM
generate_series(15000,0, -1) g(i)) d ORDER BY uuid
 
> reliably triggers abbreviated keys, and it looks to me like that should
> be portable.  With a few tweaks it'd be fairly easy to use that to
> provide some OK coverage for most the abbrev key cases.

Here's a first stab at getting the coverage of tuplesort.c to a
satisfying level.  There's still bits uncovered, but that's largely
either a) trace_sort related b) hopefully unreachable stuff c) explain
related. The largest actually missing thing is a disk-based
mark/restore, which probably ought be covered.

I think the the test time of this would still be OK, but if not we could
also work a bit more on that angle.

I'm pretty sure there's some minor copy & paste mistakes in the test,
but I want to get this out there and get some reactions before investing
further time.

Peter, Tom?

- Andres

Attachment

Re: tuplesort test coverage

From
Peter Geoghegan
Date:
On Thu, Oct 24, 2019 at 7:10 PM Andres Freund <andres@anarazel.de> wrote:
> I really don't think it's ok to have as many abbrev abort related paths
> without any coverage - the relevant code isn't that trivial. And
> something like amcheck really doesn't strike me as sufficient. For one,
> it doesn't provide any coverage either. For another, plenty sorts don't
> end up in a form that amcheck sees.

I agree.

> Tests aren't just there to verify that the current behaviour isn't
> broken, they're also there to allow to develop with some confidence. And
> I don't think tuplesort as is really allows that, and e.g. abbreviated
> keys made that substantially worse. That happens, but I think it'd be
> good if you could help improving the situation.

I would like to improve this. I am mostly just pointing out that there
has been resistance to this historically. I am in favor of
mechanically increasing test coverage of tuplesort.c along the lines
you describe. I'm just a bit concerned that Tom or others may see it
differently.

> E.g.
> SELECT * FROM (SELECT ('00000000-0000-0000-0000-'||to_char(g.i, '000000000000FM'))::uuid uuid FROM
generate_series(15000,0, -1) g(i)) d ORDER BY uuid
 
> reliably triggers abbreviated keys, and it looks to me like that should
> be portable.  With a few tweaks it'd be fairly easy to use that to
> provide some OK coverage for most the abbrev key cases.

I agree.

> Yes, that's kind of my point? Either that patch reduced coverage, or it
> created dead code. Neither is good.

I agree.

> > mergeruns() doesn't use abbreviated keys -- this code disables their
> > use in the standard way.
>
> Well, then reformulate the point that we should have coverage for
> mergeruns() when initially abbreviated keys were set up.

That doesn't seem essentially, but I'm okay with it.

> > I had to convince Tom to get the coverage of external sorts we have
> > now. Apparently there are buildfarm animals that are very sensitive to
> > that cost, that could have substantially increased test runtimes were
> > we to do more. Perhaps this could be revisited.
>
> Hm. I'm a bit confused. Isn't all that's required to set a tiny amount
> of work_mem? Then it's easy to trigger many passes without a lot of IO?

Yes, but Tom felt that this might not be good enough when this was
discussed in 2016. However, I seem to recall that he was pleasantly
surprised at how small the overhead turned out to be.

It's hard for me to test how much overhead this will have on a machine
with horribly slow I/O. Though I just bought a new Raspberry Pi, and
could test on that when I get back home from my trip to Europe -- it
uses an SD card, which is pretty slow.

> I'm not saying that tuplesort has gotten worse or anything. Just that
> there's been too much development without adding tests.

I agree.

-- 
Peter Geoghegan



Re: tuplesort test coverage

From
Peter Geoghegan
Date:
On Thu, Oct 24, 2019 at 10:10 PM Andres Freund <andres@anarazel.de> wrote:
> Here's a first stab at getting the coverage of tuplesort.c to a
> satisfying level.  There's still bits uncovered, but that's largely
> either a) trace_sort related b) hopefully unreachable stuff c) explain
> related. The largest actually missing thing is a disk-based
> mark/restore, which probably ought be covered.

Yeah. It looks like function coverage of logtape.c will be 100% once
you have coverage of mark and restore.

> I think the the test time of this would still be OK, but if not we could
> also work a bit more on that angle.

That's hard for me to test right now, but offhand this general
approach looks good to me. I am pretty sure it's portable.

-- 
Peter Geoghegan



Re: tuplesort test coverage

From
Andres Freund
Date:
Hi,

On 2019-10-25 12:37:38 +0100, Peter Geoghegan wrote:
> On Thu, Oct 24, 2019 at 10:10 PM Andres Freund <andres@anarazel.de> wrote:
> > Here's a first stab at getting the coverage of tuplesort.c to a
> > satisfying level.  There's still bits uncovered, but that's largely
> > either a) trace_sort related b) hopefully unreachable stuff c) explain
> > related. The largest actually missing thing is a disk-based
> > mark/restore, which probably ought be covered.
> 
> Yeah. It looks like function coverage of logtape.c will be 100% once
> you have coverage of mark and restore.

Yea, it's definitely better after.


> > I think the the test time of this would still be OK, but if not we could
> > also work a bit more on that angle.
> 
> That's hard for me to test right now, but offhand this general
> approach looks good to me. I am pretty sure it's portable.

I pushed this now. We'll see what the slower buildfarm animals say. I'll
try to see how long they took in a few days.

Greetings,

Andres Freund



Re: tuplesort test coverage

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I pushed this now. We'll see what the slower buildfarm animals say. I'll
> try to see how long they took in a few days.

friarbird (a CLOBBER_CACHE_ALWAYS animal) just showed a failure in this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-12-12%2006%3A20%3A02

================== pgsql.build/src/test/regress/regression.diffs ===================
diff -U3 /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out
/pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out
--- /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out    2019-11-13 19:54:11.000000000
-0500
+++ /pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out    2019-12-12 08:25:23.000000000 -0500
@@ -625,13 +625,13 @@
                Group Key: a.col12
                Filter: (count(*) > 1)
                ->  Merge Join
-                     Merge Cond: (a.col12 = b.col12)
-                     ->  Sort
-                           Sort Key: a.col12 DESC
-                           ->  Seq Scan on test_mark_restore a
+                     Merge Cond: (b.col12 = a.col12)
                      ->  Sort
                            Sort Key: b.col12 DESC
                            ->  Seq Scan on test_mark_restore b
+                     ->  Sort
+                           Sort Key: a.col12 DESC
+                           ->  Seq Scan on test_mark_restore a
 (14 rows)

 :qry;

Since a and b are exactly the same table, in principle it's a matter of
chance which one the planner will put on the outside of the join.
I think what happened here is that the test ran long enough for
autovacuum/autoanalyze to come along and scan the table, changing its
stats in between where the planner picked up the stats for a and those
for b, and we ended up making the opposite join order choice.

I considered fixing this by adding some restriction clause on b so
that the join order choice isn't such a coin flip.  But it's not
clear that the problem couldn't recur anyway --- the table stats
would change significantly on auto-analyze, since the test script
isn't bothering to create any stats itself.

What seems like a simpler and more reliable fix is to make
test_mark_restore a temp table, thus keeping autovac away from it.
Is there a reason in terms of the test's goals not to do that?

Also ... why in the world does the script drop its tables at the end
with IF EXISTS?  They'd better exist at that point.  I object
to the DROP IF EXISTS up at the top, too.  The regression tests
do not need to be designed to deal with an unpredictable start state,
and coding them to do so can have no effect other than possibly
masking problems.

            regards, tom lane



Re: tuplesort test coverage

From
Andres Freund
Date:
Hi,

On 2019-12-12 09:27:04 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I pushed this now. We'll see what the slower buildfarm animals say. I'll
> > try to see how long they took in a few days.
> 
> friarbird (a CLOBBER_CACHE_ALWAYS animal) just showed a failure in this:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-12-12%2006%3A20%3A02
> 
> ================== pgsql.build/src/test/regress/regression.diffs ===================
> diff -U3 /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out
/pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out
> --- /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out    2019-11-13 19:54:11.000000000
-0500
> +++ /pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out    2019-12-12 08:25:23.000000000 -0500
> @@ -625,13 +625,13 @@
>                 Group Key: a.col12
>                 Filter: (count(*) > 1)
>                 ->  Merge Join
> -                     Merge Cond: (a.col12 = b.col12)
> -                     ->  Sort
> -                           Sort Key: a.col12 DESC
> -                           ->  Seq Scan on test_mark_restore a
> +                     Merge Cond: (b.col12 = a.col12)
>                       ->  Sort
>                             Sort Key: b.col12 DESC
>                             ->  Seq Scan on test_mark_restore b
> +                     ->  Sort
> +                           Sort Key: a.col12 DESC
> +                           ->  Seq Scan on test_mark_restore a
>  (14 rows)
>  
>  :qry;
> 
> Since a and b are exactly the same table, in principle it's a matter of
> chance which one the planner will put on the outside of the join.

Yea.


> I think what happened here is that the test ran long enough for
> autovacuum/autoanalyze to come along and scan the table, changing its
> stats in between where the planner picked up the stats for a and those
> for b, and we ended up making the opposite join order choice.

Sounds reasonable.


> What seems like a simpler and more reliable fix is to make
> test_mark_restore a temp table, thus keeping autovac away from it.
> Is there a reason in terms of the test's goals not to do that?

I can't see any reason. The sorting code shouldn't care about the source
of tuples. I guess there could at some point be tests for parallel
sorting, but that'd just use a different table.


> Also ... why in the world does the script drop its tables at the end
> with IF EXISTS?  They'd better exist at that point.  I object
> to the DROP IF EXISTS up at the top, too.  The regression tests
> do not need to be designed to deal with an unpredictable start state,
> and coding them to do so can have no effect other than possibly
> masking problems.

Well, it makes it a heck of a lot easier to run tests in isolation while
evolving them. While I personally think it's good to leave cleanup for
partial states in for cases where it was helpful during development, I
also don't care about it strongly.

Greetings,

Andres Freund



Re: tuplesort test coverage

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-12-12 09:27:04 -0500, Tom Lane wrote:
>> What seems like a simpler and more reliable fix is to make
>> test_mark_restore a temp table, thus keeping autovac away from it.
>> Is there a reason in terms of the test's goals not to do that?

> I can't see any reason. The sorting code shouldn't care about the source
> of tuples. I guess there could at some point be tests for parallel
> sorting, but that'd just use a different table.

OK, done that way.

>> Also ... why in the world does the script drop its tables at the end
>> with IF EXISTS?  They'd better exist at that point.  I object
>> to the DROP IF EXISTS up at the top, too.  The regression tests
>> do not need to be designed to deal with an unpredictable start state,
>> and coding them to do so can have no effect other than possibly
>> masking problems.

> Well, it makes it a heck of a lot easier to run tests in isolation while
> evolving them. While I personally think it's good to leave cleanup for
> partial states in for cases where it was helpful during development, I
> also don't care about it strongly.

As far as that goes, making the tables temp is an even better solution.

            regards, tom lane