Poor Performance running Django unit tests after upgrading from 10.6 - Mailing list pgsql-performance

From Roger Hunwicks
Subject Poor Performance running Django unit tests after upgrading from 10.6
Msg-id CAKcHf7vihy53fMotv9NpqCASP5AL+8KmME18ysqiD4xYtNEPMQ@mail.gmail.com
List pgsql-performance
We have a large Django application running against a Postgresql database.

The test suite for the application runs in GitLab-CI using Docker
Compose to run the unit tests inside the application container against
a Postgresql database running in another container.

When Django runs the unit tests for the application, it starts by
creating a new test database and then runs the database migrations
to create all the tables and the necessary reference data. Then, for
each test, it opens a transaction, runs the test, and rolls back the
transaction. This ensures that the database is "clean" before each
test run and reduces the risk that data created by one test will
cause a different test to fail. Consequently, almost all the tables
in the test database have zero or very few rows in them. It also
means that the statistics for the tables in the test database are
pretty meaningless: the statistics, if they exist, will probably say
there are zero rows, while the query will actually be dealing with
0-10 rows that are visible inside the transaction but will be rolled
back.
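
To illustrate the pattern (a minimal sketch; the model name here is
hypothetical), `django.test.TestCase` opens a transaction around each
test and rolls it back afterwards:

    # Minimal sketch of the per-test rollback pattern; the MarketPrice
    # model is hypothetical. django.test.TestCase wraps each test
    # method in a transaction that is rolled back during cleanup.
    from django.test import TestCase

    from price.models import MarketPrice  # hypothetical model

    class MarketPriceTests(TestCase):
        def test_rows_visible_only_inside_transaction(self):
            MarketPrice.objects.create(price=42)
            # The row is visible here, inside the test's transaction...
            self.assertEqual(MarketPrice.objects.count(), 1)
            # ...but the transaction is rolled back after the test, so
            # the table (and its statistics) stay empty between tests.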

When we run the test suite using Postgresql 10.7 in a Docker container
we consistently get:

Ran 1166 tests in 1291.855s

When we first started running the same test suite against Postgresql
12.4 we got:

Ran 1166 tests in 8502.030s

I think that this reduction in performance is caused by the lack of
accurate statistics, because we had a similar problem (a large
reduction in performance) in a load test, which we cured by running
ANALYZE after creating the test data and before running the load
test. The load test uses the same Django application code, but
creates a "large amount" of test data (in the 100s to 1000s of rows
per table - it is looking for N+1 query problems rather than absolute
performance).

We have since managed to get the performance of the test run using
12.4 back to approximately the normal range by customizing the
Postgresql parameters. `seq_page_cost=0.1` and `random_page_cost=0.11`
seem to be the key settings, but we are also setting `shared_buffers`
and the other typical parameters. With Postgresql 10.7 we weren't
setting anything, and performance was fine using just the defaults,
given the tiny data volumes.
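
For reference, a minimal sketch of applying the two key overrides at
the cluster level from a superuser session (the values are the ones
above; how the settings are actually injected into the container is
incidental):

    # Sketch: apply the planner-cost overrides via ALTER SYSTEM.
    # Requires a superuser connection in autocommit mode, since
    # ALTER SYSTEM cannot run inside a transaction block.
    from django.db import connection

    def apply_planner_overrides():
        with connection.cursor() as cursor:
            cursor.execute("ALTER SYSTEM SET seq_page_cost = 0.1")
            cursor.execute("ALTER SYSTEM SET random_page_cost = 0.11")
            # Reload so the new settings take effect for new sessions.
            cursor.execute("SELECT pg_reload_conf()")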

However, even though we now see similar performance on 12.4 for most
test runs, it remains very variable. About 30% of the time we get
something like:

Ran 1166 tests in 5362.690s

We also see similar performance reductions and inconsistent results
with 11.9, so whatever change is causing the problem was likely
introduced in 11 rather than in 12.

I think we have narrowed down the problem to a single, very complex,
materialized view using CTEs; the unit tests create the test data and
then refresh the materialized view before executing the actual test
code.

Database logging using auto_explain shows things like:

db_1             | 2020-10-14 10:27:59.692 UTC [255] LOG:  duration: 4134.625 ms  plan:
db_1             |     Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1             |     Merge Join  (cost=14141048331504.30..9635143213364288.00 rows=116618175994107184 width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)

For comparison, the equivalent query on 10.7 has:

db_1             | 2020-10-15 03:28:58.382 UTC [163] LOG:  duration: 10.500 ms  plan:
db_1             |     Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1             |     Hash Left Join  (cost=467650.55..508612.80 rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)

The staggering cost implies that the statistics are badly wrong, but
given how few rows are in the result (36, and it's not an aggregate) I
would expect the query to be fast regardless of what the plan is. In
10.7 the materialized view refreshes in 150 ms or less.

I also don't understand why the performance would be so inconsistent
across test runs for 12.4 but not for 10.7. It is as though sometimes
it gets a good plan and sometimes it doesn't.
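
For anyone wanting to capture the same plan logging, here is a minimal
per-session sketch of enabling auto_explain (it assumes a superuser
connection and that the module is available; loading it via the server
configuration is equivalent):

    # Sketch: enable auto_explain for the current session so every
    # statement's actual plan is logged, like the output quoted above.
    from django.db import connection

    def enable_auto_explain():
        with connection.cursor() as cursor:
            cursor.execute("LOAD 'auto_explain'")
            cursor.execute("SET auto_explain.log_min_duration = 0")
            cursor.execute("SET auto_explain.log_analyze = on")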

I can get performance almost identical to 10.7 by altering the unit
tests so that, in each test that refreshes the materialized view
before executing the query under test, we run `ANALYZE;` before
refreshing the view.
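
Concretely, the workaround looks something like this (a sketch; the
helper name is hypothetical):

    # Sketch of the workaround; refresh_price_facts is a hypothetical
    # helper. ANALYZE runs inside the test's open transaction, so it
    # sees the uncommitted test data and gives the planner realistic
    # row estimates before the refresh is planned.
    from django.db import connection

    def refresh_price_facts(analyze_first=True):
        with connection.cursor() as cursor:
            if analyze_first:
                cursor.execute("ANALYZE")
            cursor.execute(
                "REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized"
            )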

Is it worth us trying to debug the plan for situations with low row
counts and poor statistics? Or is this use case not really covered?
The general advice is obviously to make sure that statistics are up
to date before troubleshooting performance problems. On the other
hand, it is not easy for us to make sure that we run ANALYZE inside
the transaction in each unit test, and it also seems a bit wasteful.
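
If we did want to centralize it, something like a base test class
might work (a hypothetical sketch, not code we are running):

    # Hypothetical sketch: run ANALYZE once at the start of every
    # test, inside the transaction that TestCase has already opened,
    # so each test pays the cost whether or not it needs fresh
    # statistics.
    from django.db import connection
    from django.test import TestCase

    class AnalyzedTestCase(TestCase):
        def setUp(self):
            super().setUp()
            with connection.cursor() as cursor:
                cursor.execute("ANALYZE")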

Opinions and advice gratefully received.

Roger


