Poor Performance running Django unit tests after upgrading from 10.6 - Mailing list pgsql-performance
From: Roger Hunwicks
Subject: Poor Performance running Django unit tests after upgrading from 10.6
Date:
Msg-id: CAKcHf7vihy53fMotv9NpqCASP5AL+8KmME18ysqiD4xYtNEPMQ@mail.gmail.com
Responses: Re: Poor Performance running Django unit tests after upgrading from 10.6
           Re: Poor Performance running Django unit tests after upgrading from 10.6
List: pgsql-performance
We have a large Django application running against a Postgresql database. The test suite for the application runs in GitLab-CI, using Docker Compose to run the unit tests inside the application container against a Postgresql database running in another container.

When Django runs the unit tests for the application, it starts by creating a new test database and then runs the database migrations to create all the tables and the necessary reference data. Then, for each test, it opens a transaction, runs the test, and rolls back the transaction. This ensures that the database is "clean" before each test and reduces the risk that data created by one test will cause a different test to fail. Consequently, almost all the tables in the test database have zero or very few rows in them. It also means that the statistics for the tables in the test database are pretty meaningless: the statistics, if they exist, will probably say there are zero rows, while the query will actually be dealing with 0-10 rows that are visible in the transaction but which will be rolled back.

When we run the test suite using Postgresql 10.7 in a Docker container we consistently get:

Ran 1166 tests in 1291.855s

When we first started running the same test suite against Postgresql 12.4 we got:

Ran 1166 tests in 8502.030s

I think that this reduction in performance is caused by the lack of accurate statistics, because we had a similar problem (a large reduction in performance) in a load test, which we cured by running ANALYZE after creating the test data and before running the load test. The load test uses the same Django application code but creates a "large amount" of test data (in the 100s to 1000s of rows per table - it is looking for N+1 query problems rather than absolute performance).

We have since managed to get the performance of the test run using 12.4 back to approximately the normal range by customizing the Postgresql parameters. `seq_page_cost=0.1` and `random_page_cost=0.11` seem to be key, but we are also setting `shared_buffers` and all the other typical parameters. With Postgresql 10.7 we weren't setting anything and performance was fine using just the defaults, given the tiny data volumes.

However, even though we have similar performance for 12.4 for most test runs, it remains very variable. About 30% of the time we get something like:

Ran 1166 tests in 5362.690s

We also see similar performance reductions and inconsistent results with 11.9, so whatever change is causing the problem was likely introduced in 11 rather than in 12.

I think we have narrowed down the problem to a single, very complex materialized view using CTEs; the unit tests create the test data and then refresh the materialized view before executing the actual test code.
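For illustration, the pattern looks roughly like this (a minimal sketch only - the test class, method, and data are invented; the view name is real):

    from django.db import connection
    from django.test import TestCase


    class MarketPriceFactsTest(TestCase):
        # django.test.TestCase wraps every test in a transaction that is
        # rolled back afterwards, so the tables stay almost empty and the
        # planner statistics never reflect the rows the test creates.

        def refresh_view(self):
            with connection.cursor() as cursor:
                cursor.execute(
                    "REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized"
                )

        def test_facts_for_a_handful_of_prices(self):
            # ... create a handful of rows of test data here ...
            self.refresh_view()
            # ... assertions against the view go here ...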
Database logging using auto_explain shows things like:

db_1 | 2020-10-14 10:27:59.692 UTC [255] LOG: duration: 4134.625 ms plan:
db_1 | Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1 | Merge Join (cost=14141048331504.30..9635143213364288.00 rows=116618175994107184 width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)

For comparison, the equivalent query on 10.7 has:

db_1 | 2020-10-15 03:28:58.382 UTC [163] LOG: duration: 10.500 ms plan:
db_1 | Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1 | Hash Left Join (cost=467650.55..508612.80 rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)

The staggering cost implies that the statistics are badly wrong, but given how few rows are in the result (36, and it's not an aggregate) I would expect the query to be fast regardless of what the plan is. In 10.7 the materialized view refreshes in 150 ms or less. I also don't understand why the performance would be so inconsistent across test runs for 12.4 but not for 10.7. It is as though sometimes it gets a good plan and sometimes it doesn't.

I can get performance almost identical to 10.7 by altering the unit tests so that each test that refreshes the materialized view executes `ANALYZE;` immediately before the refresh (roughly as sketched at the end of this message).

Is it worth us trying to debug the plan for situations with low row counts and poor statistics? Or is this use case not really covered? The general advice is obviously to make sure that statistics are up to date before troubleshooting performance problems. On the other hand, it is not easy for us to make sure that we run ANALYZE inside the transaction in each unit test; it also seems a bit wasteful.

Opinions and advice gratefully received.
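For reference, the ANALYZE workaround looks roughly like this (again only a sketch - the mixin name is invented):

    from django.db import connection
    from django.test import TestCase


    class AnalyzeBeforeRefreshMixin:
        # Hypothetical mixin: running ANALYZE inside the test's transaction
        # gives the planner statistics for the rows the test has created
        # before the materialized view is refreshed.

        def refresh_view(self):
            with connection.cursor() as cursor:
                cursor.execute("ANALYZE;")
                cursor.execute(
                    "REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized"
                )


    class MarketPriceFactsTest(AnalyzeBeforeRefreshMixin, TestCase):
        def test_facts_for_a_handful_of_prices(self):
            # ... create test data, then:
            self.refresh_view()
            # ... assertions against the view go here ...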
Roger