Re: BUG #18177: certain queries under certain contexts take multiple orders of magnitude longer compared to v10 - Mailing list pgsql-bugs

From Robert Leach
Subject Re: BUG #18177: certain queries under certain contexts take multiple orders of magnitude longer compared to v10
Msg-id F8C38955-0B08-4992-80C2-36DCC37B7BD6@princeton.edu
In response to Re: BUG #18177: certain queries under certain contexts take multiple orders of magnitude longer compared to v10  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-bugs
I rearranged the order of the responses in this email based on the precedence of the things I've learned...

select * from pg_stat_user_tables where relid = '"DataRepo_peakdata"'::regclass;

When I try to execute this command, I get:

psycopg2.errors.UndefinedTable: relation "DataRepo_peakdata" does not exist

You'd need to be connected to the same database that the table exists
in.  Your EXPLAIN output had extra formatting in it. I had assumed the
quotes were there because the table was created with some upper-case
characters. PostgreSQL folds unquoted identifiers to lower-case.
Perhaps you can try without the quotes if you still get that when
connected to the correct database.
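For example, something along these lines should show the difference (a sketch, assuming the table really was created with quoted, mixed-case identifiers):

select 'DataRepo_peakdata'::regclass;      -- unquoted, so it folds to datarepo_peakdata and fails
select '"DataRepo_peakdata"'::regclass;    -- inner double quotes preserve the mixed case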

So I finally worked out that only this seems to do the right thing:

cursor.execute("select * from pg_stat_user_tables where relid = '\"DataRepo_peakdata\"'::regclass;")

Before, I was trying:

cursor.execute("""select * from pg_stat_user_tables where relid = '"DataRepo_peakdata"'::regclass;""")

and that apparently didn't work.

Anyway, here's what I see:

Slow case (after having run tests from a separate class first):

After the animal/sample load and before the peak data load:
(2911633, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the *slow* query:
(2914197, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)

These are the speeds of the 4 queries in the test:
pg query time: 0.001s
pg count time: 0.141s
carbon count time: 303.786s
nitrogen count time: 300.561s  (everything runs slower on my laptop at home, so it's over 200s)

Usually slow case (same as above, but the first time I tried it, it went randomly fast):

After the animal/sample load and before the peak data load:
(2911633, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the *slow* query:
(2913556, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)

Note, in this particular instance, the queries went FAST:
pg query time: 0.002s
pg count time: 0.066s
carbon count time: 0.036s
nitrogen count time: 0.031s

Always Fast case (when only running that one test):

After the animal/sample load and before the peak data load: 
(2912274, 'public', 'DataRepo_peakdata', 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the query:
(2912915, 'public', 'DataRepo_peakdata', 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, None, None, None, 0, 0, 0, 0)
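For readability, roughly the same counters can be pulled out by name rather than by position, something like (a sketch):

select relname, n_tup_ins, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relid = '"DataRepo_peakdata"'::regclass;

If I'm reading the column order right, the four None values above are last_vacuum, last_autovacuum, last_analyze and last_autoanalyze, i.e. neither VACUUM nor ANALYZE appears to have run on the table at those points.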

Nothing looks particularly bug-like so far.  It seems the
pg_class.reltuples estimate for this relation is way out.

If this is operating as intended, there are a couple questions I have:

1. In postgres 10, the query always runs in under 1 second, regardless of the context.  In postgres 13, depending on the context, why does it almost always take over 200 seconds?

The reason the query is slow is that the planner thinks that there's
only 1 row in the DataRepo_peakdata table. In reality, when you
executed that query there were 7896 rows.

Because the planner thinks there's 1 row, it opts to use the fastest
method available to join two relations where one side has just 1 row,
i.e. Nested Loop.  Unfortunately, it's a risky plan, as
underestimations have quadratic repercussions.  In the v13 query, this
comes out at this point in the plan:

             ->  Nested Loop  (cost=0.62..37.47 rows=1 width=8) (actual time=10492.455..202323.935 rows=8 loops=1)
                   Join Filter: ("DataRepo_peakgroup".id = "DataRepo_peakdata".peak_group_id)
                   Rows Removed by Join Filter: 7888
                   ->  Index Scan using "DataRepo_peakdata_peak_group_id_4dd87f4a" on "DataRepo_peakdata"  (cost=0.25..8.26 rows=1 width=8) (actual time=0.017..7.149 rows=7896 loops=1)
                   ->  Nested Loop  (cost=0.38..29.20 rows=1 width=4) (actual time=2.942..25.621 rows=1 loops=7896)

You can see the planner thought the index scan would produce 1 row.
7896 were found during execution.

Most of the time of the entire query can be accounted for by this plan
fragment. The average execution time for 1 loop of that 2nd Nested
Loop quoted above is 25.621 milliseconds. If you multiply that number
by the number of loops (7896), you get 202303.416 milliseconds.

Because the scan on DataRepo_peakdata contains no clauses, the row
estimate here is derived from pg_class.reltuples for that index.
VACUUM and ANALYZE adjust this pg_class column after they run so that
the planner has a recent estimate on the number of records in the
table. The estimated tuples for a clause-less scan of an index don't
come directly from pg_class.reltuples.  What happens is it looks at
relpages and reltuples to calculate the average tuples per page, then
we apply that average over the *current* number of pages in the index.

7896 is the exact number of records that are loaded right before the query.  The load is of an Excel sheet that has 94 rows and 84 columns:

84 * 94 = 7896

I will point out that the loading is not very fast at all.  Those 7896 records (along with other records) have always taken at least 80 seconds to load, and the animal/sample loading before that always takes at least 10 seconds.  It's only the subsequent queries after those loads complete that run unexpectedly slow.

It would be interesting to see what pg_class reltuples and relpages is
set to for this index.

If you have a query for me, I can send you the output.
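For reference, a query along these lines should show those two columns for both the table and the index (a sketch; the index name is taken from the plan fragment above, and relkind distinguishes the table from the index):

select relname, relkind, reltuples, relpages,
       reltuples / nullif(relpages, 0) as avg_tuples_per_page
from pg_class
where relname in ('DataRepo_peakdata',
                  'DataRepo_peakdata_peak_group_id_4dd87f4a');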

2. In postgres 13, the query runs in under 1 second if I run just the one test that (starting from an empty database) loads data and runs the query.  BUT, if I run that one test after having run tests in another Python/Django test class (which loads data, runs various tests, then destroys all the data that that test class loaded), that query takes over 200 seconds to run.  Why would that be?

I don't know what these tests do.  If they update some data in the
tables then perhaps the index becomes bloated and that results in the
number of pages in the index changing, which affects the estimate of
the number of tuples in the index.

It would be good to get the output of
pg_relation_size('DataRepo_peakdata_peak_group_id_4dd87f4a') before
and after the tests run.

All the tests I've been looking at just insert data.

I tried executing "pg_relation_size('DataRepo_peakdata_peak_group_id_4dd87f4a')" as a command, but I get a syntax error.
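Presumably it needs to be wrapped in a SELECT rather than issued on its own, and the mixed-case index name likely needs the same inner double-quoting as the table name, something like:

select pg_relation_size('"DataRepo_peakdata_peak_group_id_4dd87f4a"');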

Perhaps it goes away after autovacuum runs ANALYZE on the table again.
I suspect this problem depends on the timing of autovacuum and suspect
you might see the tests running faster more often if you changed
autovacuum_naptime to 1 second. You'd probably see them more
consistently running slower if you set autovacuum_naptime to some
value higher than it's set to right now.
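For anyone trying that, a sketch of changing the setting cluster-wide without a restart (run as a superuser; autovacuum_naptime only needs a reload to take effect):

alter system set autovacuum_naptime = '1s';
select pg_reload_conf();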

Based on a hunch from your comment about connecting to the database, I tried connecting specifically to our database when doing the vacuum, because I thought maybe that's why the vacuum didn't seem to speed things up (and that the early instances where I did see a speed improvement were just the randomly fast queries I sometimes see without doing anything different).  But connecting to our database to do the vacuum doesn't seem to work: whenever I attempt it, the vacuum command just seems to hang forever.
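If it helps to diagnose the hang, a query along these lines from a second connection should show whether the VACUUM is sitting idle or waiting on a lock held by another session (a sketch):

select pid, state, wait_event_type, wait_event, query
from pg_stat_activity
where query ilike '%vacuum%';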

Rob
