I'm quite confused by the scripts you shared; they seem incomplete. run_regression.py is meant to call purge_cache.sh (which is missing), and run_benchmark tries to call all sorts of missing .sql scripts.
That was my mistake: I uploaded run_benchmark, which was just a nonsensical AI-generated script from my work area, and I forgot to upload purge_cache.sh. Note that you can still run the Python script with evict mode off,pg. The missing script is attached in my previous message.
A table that is just 24MB and fits into buffers is a bit useless. It means that even with a random pattern (which is generally about the best case for prefetching), only about 1/30 of page accesses will require I/O. Each page has ~32 items, but only the first item from each page will incur an I/O.
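As a rough check of that arithmetic, here is a minimal sketch. It assumes 8 kB pages (the PostgreSQL default; the thread does not state the page size) and takes the ~32 items per page at face value. The constant and function names are only illustrative.

```python
# Back-of-the-envelope estimate of how many item fetches need real I/O
# when every page is read from disk once and then stays in buffers.
TABLE_BYTES = 24 * 1024 * 1024   # 24MB table, as mentioned above
PAGE_BYTES = 8192                # ASSUMPTION: default 8 kB page size
ITEMS_PER_PAGE = 32              # approximate figure quoted above


def io_fraction(items_per_page: int) -> float:
    """Fraction of item fetches that miss, if only the first item
    fetched from each page incurs an I/O."""
    return 1.0 / items_per_page


pages = TABLE_BYTES // PAGE_BYTES
items = pages * ITEMS_PER_PAGE

if __name__ == "__main__":
    print(f"pages: {pages}, items: {items}")
    print(f"fraction of fetches needing I/O: {io_fraction(ITEMS_PER_PAGE):.4f}")
```

With a table this small, almost every fetch is a buffer hit, so any prefetching benefit is diluted by roughly that factor.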
OK, I will run it with different parameters. For development I needed something that was not too slow, so that I could catch the bugs.
On what kind of hardware? How much variance is in the results?
It is a Mac mini M1. Please check the .PNG: it shows the confidence intervals of the difference in execution time with the parameter ON and OFF, under different settings.
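For readers who want to reproduce that kind of interval from raw timings, the following is a minimal sketch of one common approach: a normal-approximation confidence interval for the difference of two sample means. It is not necessarily how the attached plot was produced; the function name `diff_ci` is hypothetical, and the timings in the usage section are placeholder values, not measurements.

```python
import math
from statistics import NormalDist, mean, variance


def diff_ci(on, off, level=0.95):
    """Confidence interval for mean(on) - mean(off).

    Uses an unpooled standard error with a normal approximation, so it
    is only reasonable when each sample has a fair number of runs.
    """
    diff = mean(on) - mean(off)
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(variance(on) / len(on) + variance(off) / len(off))
    # Two-sided critical value, e.g. about 1.96 for level=0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    timings_on = [1.0, 1.1, 0.9, 1.0]
    timings_off = [1.2, 1.3, 1.1, 1.2]
    lo, hi = diff_ci(timings_on, timings_off)
    print(f"95% CI for ON - OFF: [{lo:.3f}, {hi:.3f}] s")
```

An interval that excludes zero suggests the parameter makes a measurable difference for that setting; with only a handful of runs, a t-based interval would be the more careful choice.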