The Case of Simpson's Paradox
Aug. 21st, 2021 10:35 am
Basically: when testing a software system we expect a 100-microsecond improvement, which should amount to something on the order of 10% of the response time, but we see much less than that.
How come?

This is a histogram of the number of responses with a given response time (a sample thereof). Blue is the distribution of the old response times; orange is the distribution of the response times with the optimisation. It is clear that the shape has moved left: orange has fewer responses at the higher response times and more responses at the lower ones.
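For a feel of what such a distribution looks like, here is a minimal Python sketch that simulates it. Everything in it is assumed for illustration: the hump means and fractions are taken from the numbers discussed below, while the spread (σ = 0.15 ms), the sample size, and the helper `sample` are my own inventions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def sample(fast_mean, slow_mean, fast_frac, n=100_000, sigma=0.15):
    """Mixture of two 'bell-shaped' modes: a fast hump and a slow hump (all in ms)."""
    is_fast = rng.random(n) < fast_frac
    means = np.where(is_fast, fast_mean, slow_mean)
    return rng.normal(means, sigma)

old = sample(fast_mean=0.5, slow_mean=1.3, fast_frac=0.47)  # before the optimisation
new = sample(fast_mean=0.4, slow_mean=1.2, fast_frac=0.44)  # after: each hump 0.1 ms faster

bins = np.linspace(0, 2, 100)
plt.hist(old, bins=bins, alpha=0.5, color="tab:blue", label="old")
plt.hist(new, bins=bins, alpha=0.5, color="tab:orange", label="optimised")
plt.xlabel("response time, ms")
plt.ylabel("number of responses")
plt.legend()
plt.show()
```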
The problem here is that the system has two modes of operation, as witnessed by the two overlapping "bell-shaped curves". Each of them is meant to improve by 100 microseconds independently. Suppose the mean response time for the left hump is 0.5 ms, and for the right hump it is 1.3 ms. Then the improvement on the lower hump should be around 25% (0.5/0.4 = 1.25x!), and on the higher hump around 8% (1.3/1.2 ≈ 1.08x). The theoretical distribution of requests between the two modes should be 2:1. So overall we should observe: (2/3*0.5 + 1/3*1.3)/(2/3*0.4 + 1/3*1.2) = 0.767/0.667 ≈ 1.15x - about 15% improvement.
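The same back-of-the-envelope calculation as a tiny snippet (a sketch only; `overall_speedup` is a name I made up):

```python
def overall_speedup(fast_frac, fast_ms, slow_ms, saved_ms=0.1):
    """Speedup of the mixture mean when each hump gets `saved_ms` faster."""
    slow_frac = 1 - fast_frac
    old = fast_frac * fast_ms + slow_frac * slow_ms
    new = fast_frac * (fast_ms - saved_ms) + slow_frac * (slow_ms - saved_ms)
    return old / new

print(overall_speedup(2/3, 0.5, 1.3))  # ~1.15 - the expected 15%
```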
Not even close! In reality we observed 7%.
Turns out, in reality the split between the requests is not 2:1, but more like 0.47:0.53. OK, but shouldn't we then at least get (0.47*0.5 + 0.53*1.3)/(0.47*0.4 + 0.53*1.2) = 0.924/0.824 ≈ 1.12x - about 12% improvement? Why only 7%?
Well, that is because in the test run with the optimisation the split was 0.44:0.56. And this brings about Simpson's paradox: an apparent improvement in each category does not result in nearly as big an improvement when the categories are considered as a whole.
|         | Fraction, OOB | Fraction, Optimised | Response time (ms), OOB | Response time (ms), Optimised | Improvement |
|---------|---------------|---------------------|-------------------------|-------------------------------|-------------|
| Overall | 1.0           | 1.0                 | 0.914                   | 0.848                         | 1.07x       |
| Fast    | 0.47          | 0.44                | 0.5                     | 0.4                           | 1.25x       |
| Slow    | 0.53          | 0.56                | 1.3                     | 1.2                           | 1.08x       |
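And here is the whole story in one sketch, using the round 0.5/1.3 ms hump means from above (which is why the modelled overall baseline comes out 0.924 ms rather than the measured 0.914 ms):

```python
def mean_response(fast_frac, fast_ms, slow_ms):
    # Overall mean as a mixture of the two humps.
    return fast_frac * fast_ms + (1 - fast_frac) * slow_ms

oob = mean_response(0.47, fast_ms=0.5, slow_ms=1.3)  # 0.924 ms (measured: 0.914)
opt = mean_response(0.44, fast_ms=0.4, slow_ms=1.2)  # 0.848 ms

print(f"fast hump:  {0.5 / 0.4:.2f}x")               # 1.25x
print(f"slow hump:  {1.3 / 1.2:.2f}x")               # 1.08x
print(f"overall:    {oob / opt:.2f}x")               # ~1.09x - most of the gain is gone

# Had the split stayed at 0.47:0.53, we would have seen:
print(f"same split: {oob / mean_response(0.47, 0.4, 1.2):.2f}x")  # ~1.12x
```

Each hump still delivers its full 1.25x and 1.08x; the aggregate gets dragged down first by the fast mode being rarer than the 2:1 model assumed, and then further by the three percentage points of traffic that migrated to the slow mode in the optimised run.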