Apologies for being so late to the party. I have been busy for a while and didn't notice this thread.
As a statistician by trade, I think I can say something about your "experiment design".
I didn't pay attention to the 85mm images at all, since they aren't part of the f/2.8 vs f/4 discussion for the 70-200mm. Your low-light test images are fine. However, the rest of the bright-light test images are useless for your comparison experiment. You used different apertures and shutter speeds when you shot the same objects. To compare image quality between those two lenses, you have to keep everything else completely identical, so that any difference you see can only come from the lens. Otherwise, you are just comparing apples and oranges.
As to your low-light comparison, with both the f/2.8 and f/4.0 lenses you used f/8 + 1/60. That's almost the best possible way to use the f/4.0 lens in that lighting. A better way to test their low-light capability would be to use f/4 + (1/10, 1/16, ..., 1/800) and let the ISO float, as real-world shooting would demand. That way, you can test their VR capabilities as well.
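To give a rough idea of what "let the ISO float" means in numbers, here is a minimal sketch. It assumes the usual exposure reciprocity: for the same scene brightness, the required ISO scales with the square of the f-number and inversely with the shutter time. The function name `equivalent_iso` and the baseline ISO of 1600 are my own placeholders, not values from your test, and the shutter speeds are just a few sample points.

```python
def equivalent_iso(base_iso, base_f, base_t, new_f, new_t):
    """Return the ISO that gives the same exposure as the baseline
    when the aperture (f-number) and shutter time (seconds) change.

    Assumes constant scene brightness and simple reciprocity:
    ISO is proportional to f_number**2 / shutter_time.
    """
    return base_iso * (new_f / base_f) ** 2 * (base_t / new_t)


# Hypothetical baseline: the f/8 + 1/60 setting from the test.
# The ISO value is a placeholder, not a number taken from the thread.
BASE_ISO = 1600
BASE_F = 8.0
BASE_T = 1 / 60

# A few sample shutter speeds at f/4 (illustrative, not a full series).
for denom in (10, 60, 200, 800):
    iso = equivalent_iso(BASE_ISO, BASE_F, BASE_T, 4.0, 1 / denom)
    print(f"f/4 at 1/{denom}s -> ISO of about {iso:.0f}")
```

The slow end of the series is where VR matters, and the fast end is where high-ISO noise matters, so sweeping the shutter speed at a fixed aperture exercises both.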
You are right about using data and facts to prove or disprove any statement or hypothesis. Only idiots would ignore a valid experiment and stick their heads in the sand -- "believe what they believe", really?? However, your experiment is set up to favor the f/4.0 lens, so to speak. Hence, your conclusion may rest on slippery ground.