The Information Bottleneck (IB) objective $\min_{\text{encoder}} \, I(X;Z) - \beta \cdot I(Z;Y)$ was proposed as an explanation for generalization in deep learning: networks allegedly compress the input during a distinct training phase. What is the main critique of this explanation?
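
For concreteness, here is a minimal sketch of how this objective is typically optimized in practice, following the variational IB bound of Alemi et al. (2017): a Gaussian encoder's KL term upper-bounds $I(X;Z)$ and a cross-entropy term lower-bounds $I(Z;Y)$. The function names and the placement of $\beta$ (here weighting the compression term, which is the stated objective rescaled by $1/\beta$) are illustrative assumptions, not part of the question.

```python
import torch
import torch.nn.functional as F

def sample_z(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vib_loss(mu, logvar, logits, targets, beta=1e-3):
    """Variational IB loss: prediction term plus beta-weighted compression term.

    KL[q(z|x) || N(0, I)] has a closed form for a Gaussian encoder and
    upper-bounds I(X;Z); cross-entropy lower-bounds I(Z;Y) up to a constant.
    """
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
    ce = F.cross_entropy(logits, targets)
    # beta here weights compression (Alemi et al. parameterization); this is
    # equivalent to the stated objective with beta_here = 1 / beta_stated.
    return ce + beta * kl
```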