


When we come up with a new model in NLP and machine learning more generally, we usually look at some performance metric (one number), compare it against the same performance metric for a strong baseline model (one number), and if the new model gets a better number, we mark it in bold and declare it the winner. For anyone with a background in statistics or a field where conclusions must be drawn on the basis of noisy data, this procedure is frankly shocking. Making a claim based on a bare difference of two numbers is unthinkable. Suppose model A gets a BLEU score one point higher than model B: Is that difference reliable? If you used a slightly different dataset for training and evaluation, would that one-point difference still hold? Would the difference even survive running the same models on the same datasets but with different random seeds? In fields such as psychology and biology, it is standard to answer such questions using standardized statistical procedures to make sure that differences of interest are larger than some quantification of measurement noise. Yet statistical procedures remain rare in the evaluation of NLP models, whose performance metrics are arguably just as noisy.
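To make the question concrete, here is a minimal sketch of one such procedure: a paired bootstrap resampling test in the spirit of Koehn (2004), which asks how often one system's advantage over another survives resampling of the test set. The function name, toy data, and use of per-example scores are illustrative assumptions; in a real MT evaluation one would recompute corpus-level BLEU on each resample rather than averaging sentence-level scores.

```python
import numpy as np

def paired_bootstrap_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).

    scores_a, scores_b: per-example metric scores for the SAME test
    examples under systems A and B (an illustrative simplification;
    corpus-level metrics like BLEU would be recomputed per resample).
    Returns the observed mean difference and the fraction of resamples
    in which A fails to beat B (a one-sided bootstrap p-value).
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(scores_a)
    assert len(scores_b) == n, "scores must be paired on the same examples"

    observed_delta = scores_a.mean() - scores_b.mean()

    # Resample test-set indices with replacement and recompute the
    # difference on each pseudo test set.
    idx = rng.integers(0, n, size=(n_resamples, n))
    deltas = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    p_value = float(np.mean(deltas <= 0.0))

    return observed_delta, p_value

if __name__ == "__main__":
    # Hypothetical data: system A is one point better on average, but the
    # per-sentence noise is several times larger than that advantage.
    rng = np.random.default_rng(42)
    base = rng.uniform(10.0, 50.0, size=200)
    scores_b = base
    scores_a = base + rng.normal(loc=1.0, scale=8.0, size=200)

    delta, p = paired_bootstrap_test(scores_a, scores_b)
    print(f"observed difference: {delta:.2f} points, bootstrap p ~ {p:.3f}")
```

Whether the one-point advantage survives depends on the noise level and the test-set size; the point is that the bootstrap quantifies this, rather than leaving it to a bold number in a table.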

Into this situation comes the book under review, Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science, by Stefan Riezler and Michael Hagmann. The authors argue that the train-test split paradigm does not in fact insulate NLP from problems relating to the validity and reliability of its models, their features, and their performance metrics. They present numerous case studies to prove their point, and advocate and teach standard statistical methods as the solution, with rich examples of successful application to problems of NLP system evaluation and interpretation. The statistical methods are presented with motivation from first principles, with deep citations into the fundamental statistical literature and formal derivations presented at the right level for an average NLP or ML practitioner to follow.
