UNDERSTANDING STATISTICS FOR THE INTERPRETATION OF TRAINING AND TESTING DATA
By Shaun McLaren
December 05, 2018
The collection, analysis and interpretation of training and testing data is a routine process in the physical preparation of athletes. We, as practitioners and coaches, utilize these data in a cyclical decision-making process that aims to maximize fitness and the readiness to compete, whilst minimizing fatigue and the risk of injury or illness. Ensuring we collect the right data is an important part of this process (e.g. valid, reliable and feasible measures), but equally important is ensuring rigor and accuracy in the way we interpret the data. This usually involves making a call on changes and trends within the individual, which are underpinned by both noise and practical importance.
In this article, I’m going to draw on the concept of quantifying noise and practical importance within testing and training data, and give an overview of some methods that can be used to incorporate both concepts into making an inference on changes within an individual. Everything discussed is merely my interpretation of great work by others, such as Professors Alan Batterham, Greg Atkinson and Will Hopkins (noted in the bibliography), which I’ve found immense benefit from incorporating in my practice after being taught by Dr Matthew Weston and following Dr Martin Buchheit’s research output. I hope this article can serve as a useful signposting point for others to benefit in the same way; that is, practitioners looking to interpret their athletes’ training and testing data in a fairly rigorous, quantitative fashion.
Noise, or error, is what we don’t like. It makes our job (of interpreting data) much harder. We need to know the noise in our measures to be able to say what’s real and what’s not. By real, we mean something that’s true, systematic, or actual—i.e. not due to measurement error or random biological variation. And those are the two primary sources of noise that we often observe in testing and training data. First, there’s always going to be measurement error in the equipment or methods we use to collect data. This might be from the hardware and technology itself, or the fact that we’re using a practical alternative to a criterion measure. Random variation in our biological system refers to the natural fluctuations in the way the body functions or responds to a stimulus. It influences the repeatability of training and testing data, which could be due to factors such as diurnal variation, circadian rhythm, sleep, mood, stress, etc. Ultimately, more noise means more uncertainty when interpreting changes in the data.
A good example to help conceptualise how noise influences the repeatability of performance or responses during training and testing is the assessment of height in a countermovement jump (CMJ). We’ll typically test our athletes using a system which measures flight time and then estimates jump height via a formula. Here are two sources of measurement error: first, in the technology we’re using to measure flight time and second, in the algorithms or calibration equations that we (or the proprietary software) are using to estimate jump height. Now, we might ask our athletes to perform 5 jumps and return 48–72 hours later, following complete rest, to perform another 5 jumps. It’s almost intuitive that we will never expect to record the exact same jump height twice for a given individual in their 10 jumps. Why? Because as well as measurement error, there’s random variation on a jump-to-jump and day-to-day basis. It’s unlikely that these performance fluctuations are caused by true changes in fitness or fatigue; rather, the random variation in jump performance could be explained by some of the aforementioned factors plus aspects of technique and motivation.
The above example highlights a very important point when interpreting training and testing data: we almost never measure true performance. We do observe performance that could be within a likely range of the true performance, though. So how do we overcome this issue of noise? In the first instance, we probably want to select tests and utilise measures that we know to have low or acceptable measurement error and that aren’t subject to large biological variation. Unfortunately, this is a luxury that’s often scarce in the world of athletic performance! So, in any instance, we should try to quantify the noise in our data, and do so over the same period in which we want to make a call on a change in the individual (e.g. day-to-day, week-to-week, month-to-month, etc.). This is really important to keep in mind. It might not make much sense to quantify the noise of a test or training performance over a two-day period if we are interested in monitoring changes on a weekly or monthly basis, which are likely subject to greater noise.
We can quantify noise as a within-athlete standard deviation (SD) from testing the same athlete multiple times when we don’t expect the performance to change (say a minimum of 10 measurements, as a ballpark figure). In this setting, the within-athlete SD can be referred to as the typical (standard) error of measurement. If it’s not feasible to test the same athlete multiple times when we don’t expect the measure to change (which is usually the case), we can estimate the typical error from a test-retest study. In this type of study, we measure a group of athletes at least twice, when we believe they shouldn’t change their performance in a given test (at least, theoretically). In actual fact, we know performance will randomly fluctuate within each individual, and that’s what we’re trying to capture. We can then make an estimate of the typical error (explained later) and use it to calculate a confidence interval (Figure 1). A confidence interval is the likely range of the true value and represents uncertainty in the observed performance or change in performance that we’re measuring (more on how to get this later also). This is a crucial and extremely advantageous step when interpreting training and testing data.
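As a minimal sketch of the repeated-measures case, the within-athlete SD can be computed directly from one athlete's scores. The jump heights below are hypothetical, and the function name is only illustrative; expressing the typical error as a percentage of the mean is an optional extra.

```python
from statistics import mean, stdev

# Hypothetical CMJ heights (cm) for one athlete, tested 10 times over a
# period in which no real change in performance is expected.
jump_heights_cm = [38.2, 37.5, 39.1, 38.8, 37.9, 38.4, 39.0, 37.6, 38.7, 38.3]


def typical_error(scores):
    """Within-athlete SD (sample SD, n - 1 denominator) of repeated scores."""
    return stdev(scores)


te_cm = typical_error(jump_heights_cm)
# The same noise expressed relative to the athlete's mean score (%).
te_percent = 100 * te_cm / mean(jump_heights_cm)

print(f"Typical error: {te_cm:.2f} cm ({te_percent:.1f}%)")
```

The sample SD is used here because the 10 jumps are treated as a sample of the athlete's possible performances, not the complete set.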
Practical importance might often be referred to as something substantial, worthwhile, or meaningful; i.e. it makes a real-world difference. Let’s pretend for a minute that we live in a fictitious, robotic setting, where noise doesn’t exist (or it’s small enough to not care about). What we observe is the true performance in a test every single time. Quantifying practical importance boils down to this: what is the change or difference that we need to see to make an impact on something that we care about? We need a value to represent this and we want to scale our athlete’s changes in performance against it (Figure 2).
For top-level athletes whose actual competitive performance can be measured in and outside the arena (e.g. time trial, distance thrown, weight lifted, etc.), an improvement equivalent to one third of the competition-to-competition variability (SD or coefficient of variation) results in an extra medal every 10 competitions. But what if we don’t work in these sports, or we’re looking to interpret training and testing data that isn’t competitive performance? Thinking back to the whole reason why we might be monitoring or testing our athletes in the first place, we could be using an indicator of fitness, fatigue, health, wellbeing, or the risk of injury and illness. You’ve got to acknowledge that while these constructs are important, and often the crux of our jobs as coaches and scientists, they’re multifactorial and extremely difficult to measure even with several surrogate indicators. What’s equally difficult is associating a change in one of these indicators with competitive performance, such as a technical/tactical outcome (e.g. total shots on target or percentage of effective rucks), or selection/training availability. Trying to make these links is important so that we can make the right call when interpreting data to plan, adjust or evaluate training programmes.
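For the competitive-performance case, the one-third rule is simple arithmetic. The sketch below assumes we only have one athlete's results across a season and use their SD as the competition-to-competition variability; the race times and function name are hypothetical, and published estimates of this variability usually come from much larger datasets.

```python
from statistics import mean, stdev


def one_third_of_variability(results):
    """Return one third of the competition-to-competition SD and CV (%)."""
    sd = stdev(results)
    cv_percent = 100 * sd / mean(results)
    return sd / 3, cv_percent / 3


# Hypothetical 1500-m times (s) for one athlete across six competitions.
race_times_s = [225.0, 227.0, 224.0, 226.0, 228.0, 226.0]

threshold_s, threshold_percent = one_third_of_variability(race_times_s)
print(f"Worthwhile improvement: {threshold_s:.2f} s ({threshold_percent:.2f}%)")
```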
Thresholds for practical importance can be derived from two primary sources: anchor-based and distribution-based. An anchor-based approach might often be referred to as the minimum clinically important difference (MCID)—although in reality, as coaches interested in sports performance, minimum practically important difference might be a better term. Here, we could look within the literature for a prognostic- or validity-type study in which our measure has been used as the ‘predictor’ for something we care about (e.g. a substantial change in competitive performance, health, fitness or fatigue). An anchor-based approach would be my first, go-to point in trying to find a reference value for practical importance because it is a robust method with high ecological validity, allowing us to relate changes in training and testing data to real-world outcomes. On the flipside, unless MCID are obvious (e.g. a change of 1 unit on a linear perceptual scale, such as pain), they’re extremely difficult to determine and often require very large studies with complex analyses.
Utilising an opinion-based MCID is a sub-category of an anchor-based approach that might offer an applied alternative to a robust, research-based anchor. Here, we can look to utilise information existing within the field that may not exist within the literature, such as the knowledge and experiences of expert coaches or practitioners. In his Essentials Guide to Velocity-Based Training, Dr. Dan Baker notes that changes of ~0.04 m·s⁻¹ from the best velocity scores with a given resistance > 80% one-repetition maximum usually indicate a change in maximum strength of 2.0–2.5%. So, if we’re interested in top-end strength and we’re monitoring bar velocity during training lifts of > 80% one-repetition maximum, then 0.04 m·s⁻¹ might be a good threshold to use for making inference on an athlete being weaker or stronger by a magnitude of 2.0–2.5%. A pro of this approach is that it allows us to utilize historic information from within practice that may be specific to the sport, competition, and athlete group. This information simply doesn’t exist anywhere else. Yet, with this, the credibility of the information relies on the knowledge and experience of the person providing the values.
We might often find ourselves in a position where we can’t obtain an anchor-based value for practical importance in research or practice, though. A distribution-based approach might now be a feasible alternative. This method uses Cohen’s effect size principle and is often referred to as the smallest worthwhile change (SWC). In a distribution-based approach, we look at the typical deviation in how our athletes perform between one another (between-athlete SD) and take a fraction of this to represent the change required to meaningfully ‘move’ their position within this distribution. If we’re happy that our sample of athletes is representative of their population and we have enough of them, this should work OK. To calculate the SWC, we multiply the between-athlete SD for a given test by 0.2. We can use factors of 0.6 and 1.2 to standardize moderate and large effects too. An advantage of this approach is that we can calculate reference values for changes and differences with ease, using a method of strong statistical footing. But in athletes who are all very similar (e.g. elite) or in some particular measures, these thresholds can be extremely small (increasing the risk of us making the wrong call by saying a change is important, when it isn’t). Furthermore, we don’t know if a distribution-based difference or change makes any difference to something we care about in the competition arena (being on the pitch in the first place or outperforming the opposition thereafter, for example).
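The distribution-based thresholds described above can be sketched in a few lines. The baseline sprint times are hypothetical and the function name is only illustrative; the 0.2, 0.6 and 1.2 multipliers are the ones given in the text.

```python
from statistics import stdev


def magnitude_thresholds(baseline_scores):
    """Small, moderate and large thresholds as fractions of the between-athlete SD."""
    between_sd = stdev(baseline_scores)
    return {
        "small (SWC)": 0.2 * between_sd,
        "moderate": 0.6 * between_sd,
        "large": 1.2 * between_sd,
    }


# Hypothetical baseline 20-m sprint times (s), one value per athlete.
baseline_20m_s = [3.05, 3.12, 2.98, 3.20, 3.10, 3.01, 3.15, 3.08]

for label, value in magnitude_thresholds(baseline_20m_s).items():
    print(f"{label}: {value:.3f} s")
```

With a very homogeneous squad the between-athlete SD shrinks, and so do all three thresholds, which is exactly the limitation noted above.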
Using practical importance and noise to make a call on the individual.
Hopefully it’s clear that understanding both noise and practical importance is crucial when trying to interpret changes in training and testing data. One essential consideration is that practical importance is not noise and noise is not practical importance. They can’t and shouldn’t be used interchangeably. Sticking with one in neglect of the other only tells you part of the whole picture. A common pitfall can be using noise, or a derivative of noise such as minimum detectable change, as a threshold for practical importance. You might well detect a ‘real’ change, but you don’t know if this change is practically important, or not!
So, how can noise and practical importance be combined to interpret a change in training or testing data? Let’s say we’ve managed to get an estimate of typical error and used it to work out a confidence interval for an observed change. We now have a likely range of the true change that we could scale against a threshold for practical importance (Figure 3).
We’ll probably also want to look at how much of our confidence interval clears the threshold for importance, particularly the lower end of the confidence interval (the ‘lowest’ that the true change could be in relation to no change at all). If we observe a change of 4%, where the 90% confidence interval is ~0.5% to ~9.0% and the MCID or SWC is 2%, we could say with some degree of certainty that the 4% change is real and important (Figure 3).
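To make the visual check concrete, here is a rough sketch that reports whether the confidence interval excludes zero and what share of its width lies beyond the threshold. The share of the interval's width is only a crude stand-in for what the eye sees on the chart, not a probability, and the function and key names are hypothetical. The inputs are the 4% example from the text.

```python
def describe_change(lower, upper, threshold):
    """Summarise a confidence interval for a change against +/- threshold."""
    width = upper - lower
    # Share of the interval's width lying above +threshold.
    above = max(0.0, upper - max(lower, threshold)) / width
    # Share of the interval's width lying below -threshold.
    below = max(0.0, min(upper, -threshold) - lower) / width
    return {
        "excludes_zero": lower > 0 or upper < 0,
        "fraction_above": above,
        "fraction_below": below,
        "fraction_trivial": 1.0 - above - below,
    }


# 90% confidence interval of 0.5% to 9.0% and a threshold (MCID or SWC) of 2%.
summary = describe_change(lower=0.5, upper=9.0, threshold=2.0)
for key, value in summary.items():
    print(key, value)
```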
This is a feasible approach, but it all sounds a bit subjective and open to interpretation, don’t you think? What if we could quantify what our eyes are seeing as a probability; that is, the probability that the true change is greater than our threshold for practical importance? Well, we can, and there are a couple of feasible approaches. One such approach that I’ve found to be extremely useful is magnitude-based inference (MBI; Batterham & Hopkins, 2006). Magnitude-based inference offers a robust and accessible method for practitioners and coaches to quantify the above scenario on a per-athlete basis. This allows us to work out the probability that an athlete’s true change is greater than the MCID or SWC, using probabilistic connotations to convey the individual response as being possibly, likely, very likely or almost certainly substantial. Group-level MBI is commonplace in the sport science literature (making inference on mean differences), but MBI on an individual level is far less reported and possibly less common in practice. The technicalities of MBI and how to apply it to an individual’s change are well documented elsewhere (Batterham & Hopkins, 2006; Hopkins, 2004; Hopkins, 2017), so I won’t go into that here. I’ll use the next section to work through an example of how it’s all pulled together. We’ll look at the individual responses to a velocity-based training programme.
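As a rough illustration of the idea (not the published spreadsheets), the chances that the true change is substantially lower, trivial or substantially higher can be sketched as below. Two assumptions are made here: a normal approximation is used for brevity, which will be slightly optimistic when the typical error comes from a small sample, and the probability cut-offs attached to each qualitative term are the commonly published ones rather than values stated in this article. The typical error and SWC are the CMJPPO values reported later in the article; the observed change is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chances(observed_change, typical_error, threshold):
    """Chances that the true change is below -threshold, trivial, or above +threshold."""
    sd_change = typical_error * sqrt(2)  # typical error -> SD of change scores
    true_change = NormalDist(mu=observed_change, sigma=sd_change)
    p_below = true_change.cdf(-threshold)
    p_above = 1.0 - true_change.cdf(threshold)
    return p_below, 1.0 - p_below - p_above, p_above


def descriptor(p):
    """Qualitative term for a probability (cut-offs are an assumption)."""
    scale = [(0.005, "most unlikely"), (0.05, "very unlikely"), (0.25, "unlikely"),
             (0.75, "possibly"), (0.95, "likely"), (0.995, "very likely")]
    for upper, term in scale:
        if p < upper:
            return term
    return "almost certainly"


# Typical error 202 W and SWC 149 W; the +500 W observed change is hypothetical.
p_below, p_trivial, p_above = chances(500.0, 202.0, 149.0)
print(f"lower {p_below:.1%}, trivial {p_trivial:.1%}, higher {p_above:.1%}")
print("Substantially higher:", descriptor(p_above))
```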
The following data were kindly provided by Dr. Jonathon Weakley. In this training study, 28 semi-professional rugby union players completed a four-week pre-season training period (field-based technical-tactical sessions, resistance and plyometric training, speed sessions, etc). During speed and resistance training, players’ sprint split times and mean concentric bar velocity were monitored on a per-rep basis, with feedback provided to just over half of the athletes (n = 16), while the others received no feedback and acted as a control (n = 12). Players were tested for a range of fitness qualities before and after the intervention, including 20-m time during a linear sprint (20-m) and peak power output during a bodyweight CMJ (CMJPPO). The raw data and change scores for 20-m are provided in Table 1 to facilitate the example.
Table 1. Raw data and change scores for 20-m time during a linear sprint before and after a four-week training period of velocity-based feedback or no feedback (control) in 28 rugby union athletes.
In this scenario, the control group act as the perfect test-retest reliability study to obtain an estimate of four-week typical error in 20-m and CMJPPO. We test the athletes, let them undergo ‘normal’ activities, and re-test them four weeks later. To estimate the four-week typical error, we work out everybody’s pre–post change (the column labelled ‘Change’ in Table 1), take the between-athlete SD of these changes, then divide it by the square root of 2 (Hopkins, 2000). In our sample, the four-week typical error turns out to be 0.03 seconds (s; about 1%) for 20-m and 202 Watts (W; about 3.5%) for CMJPPO. We’ll now be able to work out a confidence interval for each player’s change in the intervention using these typical errors. To do this, the typical error should be converted back to the SD of change scores (multiply it by the square root of 2) then multiplied by something called a t-value (two-tailed inverse of the Student's t-distribution) for whichever level of confidence we want to apply. This is best given by Microsoft Excel’s TINV function, but if the typical error came from a reasonable sample (near 30 or above) we could simply use a fixed z-value (Table 2).
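The typical error calculation just described (SD of the change scores divided by the square root of 2) can be sketched as follows. The control-group sprint times here are hypothetical stand-ins, not the values from Table 1, and the function name is only illustrative.

```python
from math import sqrt
from statistics import stdev


def typical_error_from_retest(pre, post):
    """Typical error = SD of the pre-post change scores divided by sqrt(2)."""
    changes = [after - before for before, after in zip(pre, post)]
    return stdev(changes) / sqrt(2)


# Hypothetical 20-m times (s) for a control group tested four weeks apart.
control_pre = [3.10, 3.05, 3.21, 2.98, 3.15, 3.08]
control_post = [3.12, 3.01, 3.24, 3.00, 3.11, 3.10]

te = typical_error_from_retest(control_pre, control_post)
print(f"Four-week typical error: {te:.3f} s")
```

Dividing by the square root of 2 reflects that each change score contains the noise of two tests (pre and post), so the SD of changes overstates the noise in a single test.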
Table 2. Z-values for confidence intervals and limits.
Using our example data (12 players tested twice, giving the above typical errors) and Microsoft Excel’s TINV function, the 90% confidence limits (in raw units) for observed changes in 20-m and CMJPPO should be ±0.08 s and ±513 W, respectively.
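These limits can be checked with a short calculation. The t-value is hard-coded here as an assumption: 1.796 is the two-tailed 90% value for 11 degrees of freedom (12 players minus 1), which is what Excel's TINV(0.10, 11) returns to three decimal places. The function name is hypothetical.

```python
from math import sqrt

# Two-tailed 90% t-value for 11 degrees of freedom (12 control players - 1).
T_90_DF11 = 1.796


def confidence_limit(typical_error, t_value):
    """Half-width of the confidence interval for an individual's observed change."""
    sd_change = typical_error * sqrt(2)  # back to the SD of change scores
    return sd_change * t_value


cl_20m = confidence_limit(0.03, T_90_DF11)   # typical error of 0.03 s
cl_cmj = confidence_limit(202.0, T_90_DF11)  # typical error of 202 W

print(f"20-m: +/-{cl_20m:.2f} s, CMJPPO: +/-{cl_cmj:.0f} W")
```

Rounded, these come out at 0.08 s and 513 W, in line with the limits quoted above. Each player's 90% confidence interval is then their observed change plus and minus the relevant limit.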
Now, we’ll use a distribution-based approach to calculate the threshold for a small/substantial change in each measure. Here, we take the between-athlete SD of everybody’s ‘pre’ scores (Column 3 in Table 1, n = 28) and multiply it by 0.2. The SWC comes out as 0.03 s (about 1%) for 20-m and 149 W (just over 3%) for CMJPPO. Using specially developed spreadsheets (Hopkins, 2004; Hopkins, 2017), we can now assess the probability/likelihood that each player’s change in the feedback group was greater than the SWC, taking into account the likely range of their true change, via MBI. We’ll use the standard definition of a non-clinical inference, whereby the inference is unclear if the true change could exceed the SWC in both a positive and negative direction by a likelihood of 5%.
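For readers who want to see the logic outside a spreadsheet, here is a minimal sketch of the per-player calculation with the 5% 'unclear' rule. It is an illustration under stated assumptions, not a reproduction of the Hopkins spreadsheets: the t-distribution is integrated numerically with Simpson's rule to stay within the Python standard library, the function names are hypothetical, and the two observed changes are made up. The typical error, SWC and degrees of freedom follow the 20-m example.

```python
from math import gamma, pi, sqrt


def t_cdf(x, df, steps=2000):
    """Student's t CDF via Simpson's rule on the density (steps must be even)."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    h = abs(x) / steps
    total = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def infer(observed_change, typical_error, swc, df):
    """Chances of a substantial decrease and increase, plus the 5% 'unclear' rule."""
    sd_change = typical_error * sqrt(2)
    p_decrease = t_cdf((-swc - observed_change) / sd_change, df)
    p_increase = 1.0 - t_cdf((swc - observed_change) / sd_change, df)
    unclear = p_decrease > 0.05 and p_increase > 0.05
    return p_decrease, p_increase, "unclear" if unclear else "clear"


# Typical error 0.03 s, SWC 0.03 s, 11 degrees of freedom; changes are hypothetical.
# For sprint time, a decrease means a faster time.
for change in (-0.10, -0.01):
    p_faster, p_slower, verdict = infer(change, 0.03, 0.03, df=11)
    print(f"{change:+.2f} s: faster {p_faster:.1%}, slower {p_slower:.1%}, {verdict}")
```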
Displayed in Figures 4 and 5 are the individual responses to feedback for 20-m and CMJPPO respectively. The graphs display the observed change (individual points, x-axis) for each player, with the 90% confidence intervals (grey bars). The SWC thresholds are marked as the edges of the grey shaded area (zone for trivial, or no substantial change). For 20-m, the white area to the left of the SWC threshold represents a substantially quicker time following the feedback intervention. For CMJPPO, the white area to the right of the SWC threshold represents substantially greater power following the feedback intervention. To the right of each chart is the final qualitative inference answering the question “did the player respond to feedback?”, along with the quantitative probabilities (% chance) that the true change is less than (green for 20-m [faster], red for CMJPPO [less power]), equal to (grey), or greater than (red for 20-m [slower], green for CMJPPO [more power]) the SWC.
For 20-m, players 1, 7 and 14 almost certainly responded to feedback. Players 4, 5, 10 and 16 were very likely responders, players 6, 8, 9, 11, 12 and 13 were likely responders, player 2 was possibly a responder and the inference for player 15 was unclear. There seem to be consistent, positive individual responses to feedback for 20-m, with most players’ post-intervention times being at least likely faster following the feedback intervention.
For CMJPPO, player 5 almost certainly responded to feedback. Players 4, 7 and 16 were likely responders, players 3 and 6 were very and most unlikely responders, respectively, and all other inferences were unclear. For the clear inferences, there seem to be quite varied individual responses, with some players showing substantially greater CMJPPO and some showing substantially less. This could be explained by some of the physiological, psychological and social factors that are known to moderate the response to feedback (amongst others), but given the number of unclear inferences, we may not want to dig too deep into this at the moment.
What’s really important about the way in which we interpret these results is retaining uncertainty when articulating findings and making decisions. We should never force a dichotomy in saying that an athlete DID or DID NOT respond, because it’s counterintuitive to the approach we have taken (and the world we live in!). Instead, we need to acknowledge that we have chosen a method that gives us a plausible range of changes in each athlete, compatible with the way in which we have analysed the data. This, for me, is when a method becomes a philosophy.
Finally, we could tighten or loosen the constraints on our MBI, depending on if it seems justified to do so. This could be done by changing the level of confidence (e.g. 99% or 80%) or magnitude threshold for practical importance (e.g. 0.6 × between-athlete SD, for a distribution-based approach). It’s important that any changes are well-justified from both a statistical and conceptual perspective, however.
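To get a feel for how much the level of confidence matters, the sketch below recomputes the confidence limits for a change in 20-m at 80%, 90% and 99%. It uses z-values for simplicity, which is an assumption that only holds well when the typical error comes from a reasonably large sample; with a small sample the t-based limits would be somewhat wider. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_value(confidence_percent):
    """Two-tailed z-value for a given confidence level (e.g. 90 -> about 1.645)."""
    tail = (1 - confidence_percent / 100) / 2
    return NormalDist().inv_cdf(1 - tail)


typical_error = 0.03  # s, the four-week typical error for 20-m

for level in (80, 90, 99):
    limit = typical_error * sqrt(2) * z_value(level)
    print(f"{level}% confidence limits: +/-{limit:.3f} s")
```

Wider limits (99%) make clear inferences harder to reach, while narrower limits (80%) make them easier, so the choice should reflect how costly a wrong call would be.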
In a world where we chase practically meaningful outcomes, we owe it to our athletes to not fear noise and uncertainty, but to embrace it. Considering both noise and practical importance should be an integral part of the athlete decision-making process. I hope the conceptual framework presented here (visualized in Figure 6) is useful to those responsible for interpreting training and testing data to help guide athlete management, training prescription, and ultimately, maximize performance potential.
REFERENCES AND RESOURCES
Atkinson G, Batterham AM. True and false interindividual differences in the physiological response to an intervention. Exp Physiol. 2015;100(6):577–588.
Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 1998;26(4):217–238.
Batterham AM, Hopkins WG. Making meaningful inferences about magnitudes. Int J Sports Physiol Perform. 2006;1(1):50–57.
Buchheit M. The numbers will love you back in return—I promise. Int J Sports Physiol Perform. 2016;11:551–4.
Buchheit M. Want to see my report, coach? Aspetar Sports Med J. 2017;6:36–43.
Hopkins WG. A spreadsheet for monitoring an individual's changes and trend. Sportscience. 2017;21:5–9.
Hopkins WG. How to interpret changes in an athletic performance test. Sportscience. 2004;8(1):1–7.
Hopkins WG. Measures of reliability in sports medicine and science. Sports Med. 2000;30(1):1–15.
Swinton PA, Hemingway BS, Saunders B, Gualano B, Dolan E. A statistical framework to interpret individual response to intervention: paving the way for personalized nutrition and exercise prescription. Front Nutr. 2018;5:41.
Ward P, Coutts AJ, Pruna R, McCall A. Putting the “I” back in team. Int J Sports Physiol Perform. 2018 (ahead of print).
BIOGRAPHY: SHAUN MCLAREN
Shaun is a sport scientist and strength & conditioning coach who has been involved with the physical preparation of athletes for several years. He currently works as a sport scientist with England Rugby League and as a Research Assistant with Leeds Beckett University. Shaun’s main areas of practice and research include the programming, monitoring and evaluation of training. Shaun can be found on Twitter @Shaun_McLaren1.