VALIDATION: A HISTORICAL NARRATIVE REVIEW OF PUSH BAND 1.0 RESEARCH (PART 2)
By Chris Chapman
June 12, 2019
NOTE: This is a dynamic document and will be updated as new research is published. Part 1 of this installment can be found here. Scroll down to discussion to go directly to the research review continued.
Whenever a coach or scientist is interested in using a piece of technology in the daily training or testing environment, at the simplest level they need to know whether it does what the company says it does. This is known as construct validity, or “the degree to which a test measures what it claims or purports to be measuring”. In this series we are going to explore the science of weight room technology validation, the current and historical state of the literature, and the processes we use here at PUSH.
This three-part installment will focus on the PUSH Band 1.0, including the literature available to date, how this data is used to improve the product, and how PUSH worked with researchers to dial in methodologies based on the results of these external studies. The variables used to assess validity and reliability will be introduced as they appear in the literature, as you will see a very clear progression in the statistics and selected criteria used to assess the device.
One more thing to note before diving in: the Band 1.0 version of the PUSH device being discussed has been off the market since March 2018 when we released our next generation hardware Band 2.0, yet there are still papers being published to date with two coming out over the last few months. This highlights one of the biggest current issues in the sport technology space: the peer review literature publication model cannot keep up with the speed of technological progression. For practitioners it is good practice to not only do a literature review but to reach out to companies directly to see what internal scientific processes and data are available. Any company pulling their weight will be ahead of, or at the very least on top of the literature with their internal research and development.
The following is a chronological summary table of the third party (external to PUSH), peer reviewed published literature regarding the PUSH 1.0 device.
DISCUSSION (PART 2)
The following is a discussion of the third party (external to PUSH), peer reviewed published literature regarding the PUSH Band 1.0 device. This post only covers articles 4-6. The previous post covers articles 1-3, while articles 7-9 will be covered in the subsequent post.
4. VALIDITY OF VARIOUS METHODS FOR DETERMINING VELOCITY, FORCE AND POWER IN THE BACK SQUAT
Banyard, H.G., Nosaka, K., Sato, K., Haff, G.G. (2017). Validity of various methods for determining velocity, force and power in the back squat. International Journal of Sports Physiology and Performance. 12(9):1170-1176.
This was the first study to be published after I joined PUSH full-time in late 2016 after leaving my position at the Canadian Sport Institute following the Rio Olympic Games. Previous to that, I was a scientific advisor to PUSH, but not heavily involved in their research process other than an occasional data collection I facilitated at the U of Toronto Biomechanics Lab. I was still in the honeymoon phase learning my role and had just started creating a plan and process around product validation and internal research when this paper dropped. Banyard et al. (2017) was definitely incentive to think about and question our current research and innovation practices as the results of this study were unfavourable for PUSH on first read. This one will be a beefy discussion since they did some novel work and there is a lot to unpack here.
This study comes out of Edith Cowan University as part of Harry Banyard’s PhD work in Greg Haff’s lab. He has since published some other great studies challenging and supporting current VBT practice. For this study they looked at a free weight barbell back squat, comparing numerous outputs of a commercially available linear position transducer (LPT) GymAware and the PUSH Band 1.0 versus a 4-LPT system combined with a force plate as the criterion measures. In addition to velocity, they examined mean and peak power and force, being the first to look at estimations of the latter. Given the Band 1.0 is an IMU with an accelerometer, using Newton’s second law (Force = mass * acceleration) we can directly estimate force using the measured acceleration since the mass is known, so this information was valuable to us. The other novel addition Banyard et al. made was to break down the velocities into relative strength zones lifting loads of 20%, 40%, 60%, 80%, 90% and 100% of one repetition maximum (1RM) back squat strength. Previous studies typically used a single- or two-load protocol, which doesn’t assess the full spectrum of velocities that can be measured by these devices.
Looking at the overall results (Figure 1), at first glance both devices look good qualitatively. Velocity trendlines look nearly identical, with the only mean velocity difference for PUSH data occurring at 1RM loads. Force data also trend identically in both tools, with an observed overestimation bias in mean force with the Band 1.0. Power data for the Band 1.0 are also only significantly different in the mean measure at 1RM loads. These trendlines don’t line up as cleanly as the others and you can see more divergence in the mean power plots. As discussed in the last post this is potentially due to the differing mass models used. It should be noted the paper incorrectly states that PUSH uses a system mass model when we use an effective mass model. An effective mass model accounts only for the mass that is accelerating, since not all mass accelerates at the same rate or at the same time throughout the movement. Total system mass models (body mass + external load) are used because they are simpler to implement mathematically. The important point here is knowing the data will not be identical; there will be some discrepancy. In general, you should find that peak measures align more closely than means when comparing power and force between the two mass models, since a peak plucks a single point in time whereas a mean averages the whole time series over the concentric range of motion. This is exactly what is observed in this study, so the results weren’t unexpected to PUSH.
The first thing we cannot discern from the graphs in Figure 1 is the variability in the data, as most of the bars are layered on top of each other. Qualitatively they look rather consistent across the relative load conditions. The next set of figures is where the authors took a deep dive, elegantly breaking down the results statistic-by-statistic for the pooled data (Figure 2) and for each metric by relative load (Figure 3). Up to that point, this type of analysis had not been completed in this type of study validating weight room technology. Pearson correlations (r), coefficients of variation (CV%), and effect sizes (Hopkins’ modified Cohen’s d) were used to evaluate validity. Pre-study cutoffs were implemented using commonly accepted evidence-based standards: r > 0.70 (very high), CV < 10% (moderate), and ES < 0.60 (trivial or small). The standard error of measurement (SEM) was also reported and data were similar to those reported in the previous studies for the Band 1.0.
This was the first time effect sizes were introduced in the assessment of the Band 1.0. Effect sizes inform the reader of the magnitude of any difference between two variables instead of just observing a statistically significant or non-significant difference (p-value). A significant difference might be clinically trivial, which may change the interpretation or application of the results since it may have no actual real world effect. The use of this magnitude-based inference was made popular in the sport science world by Will Hopkins, and arguably has caused one of the more polarizing debates in recent times. Banyard et al. (2017) in this case report both statistics, which to me seems like the simplest solution to the problem since both sides can interpret the results using their preferred statistical method.
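To make the three cutoffs above concrete, here is a minimal Python sketch that checks a set of paired device and criterion values against r > 0.70, CV < 10% and ES < 0.60. The exact formulas used in the paper are not reproduced in this post, so the CV and effect size definitions below are simplified assumptions, and the function names and sample velocities are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cv_percent(device, criterion):
    """Assumed form: SD of device-minus-criterion differences
    expressed as a percentage of the criterion mean."""
    diffs = [d - c for d, c in zip(device, criterion)]
    return 100.0 * stdev(diffs) / mean(criterion)

def effect_size(device, criterion):
    """Assumed form: standardized mean difference using the pooled SD."""
    pooled_sd = sqrt((stdev(device) ** 2 + stdev(criterion) ** 2) / 2)
    return abs(mean(device) - mean(criterion)) / pooled_sd

def meets_cutoffs(device, criterion):
    """Apply the three pre-study cutoffs quoted in the text."""
    r = pearson_r(device, criterion)
    cv = cv_percent(device, criterion)
    es = effect_size(device, criterion)
    return {"r": r, "cv": cv, "es": es,
            "valid": r > 0.70 and cv < 10.0 and es < 0.60}

# Hypothetical mean velocities (m/s) across five loads.
criterion = [0.30, 0.45, 0.60, 0.80, 1.00]
device = [0.33, 0.44, 0.62, 0.79, 1.03]
result = meets_cutoffs(device, criterion)
```

Note that a metric fails as soon as any one of the three checks fails, which is how a device can look fine on a trendline plot and still be classed as not valid at a given load.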
In many cases, the PUSH Band did not meet one of these criteria which then deemed the tool not valid for that metric at that relative load range. This feedback was really good for us since the previous studies were all positive and affirming the performance of the device, while this one highlighted gaps, showing us areas for improvement. The majority of the error was observed at slower velocities (heavier relative loads) which also seemed to drive some of the error in the power data. It made sense that higher errors were observed at slower velocities, since there is less of a change in velocity so signal detection of acceleration is more difficult to achieve cleanly. This provided us with immediate direction for product improvements. In contrast, a positive that came out of this study for the Band 1.0 was the bias observed in previous studies was not evident here for the pooled mean or peak velocity data. An overestimation bias was only observed in mean force and peak power data.
The force data results were the biggest positive outcome for PUSH since it was the first time it was externally examined. When first principles are used it makes perfect sense why the force data were the most accurate - the PUSH Band contains an accelerometer, Newton’s second law is Force = mass*acceleration, the mass is known and acceleration is measured directly, therefore at a superficial level force is a single computation (getting clean signals is the not-so-simple part). This is great because force is used in computations of power, work and impulse, and all three metrics are provided by the PUSH Band. This helped us determine the potential sources of error in these metrics. In this case, it seems velocity is driving at least some of the error in power measures, since both have matching patterns in the plot trendlines, combined with identical significant differences only at the heaviest loads over the pooled dataset.
If we take a step back and look at the results of this study as a whole, they were much more positive to us internally than the article discussion may allude to based on its language and conclusions. The overall data were actually pretty good given PUSH was still a new company trying to innovate and disrupt the weight room tech space when this study was completed (~18 months before publication). Looking at the plots, the Band 1.0 performs pretty consistently except for the heaviest relative loads. Looking at Figure 1 qualitatively, there isn’t much difference in the average datasets, as you can see both tools and the criterion are providing essentially the same plots. While some rep-to-rep discrepancy may be observed, it washes out over the set, which may not matter in some use cases. Being more accessible at 1/10th the cost of the competing tool, the Band 1.0 was definitely viable for a lot of applied velocity-based training (VBT) methodologies, both for general feedback and for specific use cases that didn’t require lab-level data quality. However, the identification of these gaps gave us targets to hit and definitely motivated us to focus efforts on improving the performance of Band 1.0.
First study to look at a freeweight barbell back squat using the PUSH Band 1.0
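As a rough illustration of the first-principles computation described above, the Python sketch below estimates force from a known mass and a measured acceleration, then derives power as force times velocity. It is not the PUSH algorithm: it assumes a clean vertical acceleration trace with gravity already removed (so gravity is added back in the force term), a simple cumulative integration for velocity, and hypothetical function names and sample values.

```python
G = 9.81  # gravitational acceleration, m/s^2

def estimate_force(mass_kg, accel_ms2):
    """Vertical force (N) from Newton's second law, sample by sample.

    accel_ms2 holds vertical accelerations with gravity removed, so the
    force must both support the mass and accelerate it (assumption).
    """
    return [mass_kg * (a + G) for a in accel_ms2]

def integrate_velocity(accel_ms2, dt_s, v0=0.0):
    """Velocity (m/s) by simple cumulative (rectangular) integration."""
    velocity, v = [], v0
    for a in accel_ms2:
        v += a * dt_s
        velocity.append(v)
    return velocity

def estimate_power(force_n, velocity_ms):
    """Instantaneous power (W) as force times velocity."""
    return [f * v for f, v in zip(force_n, velocity_ms)]

# Hypothetical concentric phase: 100 kg of moving mass sampled at 100 Hz.
accel = [2.0, 2.0, 1.0, 0.0, -1.0]
force = estimate_force(100.0, accel)
velocity = integrate_velocity(accel, dt_s=0.01)
power = estimate_power(force, velocity)

mean_force = sum(force) / len(force)  # averages the whole series
peak_force = max(force)               # plucks a single point in time
```

Because velocity comes from integrating acceleration, any noise in the acceleration signal accumulates in velocity and then carries into power, which is consistent with velocity driving some of the error in the power measures.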
First study to assess force estimations using the PUSH Band 1.0
First study to separate data into relative 1RM percentages to assess a full spectrum of loading scenarios
Comprehensive statistical model for validity and reliability
LPT system for criterion measures - alignment of point mass models for kinematic-based measures
LPT system for criterion measures - use of a proprietary 4-LPT system
Force plate + LPT - combining different point-mass models in the same data computation is not an apples-to-apples comparison and can introduce errors in the criterion measures.
Velocity and power data were less valid for Band 1.0 as velocities got slower (heavier relative loads)
Force data were valid on first assessment of Band 1.0
Systematic bias only observed in mean force and peak power in the Band 1.0
The standard error of measurement for velocity was similar to previous studies
Focus development efforts on improving Band 1.0 performance at slower velocities
First assessment of force estimations showed using the PUSH for this measurement is viable - continue further development of kinetic estimations
Previous improvements in Band 1.0 data were successful since systematic bias was not observed in the velocity data
5. VELOCITY BASED TRAINING: VALIDITY OF MONITORING DEVICES TO ASSESS MEAN CONCENTRIC VELOCITY IN THE BENCH PRESS EXERCISE
McGrath, G.A., Flanagan, E.P., O’Donovan, P., Collins, D.J., Kenny, I.C. (2018). Velocity Based Training: Validity of Monitoring Devices to Assess Mean Concentric Velocity in the Bench Press. Journal of Australian Strength and Conditioning. 26(1):23-30.
This next study comes from the University of Limerick and the Sport Institute of Ireland. It was designed similarly to the previous study, with the PUSH Band 1.0 and a commercially available LPT (Tendo Weightlifting Analyzer) examined for validity and reliability. The primary difference is that the criterion measures were collected using 3D motion capture as opposed to a proprietary 4-LPT system. For measuring motion (kinematics) this is the gold standard. Most systems can accurately measure the location of a point and its displacement in 3D space to within 0.5 mm of error. The authors state the average error of their system was 0.57 mm for this study. In contrast to Banyard et al. (2017), only mean velocity outputs were assessed. No peak values and no force or power metrics were considered for examination.
The exercise selected for evaluation was a freeweight barbell bench press, with participants completing 2 sets of 6 repetitions at 40% and 80% of 1RM. There were not as many relative load categories as in the previous study, but at least they assessed a fast and a slow condition, which is much more valuable than a single load on its own. One other thing I really like about this study is that they described in detail the instructions given to the participants. This is something I still find lacking in much of the published research methodology these days. Instruction and demonstration can greatly affect the participants’ performance. For example, relating to velocity output, instructing and enforcing a rapid eccentric-to-concentric contraction versus a pause-to-concentric powerlifting style technique will not only change the resulting values measured by the devices (different event detections) but will also alter the abilities of the performer (utilization of the stretch-shorten cycle). Readers must be aware of this, especially if using these types of studies for normative data, since they may not line up with other data in the literature. In this current study, the authors did a great job and explicitly state a pause was completed at the bottom of the press prior to the concentric phase, and this was purposely done as a control to acquire better data in detecting the bottom range of motion.
Descriptive results of this study are provided (Table 2). The PUSH Band 1.0 showed an underestimation bias in mean velocity measurements across all conditions versus the criterion measure. The Band 1.0 performed much better at the higher velocity/lower load condition, with a slightly higher coefficient of variation (CoV / CV) compared to the motion capture. At the lower velocity/high load condition there was much more variability in the data, with the CV values being significantly greater while deviating much more from the criterion measures. The interesting part is that all of the observed data, including the criterion motion capture, are technically not within the acceptable cutoffs for CV in the literature as discussed in the previous post. In this case none of the tools would be considered “reliable” by statistical standards. I find this rather interesting, and it highlights a potential flaw of using these hard cutoffs for human movement, since the nature of the data could be the source of the variability rather than the tools themselves, as the criterion data would indicate here. Six repetitions were completed at each load, and the 80% condition would correspond to roughly an 8RM, or eight repetitions-to-failure. Six reps at this load would induce fatigue and create velocity loss, increasing the variability in the data for a given condition. The CV cutoffs would make sense if we expected the exact same value for each repetition; however, human movement is never this exact, especially when summing up a dynamic muscular system with many degrees of freedom into a single point in space. This could easily explain the observations here, as the heavier load would induce much more fatigue and this condition has twice the CV across velocity measures for all devices.
Another explanation for the higher CV at lower velocities is that a consistent device error across all velocity ranges would be a higher percentage of the total observed value, driving up the CV since it is a relative measure of variation. Again we see this with all devices in the current study. A different repetition protocol or a set-by-set analysis would be required to tease out how much each of these is driving the great variability observed in the high load / lower velocity condition.
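The two explanations above can be illustrated with a little arithmetic. The Python sketch below uses made-up rep velocities (not data from the study) to show how velocity loss across a heavy set inflates the CV, and how the same absolute device error becomes a larger percentage of a slower velocity. All values and function names are hypothetical.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation: SD as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)

def relative_error_percent(abs_error_ms, velocity_ms):
    """A fixed absolute error expressed as a percentage of the measured velocity."""
    return 100.0 * abs_error_ms / velocity_ms

# Hypothetical six-rep sets (mean velocity per rep, m/s).
light_set = [1.00, 0.99, 0.98, 0.98, 0.97, 0.96]  # little fatigue
heavy_set = [0.50, 0.48, 0.46, 0.44, 0.41, 0.38]  # clear velocity loss

cv_light = cv_percent(light_set)
cv_heavy = cv_percent(heavy_set)

# The same hypothetical 0.05 m/s device error at a fast and a slow velocity.
err_fast = relative_error_percent(0.05, 1.00)
err_slow = relative_error_percent(0.05, 0.50)
```

In this toy example both effects push the heavy condition in the same direction, which is why a different repetition protocol or a set-by-set analysis would be needed to separate them.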
The intraclass correlation coefficients (ICC) showed that both devices were strongly correlated (r > 0.90) with the criterion motion capture (Table 3). In this case a different variation of the ICC was used by comparing the device against the criterion instead of using multiple instances of its own measurement for reliability. This turns it into another measure of concurrent validity instead of reliability.
Looking at the linear regression plot (Figure 4), the trendline also shows a strong correlation to the criterion (r = 0.92). However, here you can visually see the variability discussed earlier in Table 2. While there are clear upper and lower bound limits, ideally you want to see the data points as tightly clustered around the line as possible.
Looking at the Bland-Altman plot we see the average bias was about 0.20 m/s for mean velocity (Figure 5). The upside was that the data fall within the Bland-Altman limits of agreement for the two devices. The downside was the observed bias was the largest seen to date for the PUSH Band 1.0 in the externally published literature. We also weren’t seeing this in our internal data collections. This caused us to investigate further and we found there was an issue with the factory calibration settings. This was a quick fix for us and helped to reduce the amount of bias observed.
There was another potential source of error that needs to be considered within this study. The method of virtual marker use could introduce extraneous error into the motion capture data. A marker is a sticker or a physical sphere, typically reflective in nature so its location can be triangulated using multiple infrared motion capture cameras. A virtual marker is when you create a marker based on the initial calibration to strategically aid in the data collection. Since the motion capture system is calibrated with the known distance and orientation of the physical markers in 3D space related to the anatomical landmarks they were attached to, you can also strategically remove some of the physical markers that cannot be tracked during the collection or digitally create a new one. They can both be mathematically recreated and tracked in post-processing. In this case it was done to solve the problem of the weight plates blocking the camera’s view of the hands and the center of the barbell, and is a logical use of this procedure. The issue, however, is that they only used a single physical marker in tracking the actual motion. This unfortunately tells you nothing about the orientation (tilt) of the bar, since you require three markers in a non-linear pattern to track the orientation of a rigid mass accurately in 3D space. My guess, and I’d be interested to talk to the authors as I am purely speculating here, is that the assumption of the bar remaining horizontal throughout the bench press movement was made. If so this could have definitely introduced some error in the criterion motion capture data.
Appleby et al. (2018) recently showed that placement of barbell tracking devices and markers on the outside of the barbell gives significantly different results compared to the center of the bar. You unfortunately cannot assume they move at the same velocity. The results from their study mean either the bar does tilt throughout the motion or there is a whip effect at the bar ends, both of which can occur depending on the exercise and the stiffness of the barbell being used. Peak velocity should be less affected, but mean values such as those used in McGrath et al. (2018) would be more affected since the whole motion is considered. Since the single marker is aligned with the LPT (similar point-mass in space), but the virtual markers for the arm and bar center may be estimated incorrectly, it could create discrepancies in the motion capture data. It is likely that the bar is not kept perfectly horizontal and possible that the arm changes positions in space. The latter is less likely given the detailed instruction and monitoring of technique that was done by the authors. If you have enough motion capture cameras this problem is easier to solve since you can place them in front of and behind the bench to collect hand and bar center markers directly. I have also used a technique where you place markers on either end cap of the barbell in order to collect the bar ends and compute a barbell center position in post-processing. This is where you go down the rabbit hole of “the art of the science” of quantitative motion analyses. Like many things in our space, it is never as straightforward as it may seem.
Internally, we have also observed that bar tilt and non-linear lifting have a greater effect than our eyes tend to perceive. When multiple PUSH 2.0 Bands are placed on a barbell, the one closest to the LPT produces the most similar output values nearly 100% of the time. Each band that is further away, even by a few inches, produces data with a greater discrepancy. The big takeaway here: when comparing multiple tools for barbell tracking, be sure to place them as close to each other as possible for the best evaluation of the data.
Overall this study told a similar story to the last, just on a new exercise. The lighter load / higher velocity condition fared better statistically. The relative variability in the data and the bias were both greater than expected, being the largest seen to date. This was the important learning for us and led us to identify and solve a calibration problem to help minimize the size and variability of the observed bias. After more internal testing we have also found that some of this error is due to the nature of the bench press exercise itself. We found two primary variations in the way the movement is performed that can affect the data. Some athletes tend to let their shoulders come off the bench with max intent pressing, creating a double impact. Others tend to actively stiffen in order to decelerate the bar, which results in their body not coming off of the bench. The first variation we now call the “splash effect” internally and have found it affects the output data in most barbell measurement devices, LPTs included. For research purposes scientists should be aware of this, and it further supports the use of consistent instruction and demonstration in study controls as discussed herein.
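For readers unfamiliar with the Bland-Altman approach, here is a minimal Python sketch of how the bias and the 95% limits of agreement are typically computed from paired measurements. The sample velocities are hypothetical and were chosen only to mimic an underestimation of about 0.20 m/s; they are not data from the study.

```python
from statistics import mean, stdev

def bland_altman(device, criterion):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of the device-minus-criterion differences; the 95%
    limits of agreement are the bias plus or minus 1.96 SDs of those
    differences.
    """
    diffs = [d - c for d, c in zip(device, criterion)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical mean velocities (m/s) for four repetitions.
device = [0.40, 0.55, 0.62, 0.80]
criterion = [0.60, 0.75, 0.80, 1.02]
bias, lower, upper = bland_altman(device, criterion)
```

Since the limits are built from the differences themselves, most points will fall inside them when the differences are roughly normally distributed, so the size of the bias and the width of the limits are usually the more informative parts of the plot.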
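The end-cap technique mentioned above can be sketched in a few lines of Python. This is a simplified illustration, assuming each marker is an (x, y, z) position in metres with z as the vertical axis; the function names and coordinates are hypothetical and this is not the processing pipeline of any particular motion capture system.

```python
from math import atan2, degrees, hypot

def bar_center(left_end, right_end):
    """Barbell center as the midpoint of the two end-cap markers."""
    return tuple((l + r) / 2.0 for l, r in zip(left_end, right_end))

def bar_tilt_deg(left_end, right_end):
    """Bar tilt from horizontal in degrees (z is assumed vertical)."""
    dx = right_end[0] - left_end[0]
    dy = right_end[1] - left_end[1]
    dz = right_end[2] - left_end[2]
    return degrees(atan2(dz, hypot(dx, dy)))

def vertical_velocity(z_positions, dt_s):
    """Vertical velocity (m/s) by finite differences between frames."""
    return [(b - a) / dt_s for a, b in zip(z_positions, z_positions[1:])]

# Hypothetical single frame: right end is 4 cm higher than the left end.
left = (-1.10, 0.0, 1.00)
right = (1.10, 0.0, 1.04)
center = bar_center(left, right)
tilt = bar_tilt_deg(left, right)

# Hypothetical center heights over three frames captured at 100 Hz.
center_velocity = vertical_velocity([1.020, 1.028, 1.037], dt_s=0.01)
```

Even a tilt of about one degree puts the two bar ends several centimetres apart vertically, which is why a marker or device at one end cannot be assumed to move like the bar center.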
First study to assess Band 1.0 in a freeweight barbell bench press
Use of 3D motion capture for criterion measures
Comprehensive statistical analysis primarily for validity
Robust description and use of instruction and demonstration
Methodology of virtual marker use could introduce extraneous error
Band 1.0 shows underestimation bias in mean velocity data
Band 1.0 shows strong intraclass and Pearson correlations to motion capture
Band 1.0 data falls within the Bland-Altman limits of agreement
Band 1.0 data is more variable at the lower velocity / high load condition
Discovery and resolution of a factory calibration issue leading to greater amplitude and variation in the observed bias
6. VALIDITY AND RELIABILITY OF A WEARABLE INERTIAL SENSOR TO MEASURE VELOCITY AND POWER IN THE BACK SQUAT AND BENCH PRESS
Orange, S.T., Metcalfe, J.W., Liefeith, A., Marshall, P., Madden, L.A., Fewster, C.R., and R.V. Vince. (2018) - Validity and reliability of a wearable inertial sensor to measure velocity and power in the back squat and bench press. Journal of Strength & Conditioning Research. May 2018 - Ahead of Print.
The last study in this article comes from Sam Orange at the University of Hull. Sam has done some novel work looking at power measures in activities of daily living with elderly/clinical populations. Measuring and training speed and power is not exclusively for athletes, as most people can benefit from lifting with intent. This study is very similar to, and almost a combination of, the last two, looking at both a freeweight barbell back squat and bench press and comparing the Band 1.0 to a commercially available LPT as the criterion measure. A test-retest (repeated measures) reliability study was conducted. This was also the first study done on a younger demographic, with the participants being youth rugby league players. Don’t let the age fool you though: these teens had an average relative squat and bench of 1.71 and 1.18 times their bodyweight, respectively. The testing protocol was the same as Banyard et al. (2017) with relative loads of 20%, 40%, 60%, 80%, 90% and 100% of 1RM performed over two sessions.
Mean and peak velocity and power were examined. Average values were plotted for all datapoints at each load condition in the back squat and bench press for both devices (Figure 7). Qualitatively the data look similar to Banyard et al. (2017), with both tools displaying similar trendlines. Greater bias was observed in the back squat data. Power values have larger discrepancies in the trendlines. More variability and less consistent standard deviations were observed in the bench press data. Looking at the absolute reliability for the Band 1.0 (Table 2), we see a standard error of measurement (SEM) hovering around the +/- one repetition variance range (0.05 m/s). The SEM does increase as loads get lighter, especially for the bench press. As mentioned in the last study, the splash effect could be responsible for this since it happens more at lighter loads / higher velocities. However, this wasn’t observed in the McGrath et al. (2018) assessment of the bench press, as they didn’t report SEM. They reported relative variation values using CV, and greater error was observed at heavier loads. Again this makes sense since the error is a much higher percentage of the actual observed values. Here we see the absolute error being greater at lower loads / higher velocities, but it affects the data less since it is a lower percentage of the total values.
A new statistic was introduced here that we had not observed to date in Band 1.0 validation, the smallest worthwhile change (SWC). In this case it was computed as the between-subject standard deviation multiplied by 0.2, with that value relating to a small, but not trivial, effect size using Hopkins’ modified Cohen’s d. In some cases 0.3 is used, which would increase this value. In practice, this represents the smallest value one can meaningfully infer to determine that an actual change happened beyond random noise. There are some caveats to using this value. In all cases observed here, the SEM is greater than the SWC, therefore the SWC is effectively moot because the tool’s error is greater than the change it is meant to detect. SEM becomes the absolute minimum value for consideration of a potentially meaningful change. The same principle can be used with CV if that value is higher than the SEM or SWC. While CV values were not reported in this study, chances are, based on all of the previous data, that this value would be larger as well. Personally I prefer using the SEM since there are some issues with using the absolute CV in this type of study as discussed above and in the previous post. However, one should still be aware of all potential sources of noise within their measures.
Taking a deeper dive into the data, a comprehensive statistical analysis was used starting with the basics, looking at the Pearson product-moment correlations and Bland-Altman limits of agreement between the two devices to assess validity. Additionally the intraclass correlation (ICC) and relative standard error of measurement (SEM%) were used to assess reliability. One thing to note is that the LPT used in this study is not a gold standard tool, so this is more of a device comparison than a true validation. The primary author has written another paper using similar methods to assess this LPT and it is far from perfect in its performance as well. I still run into many practitioners who think various LPTs are gold standard measures. Not all LPTs are equal and some are definitely better than others, but that is beyond the scope of this article.
Figure 8: PUSH Band 1.0 results for back squat and bench press measurements of mean velocity (top left), peak velocity (top right), mean power (bottom left) and peak power (bottom right)
There were some interesting trends observed in the specific metric plots (Figure 8). First, we see more underestimation bias in the squat data than we do with the bench data, which is the opposite of what we have seen in previous data. In contrast, similar to the previous work, the data here show more relative error at higher loads in the velocity data, as discussed above. Also, the power data don’t align very well between the two devices, but again this wasn’t a concern for us. We use different mass models under the hood which will definitely cause discrepancies in the power and force data outputs. Overall, the data in this study weren’t overly favourable for the Band 1.0 and there were plenty of inconsistencies in the advanced statistical analysis of the individual metrics.
At the same time as the publication of the last two studies, we were conducting our first round of internal validations under our new process at the University of Toronto Biomechanics Lab. Comparing all metrics against the gold standard, we also managed to tag onto my colleague’s study looking at bench press in athletes and powerlifters. Through these first rounds of investigation we gained some great insight into some key limitations of the PUSH Band 1.0. While it was still a viable tool for certain use cases in the daily training environment, it wasn’t up to the level of accuracy I was looking for, as the data above would suggest. This caused us to go back to the drawing board and look for some new solutions to solve the inconsistencies we were observing and bring us closer to the gold standard. This is where we started exploring creating a next-generation device with new sensors. The first-generation Band 1.0 was 5 years old at this point, and new hardware would give us much better raw data to work with.
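The relationship between SWC and SEM described above can be sketched in Python as follows. The SWC follows the definition given in the text (0.2 times the between-subject SD). The SEM formula shown (SD times the square root of 1 minus ICC) is a common formulation and is an assumption here, since this post does not state how the study computed it. All sample values and function names are hypothetical.

```python
from math import sqrt
from statistics import stdev

def smallest_worthwhile_change(subject_means, factor=0.2):
    """SWC as a fraction (0.2 by default) of the between-subject SD."""
    return factor * stdev(subject_means)

def sem_from_icc(between_subject_sd, icc):
    """Assumed common formulation: SEM = SD * sqrt(1 - ICC)."""
    return between_subject_sd * sqrt(1.0 - icc)

def minimum_meaningful_change(sem, swc):
    """The larger of the tool's error and the SWC sets the practical floor."""
    return max(sem, swc)

# Hypothetical mean velocities (m/s) for five athletes at one load.
subject_means = [0.70, 0.75, 0.80, 0.85, 0.90]
swc = smallest_worthwhile_change(subject_means)

# Using the roughly 0.05 m/s SEM quoted in the text as the tool's error.
floor = minimum_meaningful_change(0.05, swc)

# Illustration of the assumed SEM formula with a hypothetical ICC of 0.75.
sem_example = sem_from_icc(stdev(subject_means), icc=0.75)
```

With these made-up numbers the SWC is well below 0.05 m/s, so the SEM sets the floor, mirroring the situation described above where the tool’s error exceeds the SWC.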
Examined two separate lifts using the same methodologies: freeweight barbell back squat and bench press
Separated data into relative 1RM percentages to assess a full spectrum of loading scenarios
Robust statistical model for validity and reliability
Commercial LPT system for criterion measures of validity - comparison study versus true validation against gold standard
Greater absolute error observed at lower load / higher velocity conditions, but greater relative error observed at higher load / lower velocity conditions
Larger bias observed in the squat exercise compared to bench press
Non-systematic inconsistencies across numerous individual variables
Given the results of the last three studies combined with internal data leading us to find some inherent limitations - it was time to explore adding better sensors to the hardware in order to increase device performance and improve the data being collected at the hardware level
CONCLUSIONS AFTER ROUND 2…
The 3 studies discussed above on the PUSH Band 1.0 used more advanced statistical analysis to poke some larger holes in the performance of the device. This was done using two freeweight barbell exercises that are primarily used in velocity-based training. These results were in contrast to what we observed in the first block of studies, which gave us nothing but positive reinforcement. Still, these papers helped us find some errors within the software and algorithms to make our data tighter. In addition, the introduction of a new internal scientific process against gold standard methodologies, reproducing what was observed above, led us to start looking at new hardware solutions to deal with some inherent limitations of the Band 1.0. These studies were published 5 years after the Band 1.0’s inception (think of how many phone versions have come out in that time), and with the advances in Bluetooth, battery and sensor technology it was time to start looking at making a next-generation device. This round of research definitely wasn’t as positive as the previous round, but the external feedback was extremely valuable nonetheless, challenging us to adapt and improve quickly.
Next post we will look at studies 7-9. Stay tuned!