Machine Learning & CrossFit - Perfect Match?

How can Deep Learning methods be used to analyse CrossFit workout videos?

Let us jump straight to the results and discuss the details afterwards. Below you can see Kristin Holte's detailed Event 3 – 'Damn Diane' breakdown, generated with machine learning methods.

Behind the scenes

It all started with Tamás and me wanting to better understand what the breakdown times of Event 3 – 'Damn Diane' at the 2020 CrossFit Games look like. In a nutshell, this workout consists of 3 rounds of 15 deadlifts and 15 strict deficit handstand push-ups.

As the chart below shows, the finish times vary widely, even though we are talking about 60 elite athletes.


Our hypothesis was that the main driver behind these big differences is the strict deficit handstand push-up.

We were curious about

  • how drastically the time spent on the handstand push-ups increased from round to round,
  • how the athletes broke up the 15 repetitions,
  • whether they broke them up from the first round on.

First attempt – Manually recording the breakdown times

We spent a fair amount of time watching the full-workout videos of Event 3, published by the official CrossFit Games YouTube channel, and manually recording the finish time of each movement in each round.

You can see the result below. Not surprisingly, the hypothesis proved to be true.


This chart shows just the top and bottom 3 athletes among those whose full-workout videos have been released. You can inspect the full list under the Charts/Damn Diane tab.

Second attempt – Using Deep Learning methods to do the job instead of us

The question was: could we automate this? The answer turned out to be yes.

Let us go through the steps using Kristin Holte's video as an example.

Pose Estimation with AlphaPose

The first step was to estimate the poses of the athlete, i.e., to localize her body joints (also known as keypoints: ears, eyes, elbows, knees, ankles, etc.) in every frame (image) of the full-workout video.

Since there are plenty of other people in the video, we needed a multi-person pose estimator. We used the AlphaPose estimation system for this purpose. You can see a fragment of the results below.


Pose Tracking with Pose Flow

So far, so good. The keypoint estimation is done for (almost) all people in each individual frame (image). The next step is to track the people from frame to frame. We used Pose Flow for this purpose. You can see a fragment of the results below. The different colors represent different people. You might notice that the color of some people changes over time, which means the tracking failed for them (because of overlapping people or other reasons).

Kristin Holte's tracking was successful, even though she was out of sight for a couple of seconds because the cameraman blocked the view.


Athlete Identification

The output of the previous step is a numerical ID assigned to each set of keypoints detected across the frames. In other words, we know the pose of Person #1, Person #2, ... Person #N in each frame.

How do we know which ID represents the person we are interested in, Kristin Holte?

In general this is an entirely task-dependent problem, but here we are in an advantageous situation. In the Event 3 – 'Damn Diane' full-workout videos, the athlete is the only person who is sometimes upside-down, so we can identify her by that property.
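The upside-down check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code we ran: the (frame, track ID, keypoints) record layout, the function name and the min_frames threshold are hypothetical, the keypoint indices assume the common 17-keypoint COCO ordering, and pixel y-coordinates are assumed to grow downward, so an ankle above the ear has the smaller y value.

```python
from collections import defaultdict

# Indices in the 17-keypoint COCO ordering (an assumption about the model output).
RIGHT_EAR, RIGHT_ANKLE = 4, 16


def find_inverted_track(detections, min_frames=5):
    """Return the track ID that is upside-down in the most frames, or None.

    `detections` is an iterable of (frame_index, track_id, keypoints), where
    keypoints is a list of 17 (x, y) pixel pairs. This layout is hypothetical.
    """
    inverted_frames = defaultdict(int)
    for _frame, track_id, keypoints in detections:
        ear_y = keypoints[RIGHT_EAR][1]
        ankle_y = keypoints[RIGHT_ANKLE][1]
        # Pixel y grows downward, so an ankle above the ear has the smaller y.
        if ankle_y < ear_y:
            inverted_frames[track_id] += 1
    # Ignore tracks that are inverted in only a few (possibly noisy) frames.
    candidates = {t: n for t, n in inverted_frames.items() if n >= min_frames}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Requiring a minimum number of inverted frames is a design choice that guards against single-frame pose errors, where a mis-detected ankle could briefly make a spectator look upside-down.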

Movement Detection

We have now identified Kristin, i.e., we know the numerical ID assigned to her. In our case she was detected and tracked as Person #4. To identify the exact movement she performs (deadlift, handstand push-up, or break/transition), we select two of the 17 keypoints, the Right Ear and the Right Ankle, and look at the time-series of their pixel-wise y-coordinates.


Zooming into the first 45 seconds of the video, we can clearly see three things: how her right ear moved up and down during the deadlifts; the moment she swung into the handstand position, when the y-coordinate of the right ankle crossed that of the right ear and the ankle ended up above the ear; and how her ear and ankle moved up and down during the handstand push-ups. The breaks and transitions are clearly recognizable as well.

Running a peak-finding algorithm on the time-series of the Right Ear's y-coordinate (more precisely, on its smoothed moving average), we can detect the timestamp at which each movement was performed. To decide whether it was a deadlift or a handstand push-up, we simply check whether Kristin was upside-down at that time.
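A minimal sketch of this step, using only the Python standard library, could look as follows. It is a simplified stand-in rather than our exact implementation: the function names, the window size and the min_rise and min_distance thresholds are hypothetical, and in practice a library routine such as scipy.signal.find_peaks could replace the hand-written peak finder. The sketch assumes pixel y-coordinates that grow downward, so it negates the ear's y-coordinate and treats the highest head position as the peak of a repetition.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def find_local_peaks(values, min_rise=10.0, min_distance=5):
    """Indices of local maxima that rise at least `min_rise` above the lowest
    value within `min_distance` samples on each side."""
    peaks = []
    for i in range(1, len(values) - 1):
        if not (values[i - 1] < values[i] >= values[i + 1]):
            continue
        left = values[max(0, i - min_distance): i]
        right = values[i + 1: i + 1 + min_distance]
        if values[i] - min(left) < min_rise or values[i] - min(right) < min_rise:
            continue
        # Keep accepted peaks at least `min_distance` samples apart.
        if peaks and i - peaks[-1] < min_distance:
            continue
        peaks.append(i)
    return peaks


def detect_reps(ear_y, ankle_y, fps=30.0, window=5):
    """Return (timestamp_seconds, movement) for every detected repetition."""
    # Negate y so that the highest head position becomes a peak.
    heights = moving_average([-y for y in ear_y], window)
    reps = []
    for i in find_local_peaks(heights):
        # Upside-down means the ankle is above the ear (smaller pixel y).
        inverted = ankle_y[i] < ear_y[i]
        reps.append((i / fps, "handstand push-up" if inverted else "deadlift"))
    return reps
```

The thresholds would need tuning per video, since the pixel amplitude of a repetition depends on the camera distance and the resolution.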

Finally, we arrived at the point where we can study the detailed timeline of how Kristin Holte performed the entire Event 3 – 'Damn Diane' workout.

While the manually recorded version showed only the duration of each set of movements per round, now we can clearly see the rest and transition times, and how each block of 15 repetitions was broken up.
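One possible way to turn the detected repetitions into such a timeline is sketched below. It assumes the repetitions are available as (timestamp in seconds, movement) pairs; the function names and the max_gap threshold that separates a short pause within a set from a real break are hypothetical.

```python
def group_into_sets(reps, max_gap=3.0):
    """Group (timestamp_seconds, movement) reps into unbroken sets.

    A new set starts when the movement changes or when the pause since the
    previous rep exceeds `max_gap` seconds (a hypothetical threshold).
    """
    sets = []
    for time, movement in reps:
        current = sets[-1] if sets else None
        if (current is None or current["movement"] != movement
                or time - current["end"] > max_gap):
            sets.append({"movement": movement, "start": time,
                         "end": time, "reps": 1})
        else:
            current["end"] = time
            current["reps"] += 1
    return sets


def rest_periods(sets):
    """Seconds between the last rep of one set and the first rep of the next."""
    return [nxt["start"] - cur["end"] for cur, nxt in zip(sets, sets[1:])]
```

Note that a gap measured this way lumps rest and transition together and also includes the time needed to complete the first repetition of the next set, so it is an upper bound on the actual break.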


This experiment provides a bit of empirical evidence that Machine Learning and Computer Vision methods are applicable and beneficial for analysing CrossFit workout videos.

We applied the pre-trained AlphaPose models and used just 2 of the 17 keypoints for movement detection. Combining the time-series of multiple keypoints would most likely lead to more robust and more general solutions.


by Katinka Páll on 09 Oct, 2020