My AI-Powered Bowling Machine
Overview
Cricket is passionately followed by a huge population across the subcontinent. However, unlike many other sports, cricket coaching still makes little use of data analysis to improve a batsman's skill. I designed an intelligent bowling machine that detects not only which shot a batsman has played but also whether the shot was played correctly. It also points out the mistake made by the batsman, using multi-task learning. I spent many months building the mechanical structure of the machine and experimenting with deep learning CNNs to classify cricket shots. This page covers some of the technical details of the machine.
Schematic of the bowling machine. I used an Arduino Mega; servo motors control the wheels to adjust the speed and swing of the ball; stepper motors move the X and Y axes to control the line and length of the ball. The machine is controlled via a Bluetooth module that connects to a mobile app running on a phone mounted on the device. The phone's camera also captures video of the batsman and runs video analytics using an LSTM. The coordinates, speed and swing of the next ball are decided by a rules engine and passed back to the machine over Bluetooth.
Design and Methodology
​
I started by dividing the problem into multiple stages. The first stage was predicting (classifying) the type of shot played by the batsman. To keep things simple, I picked only four classes of shots:
​
- C1: Pull shot
- C2: Cover drive
- C3: Sweep
- C4: Cut
​
I approached the problem as supervised learning, that is, I trained a machine learning model on a labeled set of videos. Since the data is temporal and sequential, I used LSTM (Long Short-Term Memory) networks for classification.
A naive approach would have been to use a deep CNN pre-trained on, e.g., ImageNet as the backbone to extract features from each frame, and then train an LSTM over the extracted features. However, such a design would suffer from many limitations, including sensitivity to view variations, the batsman/batswoman's height, clothing/team kit, pads, changes in lighting and reflections, camera zoom, background, etc. Overcoming these limitations would require a substantial amount of data, and collecting and tagging that much data is time-consuming.
​
Instead, based on an article by Ayinaparthi, I divided the problem into two steps. The first step is to identify the pose of the batsman/batswoman in each frame. As a proxy for the pose, I predict human body keypoints using MediaPipe's Pose Landmark Detector. I extracted landmarks for 17 different parts of the body, which are needed to estimate the pose accurately.
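As a rough illustration of the landmark selection step: MediaPipe's pose model returns 33 landmarks per frame, so picking 17 of them amounts to indexing into that output. The sketch below assumes a COCO-style subset (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles); the text above does not say which 17 landmarks were actually used, and `SELECTED_LANDMARKS` and `select_keypoints` are hypothetical names. Random numbers stand in for real detector output.

```python
import numpy as np

# MediaPipe Pose returns 33 landmarks per frame. The 17 indices below are an
# ASSUMED COCO-style subset (nose, eyes, ears, shoulders, elbows, wrists,
# hips, knees, ankles); the write-up does not list which 17 were used.
SELECTED_LANDMARKS = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]


def select_keypoints(landmarks_xy: np.ndarray) -> np.ndarray:
    """Keep 17 of the 33 pose landmarks for one frame.

    landmarks_xy: array of shape (33, 2) holding (x, y) per landmark.
    Returns an array of shape (17, 2).
    """
    if landmarks_xy.shape != (33, 2):
        raise ValueError(f"expected shape (33, 2), got {landmarks_xy.shape}")
    return landmarks_xy[SELECTED_LANDMARKS]


# Stand-in for one frame of detector output (random values, not real poses).
frame_landmarks = np.random.default_rng(0).random((33, 2))
keypoints = select_keypoints(frame_landmarks)
print(keypoints.shape)  # (17, 2)
```

Stacking the result for every frame of a video gives an array of shape (T, 17, 2), which is the input assumed by the later steps.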
Corresponding keypoints in consecutive frames are subtracted to compute the change in keypoints as the batsman moves. Although the predicted keypoints are in image coordinates, this change is independent of where the batsman is in the image.
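The subtraction described above can be sketched in a few lines of NumPy. This is a minimal example under assumptions: keypoints are stored as a (T, K, 2) array of (x, y) values, and each frame's changes are flattened into one feature vector. The function name `keypoint_deltas` is hypothetical. The assertion at the end checks the position-independence claim: shifting the whole batsman by a constant offset leaves the deltas unchanged.

```python
import numpy as np


def keypoint_deltas(keypoints: np.ndarray) -> np.ndarray:
    """Frame-to-frame keypoint changes.

    keypoints: array of shape (T, K, 2), T frames with K (x, y) keypoints.
    Returns an array of shape (T - 1, K * 2), one flattened row per
    consecutive pair of frames (flattening is an assumption of this sketch).
    """
    deltas = keypoints[1:] - keypoints[:-1]
    return deltas.reshape(deltas.shape[0], -1)


rng = np.random.default_rng(0)
video = rng.random((5, 17, 2))            # stand-in for 5 frames of 17 keypoints
shifted = video + np.array([0.3, -0.1])   # same motion, different image position

# The deltas do not depend on where the batsman stands in the image.
assert np.allclose(keypoint_deltas(video), keypoint_deltas(shifted))
print(keypoint_deltas(video).shape)  # (4, 34)
```

Note that this removes dependence on translation only; changes in scale or camera angle still affect the deltas.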
​
The second step is to train an LSTM over these keypoint changes, where T is the total number of frames in the video. The hypothesis is that such an LSTM will learn how variations in pose change are associated with the shot type or quality.
​
Datasets:
Initial Dataset
​
We collected videos for C1, C2, C3 and C4. Each video shows a batsman playing exactly one of the shots.
- C1: Pull shot
- C2: Cover drive
- C3: Sweep
- C4: Cut
​
Second Dataset
In the second dataset, more detailed annotations were performed. A batsman was instructed to play the following shots, each in the "correct" way and, in some cases, in a "wrong" way. For the "wrong" way, we also collected separate videos in different styles for some shots.
​
- C1: Pull shot
- C2: Cover drive
- C3: Sweep
- C4: Cut
​
Experiments
​
Training videos have different numbers of frames, so we trimmed longer videos from the start, and for shorter videos we replicated the first frame N - T times, so that every video ends up with exactly N frames. Here T is the number of frames in a video and N is the sequence length for our LSTM model.
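The length normalization can be sketched as follows. Two details are my reading of the description rather than something it states outright: "trimming from the start" is taken to mean dropping the earliest frames, and the replicated first frames are placed at the front of the sequence. The helper name `fit_to_length` is hypothetical.

```python
import numpy as np


def fit_to_length(frames: np.ndarray, n: int) -> np.ndarray:
    """Return exactly n frames from an array of shape (T, ...)."""
    t = frames.shape[0]
    if t == 0:
        raise ValueError("video has no frames")
    if t >= n:
        # Trim from the start: drop the earliest T - N frames (assumed reading).
        return frames[t - n:]
    # Pad at the front with N - T copies of the first frame (assumed position).
    padding = np.repeat(frames[:1], n - t, axis=0)
    return np.concatenate([padding, frames], axis=0)


short_video = np.arange(3).reshape(3, 1)   # frames 0, 1, 2
long_video = np.arange(7).reshape(7, 1)    # frames 0 .. 6

print(fit_to_length(short_video, 5).ravel())  # [0 0 0 1 2]
print(fit_to_length(long_video, 5).ravel())   # [2 3 4 5 6]
```

If the shot tends to happen early in a clip, keeping the first N frames instead would be the safer variant.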
For training our model, we set N = 50, since a cricket shot can typically be completed within about 50 frames (2 s). Our dataset has 41 videos in total, which we split into a training set of 30 videos and a test set of 11 videos. Because the dataset is so small, we also used the test set as the validation set. We trained the model for 40 epochs using the Adam optimizer with a learning rate of 0.001.
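A training setup with these hyperparameters might look like the PyTorch sketch below. Only N = 50, the 30/11 split, the four classes, 40 epochs and Adam with a learning rate of 0.001 come from the text. The framework, the 34-dimensional input (17 keypoints times x and y), the hidden size, the single LSTM layer and the full-batch updates are all assumptions, `ShotClassifier` is a hypothetical name, and random tensors stand in for the real keypoint data.

```python
import torch
from torch import nn

# 34 = 17 keypoints x (x, y) is an assumption; N = 50 and 4 classes are from the text.
N_FRAMES, N_FEATURES, N_CLASSES = 50, 34, 4


class ShotClassifier(nn.Module):
    def __init__(self, hidden_size: int = 64):  # hidden size is a placeholder
        super().__init__()
        self.lstm = nn.LSTM(N_FEATURES, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, N_CLASSES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)      # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])       # logits: (batch, N_CLASSES)


torch.manual_seed(0)
# Random stand-in data with the split sizes from the text (30 train, 11 test).
x_train = torch.randn(30, N_FRAMES, N_FEATURES)
y_train = torch.randint(0, N_CLASSES, (30,))
x_test = torch.randn(11, N_FRAMES, N_FEATURES)

model = ShotClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(40):
    model.train()
    optimizer.zero_grad()
    # One full-batch step per epoch; the batch size is not given in the text.
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    test_logits = model(x_test)
print(test_logits.shape)  # torch.Size([11, 4])
```

With only 30 training videos, reusing the test set for validation (as described above) means the reported accuracy is likely optimistic.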
​
To predict the correctness of a shot, the design stays the same; we simply add a second task that learns whether the shot was played correctly. This is multi-task learning, which forces the network to learn features that are important for both tasks. The design diagram is below.
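One common way to set up this kind of multi-task model is a shared LSTM with one output head per task, trained on the sum of the two losses. The sketch below follows that pattern; it is not taken from the design diagram. The layer sizes, the binary correct/incorrect head, the equal loss weighting and the name `MultiTaskShotModel` are assumptions, and the inputs are random stand-ins.

```python
import torch
from torch import nn


class MultiTaskShotModel(nn.Module):
    """Shared LSTM encoder with two heads (sizes are placeholder assumptions)."""

    def __init__(self, n_features: int = 34, hidden_size: int = 64, n_shots: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.shot_head = nn.Linear(hidden_size, n_shots)   # which shot was played
        self.correct_head = nn.Linear(hidden_size, 1)      # played correctly or not

    def forward(self, x: torch.Tensor):
        _, (h_n, _) = self.lstm(x)
        shared = h_n[-1]  # features shared by both tasks
        return self.shot_head(shared), self.correct_head(shared).squeeze(-1)


torch.manual_seed(0)
x = torch.randn(8, 50, 34)                           # 8 stand-in videos
shot_labels = torch.randint(0, 4, (8,))              # C1..C4 as 0..3
correct_labels = torch.randint(0, 2, (8,)).float()   # 1 = correct, 0 = wrong

model = MultiTaskShotModel()
shot_logits, correct_logits = model(x)

# Equal weighting of the two losses is an assumption of this sketch.
loss = nn.functional.cross_entropy(shot_logits, shot_labels) \
    + nn.functional.binary_cross_entropy_with_logits(correct_logits, correct_labels)
loss.backward()  # gradients from both tasks flow into the shared LSTM
print(shot_logits.shape, correct_logits.shape)
```

Because both losses backpropagate through the same LSTM, its weights are pushed toward features that help with shot type and correctness at once.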
​
Limitation
Our keypoint estimation is not yet normalized to be invariant to large variations in camera view.
What we have not done:
​
- Predicted the start and end of the shot.
- Identified the batsman so that the system could be deployed in a multi-player scenario.
  - This will require detecting the batsman/batswoman and then tracking them across the video.
- Localized the bat or used information about the bat for shot classification.
Predicting the Error Made by Batsman:
​
We use multi-task learning to predict what error the player makes while executing the shot. This is still a work in progress, as we are collecting more data to refine our outputs for the following:
- Batting stance
- Balance
- Head position
- Feet movement
- Bat close to or away from the body
- Straight, angled or cross-bat shot
- Bat speed
- High elbows
- Shot classification
Conclusion
​
We trained a deep-learning-based neural network for cricket shot classification. For that, we built two datasets: the first from online videos (e.g. YouTube), and the second from recordings in which a batsman was asked to perform different shots, including some played incorrectly. Our algorithm extracts the pose from each frame, and the changes in pose between consecutive frames are used to train an LSTM. The approach reached reasonable accuracy, and work continues on multi-task learning ...
​
​