Human Action Recognition (HAR) Classification Using MediaPipe and Long Short-Term Memory (LSTM)

Human Action Recognition is an important research topic in the Machine Learning and Computer Vision domains. One of the proposed methods is a combination of the MediaPipe library and Long Short-Term Memory, with testing accuracy and training duration as the indicators used to evaluate model performance. This research tried to adapt proposed LSTM models to implement HAR with image features extracted by the MediaPipe library, and compared the LSTM models based on their testing accuracy and training duration. The research was conducted under the OSEMN method (Obtain, Scrub, Explore, Model, and iNterpret). The dataset was the Weizmann dataset with data preprocessing and data augmentation applied. Video features extracted by MediaPipe: Pose were used in the training and validation processes on neural network models focusing on Long Short-Term Memory layers. The processes were finished by model performance evaluation based on confusion matrix interpretation and calculations of accuracy, error rate, precision, recall, and F1-score. This research yielded seven LSTM model variants, with the highest testing accuracy at 82%, taking 10 minutes and 50 seconds of training.


Introduction
The development of computer technology in the machine learning and computer vision domains is still far from done. The uniqueness of imagery data from various sources makes them interesting research material. Information gathering from images can be done through Human Action Recognition (HAR). HAR is an important issue due to its various implementations, e.g., surveillance videos, human-machine interaction, and other ways of gathering information from videos (Cheng et al., 2015). A proposed method uses a combination of MediaPipe as an image feature detector and extractor along with Long Short-Term Memory (LSTM) as an identifier or classifier. This combination can be found in Hand Gesture Recognition (HGR) research (Agrawal et al., 2022; Ghosh, 2021; Lakkapragada et al., 2022; Moetia Putri & Fuadi, 2022). HAR can also be conducted with similar methods (Daniel Tanugraha et al., 2022). Previous research in HAR by Zhang et al. (2017) used the NTU, SBU, and SYSU datasets, consisting of sequences of 3D skeleton data. These data were used as input for an LSTM model constructed of 3 LSTM layers with a fully-connected layer as a classifier. This implementation resulted in accuracies of 87.6%, 97.2%, and 77.5% for the NTU, SBU, and SYSU datasets, respectively.
Ghosh conducted research with a classification into 5 classes from a dataset of 126 videos (Ghosh, 2021). The video features were extracted by MediaPipe: Hands with the z-axis ignored. The LSTM architecture consisted of 2 LSTM layers, 2 dropout layers, 1 flatten layer, and 1 dense layer. This research resulted in an accuracy of 94%.
With one layer each for the LSTM, dropout, and dense layers, the research in (Lakkapragada et al., 2022) obtained a model with a testing accuracy of 69.55%. The input data were gathered from the extraction of the Self-Stimulatory Behavior Dataset (SSBD) with MediaPipe.
Research on Sports Action Recognition Based on Long Short-Term Memory Using MediaPipe (Daniel Tanugraha et al., 2022) showed that their LSTM model needed a training time of 10 to 12 minutes. This model used the RNN for Human Activity Recognition 2D dataset for its training phase. The validation accuracies for the T-Pose, Warrior II Pose, and Tree Pose were 100%, 85%, and 80%, respectively. Agrawal et al. (2022) conducted HGR research on 10 gestures using MediaPipe: Holistic. Their model, which implemented 4 LSTM layers and 3 Dense layers, resulted in testing and validation accuracies of 90%.
Another HGR study using LSTM was conducted by Putri et al. (2022). They used the BISINDO gesture dataset consisting of 30 vocabularies. The gestures were extracted with MediaPipe: Holistic before being set as input data for 3 variants of LSTM models, i.e., 1-layer LSTM, 2-layer LSTM, and Bidirectional LSTM. The highest accuracies reached 94%, 97%, and 96% for the 1-layer LSTM, 2-layer LSTM, and Bidirectional LSTM, respectively.
Those studies showed that detection accuracy was the primary indicator in model evaluation. However, the training time, an essential factor in Deep Learning model construction (Sarker, 2021), was reported only in (Daniel Tanugraha et al., 2022). Other research in Deep Learning showed that the accuracy value and training time could be used together to determine the best model (Codreanu et al., 2017; Tan & Le, 2021).
Based on this condition, this research tried to adapt the LSTM models (Ghosh, 2021; Lakkapragada et al., 2022; Zhang et al., 2017) for HAR implementation with landmark data extracted with the MediaPipe library (Google LLC, 2020). We also made an accuracy and training time comparison between those models and our self-constructed models based on parameters recommended by prior research (Reimers & Gurevych, 2017).

Research Method

Research Tools Specification
This research was conducted on a Lenovo Ideapad 100-14IBD laptop powered by an Intel Core i3-5005U processor, Intel HD Graphics 5500 VGA, and 6 GB of RAM. On the software side, this research was implemented on the Microsoft Windows 10 Pro 64-bit operating system, with Jupyter Notebook 6.4.5 in Anaconda 3 as the IDE, the Python 3.9 programming language, and Kdenlive 21.12.3 as the video editing application.

Research Workflow
The workflow for this research adopted the OSEMN model. OSEMN (read 'awesome') is a model or method in data science introduced by Mason and Wiggins (Mason & Wiggins, 2010). OSEMN consists of chronological steps called Obtain, Scrub, Explore, Model, and iNterpret. Since the OSEMN scheme is commonly implemented with a customized sequence (Janssens, 2021), the research method followed the workflow shown in Figure 1.

Obtain-1
The dataset used in this research was the Weizmann dataset (Gorelick et al., 2007), which was downloaded directly from its official website at https://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html, in the "Classification Database" section. It has 336 MB of compressed size or 454 MB in uncompressed form.
The Weizmann dataset consists of 10 classes with a total of 93 videos. Each class has 9 videos (the jack, jump, pjump, side, wave2, wave1, and bend classes) or 10 videos (the run, walk, and skip classes). The videos are in AVI format with 1 to 3 seconds of duration, a frame rate of 25 fps, a frame size of 180 × 144 pixels, and are performed by 9 actors.

Scrub-1
As a prerequisite, the dataset used in this research was gathered from the first 25 frames of each video. However, there was a video titled "ira_bend.avi" that did not fulfill the mentioned prerequisite, hence the preprocessing technique (Minh et al., 2018) called feature selection (Beniwal et al., 2012) was applied to that video. This technique included a duration cut applied to the first 20 frames, so that a video representative of its class was obtained. Figure 2 shows an example of the duration cutting technique.
The Weizmann dataset was known for having 10 classes with 9 to 10 videos each. Due to the small size of the dataset (Wang et al., 2017) for a 60:20:20 division ratio and the imbalanced data distribution, this research also applied data augmentation techniques to the dataset through the Kdenlive video editor. These techniques were mirroring, zooming, translation, and their combinations (Verdhan, 2021). All videos were augmented 1, 2, or 4 times. Table 1 shows an example of the data augmentation processes.
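As an illustration, the mirroring augmentation could also be reproduced programmatically. The augmentation in this research was done in the Kdenlive editor rather than in code, so the following OpenCV sketch is only an assumed equivalent; the file names and codec are likewise assumptions.

```python
# Illustrative sketch only: this research used Kdenlive for augmentation.
# Here, one video is horizontally mirrored with OpenCV as an equivalent step.
import cv2

def mirror_video(src_path: str, dst_path: str) -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"XVID"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.flip(frame, 1))  # flip around the vertical axis (mirror)
    cap.release()
    writer.release()

# Assumed file names for illustration
mirror_video("daria_bend.avi", "daria_bend_mirrored.avi")
```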

Explore-1
After Scrub-1 was done, the dataset had 450 videos (45 videos per class) with 461 MB of data size. Since the data ratio was divided into 60 : 20 : 20 for training : validation : testing, the training data consisted of 270 videos (27 videos per class), the validation data consisted of 90 videos (9 videos per class), and the testing data had the same number of videos as the validation data that was 90 videos (9 videos per class).

Obtain-2
In Obtain-2, there were detection processes and video feature extraction using MediaPipe: Pose library. The detection phase was done by first converting video frames from BGR to RGB format using OpenCV (Bradski, 2000). These frames were then used for landmarks detection by MediaPipe.
The gathered data were 33 landmark points or key points, with x, y, and z values at each point, multiplied by the number of videos. Those data were saved as 25 NumPy array files (.npy) for each video. Figure 3 illustrates the process of Obtain-2.
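A minimal sketch of this extraction step, assuming the standard OpenCV and MediaPipe Pose APIs, is shown below. The text lists x, y, and z values per landmark; the sketch also keeps MediaPipe's visibility score, since 33 landmarks × 4 values gives the 132 features per frame implied by the (25, 132) input shape described later. The file and directory layout are assumptions.

```python
# Sketch of Obtain-2: detect pose landmarks per frame and save one .npy per frame.
import os
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_landmarks(video_path: str, out_dir: str, n_frames: int = 25) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    with mp_pose.Pose(static_image_mode=False) as pose:
        for i in range(n_frames):
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input, while OpenCV reads frames as BGR.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                keypoints = np.array(
                    [[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in results.pose_landmarks.landmark]
                ).flatten()               # 33 landmarks x 4 values = 132 features
            else:
                keypoints = np.zeros(33 * 4)   # no person detected in this frame
            np.save(os.path.join(out_dir, f"{i}.npy"), keypoints)
    cap.release()
```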

Scrub-2
In this step, the data obtained from Obtain-2 were labeled with a class code for every 25 frames of data (Amershi et al., 2019). The class codes mentioned ranged from 0 to 9 for the run, walk, skip, jack, jump, pjump, side, wave2, wave1, and bend, respectively.

Explore-2
The data obtained from Scrub-2 were multidimensional arrays for each training, validation, and testing data. Those data were distinguished into 2 groups, namely X and y. Table 2 shows the array dimension of data X. Meanwhile, data y had the array dimension, as shown in Table 3 below.
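A sketch of how the per-frame .npy files and class codes from Scrub-2 might be assembled into the X and y arrays is given below, assuming a Keras-style one-hot encoding. The directory layout and helper name are assumptions; the resulting shapes match the reported 60:20:20 split (e.g., 270 training videos of 25 frames with 132 features, and 10 classes).

```python
# Sketch of Scrub-2 + Explore-2: label sequences with class codes 0..9 and
# stack them into X (videos, frames, features) and one-hot y arrays.
import os
import numpy as np
from tensorflow.keras.utils import to_categorical

ACTIONS = ["run", "walk", "skip", "jack", "jump",
           "pjump", "side", "wave2", "wave1", "bend"]   # class codes 0..9

def load_split(split_dir: str, n_frames: int = 25):
    sequences, labels = [], []
    for code, action in enumerate(ACTIONS):
        action_dir = os.path.join(split_dir, action)
        for video in sorted(os.listdir(action_dir)):
            frames = [np.load(os.path.join(action_dir, video, f"{i}.npy"))
                      for i in range(n_frames)]
            sequences.append(frames)
            labels.append(code)
    X = np.array(sequences)                               # e.g. (270, 25, 132)
    y = to_categorical(labels, num_classes=len(ACTIONS))  # e.g. (270, 10)
    return X, y

X_train, y_train = load_split("data/train")   # assumed directory layout
```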

Model
The neural network model design was adapted from 4 LSTM models used in prior research (Ghosh, 2021; Lakkapragada et al., 2022; Zhang et al., 2017). We denoted those models as VA-LSTM-SYSU, VA-LSTM-SBU, LSTM-PASL, and LSTM-AHM (Figure 4 to Figure 7). Furthermore, we also designed 3 model variants as a comparison to the former models. Those models shared the same hyperparameter configuration, namely (25, 132) for the input shape, the Nadam optimizer, categorical cross-entropy for the loss function, categorical accuracy as the metric, 200 epochs, and shuffle set to True.
In addition to the initialized hyperparameters above, the models were constructed following several recommendations, such as keeping the number of LSTM layers to a minimum, using dropout (especially variational dropout), and using a small batch size (Reimers & Gurevych, 2017). To optimize the models, this research used the ModelCheckpoint function (Keras, n.d.-b) in the training process to save the best model weights over the epochs based on the value of validation categorical accuracy.
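A minimal sketch of this checkpointing, assuming the TensorFlow/Keras implementation of the callback, is shown below. The file name is an assumption, and the commented fit call only restates the hyperparameters listed above.

```python
# Sketch of saving the best weights by validation categorical accuracy.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    "model1_best.h5",                        # assumed file name
    monitor="val_categorical_accuracy",      # validation categorical accuracy
    save_best_only=True,                     # keep only the best weights
    mode="max",
    verbose=1,
)

# history = model.fit(X_train, y_train,
#                     validation_data=(X_val, y_val),
#                     epochs=200, batch_size=4, shuffle=True,
#                     callbacks=[checkpoint])
```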
Referring to the recommendations above, three neural network models were designed.

a. Model 1. This model contained 2 LSTM layers, 2 Dropout layers, and 2 Dense layers. The dropout layers were positioned after each LSTM layer to reduce the probability of overfitting (McCullum, 2020). Figure 8 shows the configuration of the Model 1 architecture with its hyperparameters. Other hyperparameter configurations in Model 1 were 0.0001 for the learning rate and 4 for the batch size; a minimal sketch of this configuration is given after this list.

b. Model 2. Figure 9 shows Model 2, constructed of 2 LSTM layers and 2 Dense layers. The dropout was configured as a variational dropout through the recurrent_dropout variable. Other hyperparameters configured in Model 2 were 0.000075 for the learning rate and 2 for the batch size.

c. Model 3. In Model 3, the architecture was constructed similarly to Model 2's but used a different activation function in its LSTM layers, namely TanH (the default value). The recurrent_dropout value was also reduced to 0.2. Figure 10 shows the architecture of Model 3 followed by its hyperparameters. Other hyperparameters were left unchanged with the same values as Model 2's.
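Below is a minimal Keras sketch consistent with the description of Model 1 (2 LSTM, 2 Dropout, and 2 Dense layers, Nadam with a learning rate of 0.0001, categorical cross-entropy, and input shape (25, 132)). The numbers of units and the dropout rate are not stated in the text and are assumptions; Models 2 and 3 instead configured dropout through the recurrent_dropout argument of the LSTM layers.

```python
# Sketch of Model 1 as described above; unit counts and dropout rate are assumed.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Nadam

model1 = Sequential([
    LSTM(64, return_sequences=True, input_shape=(25, 132)),  # units assumed
    Dropout(0.2),                                            # rate assumed
    LSTM(64),                                                 # units assumed
    Dropout(0.2),
    Dense(32, activation="relu"),                             # units assumed
    Dense(10, activation="softmax"),                          # 10 action classes
])

model1.compile(optimizer=Nadam(learning_rate=0.0001),
               loss="categorical_crossentropy",
               metrics=["categorical_accuracy"])
```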

Interpret
The interpretation was conducted by comparing the model performance based on the accuracy and loss values from the training, validation, and testing processes. A confusion matrix method was used to evaluate the testing process (Xu et al., 2020), followed by the calculation of accuracy, error rate, precision, recall, and F1-score. Each calculation implemented micro-averaging and macro-averaging methods for the multi-class classification problem (Chinchor, 1992; Sokolova & Lapalme, 2009).
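A minimal sketch of this evaluation step, assuming scikit-learn is available, is shown below. X_test, y_test, and model1 refer back to the earlier sketches and are assumed names, not identifiers from the original study.

```python
# Sketch of the confusion-matrix evaluation with micro- and macro-averaged metrics.
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_recall_fscore_support)

true_classes = np.argmax(y_test, axis=1)                 # one-hot -> class codes
pred_classes = np.argmax(model1.predict(X_test), axis=1)

cm = confusion_matrix(true_classes, pred_classes)        # 10 x 10 matrix
accuracy = accuracy_score(true_classes, pred_classes)
error_rate = 1.0 - accuracy

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        true_classes, pred_classes, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```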
After those processes, this research continued with a simple prediction implementation on a demo video. The demo video had a resolution of 640 × 360 pixels and a frame rate of 25 fps. An actor performed the actions in the demo video with 2 variations for each action. Each variation of an action was performed in 2 seconds.
Each trained model would be applied sequentially, and the results were then predicted based on the key-point values in the latest 25 frames of the video. The changes in detection would be recorded and interpreted in the form of narrative text.
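A minimal sketch of this sliding-window prediction, under the same assumptions as the earlier sketches, is given below. extract_keypoints is a hypothetical per-frame helper returning the 132-value key-point vector (compare the extraction sketch in Obtain-2), and the demo file name is an assumption.

```python
# Sketch of predicting the action from the latest 25 frames of a demo video.
import cv2
import numpy as np

window = []                                   # holds the latest 25 frame vectors
cap = cv2.VideoCapture("demo_video.mp4")      # assumed file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    keypoints = extract_keypoints(frame)      # hypothetical helper: 132-value vector
    window.append(keypoints)
    window = window[-25:]                     # keep only the latest 25 frames
    if len(window) == 25:
        probs = model1.predict(np.expand_dims(window, axis=0))[0]
        action = ACTIONS[int(np.argmax(probs))]
        print(action)                         # narrate the detected action
cap.release()
```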

Long Short-Term Memory
Long Short-Term Memory (LSTM) is a variant of the Recurrent Neural Network (RNN) designed for temporal-dependent models with better accuracy than the traditional RNN (Sak et al., 2014). LSTM was first introduced by Hochreiter and Schmidhuber (Hochreiter & Schmidhuber, 1997) to address error back-flow problems, i.e., errors that blow up or vanish during backpropagation.
The visual form of the LSTM algorithm is illustrated in Figure 11. In mathematical form, LSTM has a calculation sequence involving the forget gate ($f_t$), the input gate ($i_t$), the new candidate value that can be added to the cell state ($\tilde{C}_t$), the cell state ($C_t$), the output gate ($o_t$), and the output at order t ($h_t$). Firstly, $f_t$ is calculated with Equation 1:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

where $\sigma$ is the sigmoid function, $W_f$ is the weight value for $f_t$, $h_{t-1}$ is the output value before order t, $x_t$ is the input value at order t, and $b_f$ is the bias value for $f_t$.

After the $f_t$ calculation, the data are processed with $i_t$ through Equation 2 and $\tilde{C}_t$ in Equation 3:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$

where $W_i$ is the weight value for $i_t$, $W_C$ is the weight value for $\tilde{C}_t$, $b_i$ is the bias in $i_t$, and $b_C$ is the bias in $\tilde{C}_t$. After $f_t$, $i_t$, and $\tilde{C}_t$ are obtained, $C_t$ can be calculated through Equation 4:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4}$$

where $C_{t-1}$ is the cell state value before order t. The $o_t$ value is obtained with Equation 5, and the output $h_t$ then follows from Equation 6:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$
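To make the calculation sequence concrete, the following is a minimal NumPy sketch of a single LSTM step following Equations (1) to (6). The weight matrices here are random placeholders (in a trained network they are learned), and the layer sizes are illustrative assumptions.

```python
# Worked single-step example of the LSTM equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 132, 64                       # example input and hidden sizes
rng = np.random.default_rng(0)
W_f, W_i, W_C, W_o = (rng.standard_normal((n_hidden, n_hidden + n_in)) * 0.1
                      for _ in range(4))       # placeholder weights
b_f = b_i = b_C = b_o = np.zeros(n_hidden)     # placeholder biases

x_t = rng.standard_normal(n_in)                # input at order t
h_prev = np.zeros(n_hidden)                    # previous output h_{t-1}
C_prev = np.zeros(n_hidden)                    # previous cell state C_{t-1}

concat = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ concat + b_f)              # Equation (1): forget gate
i_t = sigmoid(W_i @ concat + b_i)              # Equation (2): input gate
C_tilde = np.tanh(W_C @ concat + b_C)          # Equation (3): candidate values
C_t = f_t * C_prev + i_t * C_tilde             # Equation (4): new cell state
o_t = sigmoid(W_o @ concat + b_o)              # Equation (5): output gate
h_t = o_t * np.tanh(C_t)                       # Equation (6): output at order t
```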

Results Analysis
This research presented reports from our seven models' training, validation, and testing phases. For training and validation, charts of the loss and accuracy rates were compared across their epoch stages. The trained and weighted models were saved locally as .h5 files by considering the highest validation categorical accuracy reached during the epochs.

Training and Validation Analysis
The training and validation process of the VA-LSTM-SYSU model can be seen in Figure 12. This model reached its best weight at the 189th epoch. It had a training loss = 0.1923, training categorical accuracy = 0.9407, validation loss = 0.5696, and validation categorical accuracy = 0.8778. The training duration of this model was 2 minutes 33 seconds. Figure 13 shows the training and validation chart of the VA-LSTM-SBU model. The best weight was obtained at the 186th epoch. The loss and accuracy values were training loss = 0.3745, training categorical accuracy = 0.8704, validation loss = 0.5259, and validation categorical accuracy = 0.8556, with 5 minutes 35 seconds of training time.
LSTM-PASL obtained its best weight at the 92nd epoch of the training process. As shown in Figure 14, the values were training loss = 0.0576, training categorical accuracy = 0.9963, validation loss = 0.5801, and validation categorical accuracy = 0.8444. The model training took 5 minutes and 52 seconds. The LSTM-AHM model acquired its best weight at the 186th epoch, with training loss = 0.2337, training categorical accuracy = 0.9295, validation loss = 0.6382, and validation categorical accuracy = 0.8000. This training and validation process can be seen in Figure 15 and was done in 1 minute 4 seconds.
Our Model 1 showed its best training and validation at the 153rd epoch, as shown in Figure 16. The charts of training and validation of Model 3 are depicted in Figure 18. In 24 minutes 52 seconds, this process yielded the best weight at the 103rd epoch with training loss = 0.0682, training categorical accuracy = 0.9852, validation loss = 0.5674, and validation categorical accuracy = 0.8556.
The seven models, with their respective weights obtained from the training and validation processes, were then tested in the testing process. The results from the testing process were evaluated using a confusion matrix along with calculations of average accuracy, error rate, precision, recall, and F1-score, in both micro and macro variants.
All seven matrices in Figure 19 show fairly good prediction results for the jack, jump, pjump, side, wave2, wave1, and bend actions. This is shown by the True Positive (TP) values, which ranged from 6 to 9. In contrast, the prediction of the run, walk, and skip actions was not good enough; the matrices show this in the TP values of the individual actions, which vary from 2 to 7.
To summarize the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values, we present Table 4 and Table 5. The average accuracy, error rate, precision, recall, and F1-score were then evaluated in Table 6, including the training time comparison of each model. The times are formatted in units of minutes (′) and seconds (″).

Model Implementation on a Demo Video
The models that passed through the training and validation processes yielded weight values that could be used for action classification. With a self-recorded video, this research used a simple Python program in a Jupyter Notebook to implement the classification testing for each model. This implementation resulted in the action classifications shown in Table 7.

Conclusion
According to this research, there were several points to conclude, concerning model construction, evaluation results, and model implementation results.
Artificial Neural Network modeling using MediaPipe and the Long Short-Term Memory architecture could be done with various combinations of input, hidden, and output layers. The number of input neurons should be customized to the input data, whereas the number of output neurons should fit the number of classification classes. In the hidden layer section, customization consisted of the number of LSTM layers, Dropout layers, Dense layers, and/or Flatten layers, as well as their hyperparameters.
The classification accuracies of all seven LSTM models were in the range of 77% to 82%. The highest accuracy was obtained by Model 1, whereas LSTM-AHM model obtained the lowest. From the training time aspect, LSTM-AHM model had the fastest duration, namely 1 minute 4 seconds. In contrast, Model 2 had the longest duration, 54 minutes 13 seconds.
The detections yielded fluctuating classifications in the demo-video implementation phase compared to the testing results. This condition indicated differences in model behavior when classifying static data (from the training, validation, and testing datasets) compared to dynamic data (from the demo video). Therefore, the generalization and detection consistency of the seven models were not good enough.