Holistic detection (tracking face, eyes and hands) with a CSV output
- Detect the face, eyes, posture and hands using MediaPipe Holistic
- Capture and save a still image with the tracking overlay for every frame
- Export a CSV file with the x, y, z coordinates of each region, frame by frame
- face_eye_hand_detection_excel.py --> The main Python script
- Yourvideo.csv --> An example of the exported CSV file
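
As a rough illustration of how per-frame landmarks can be flattened into one CSV row per frame, here is a minimal sketch. The column naming scheme (`hand_0_x`, ...), the helper names `landmark_columns` and `landmark_row`, and the use of empty cells for undetected regions are assumptions made for this example. They are not necessarily the layout of Yourvideo.csv. The standard `csv` module is used here only to keep the sketch dependency-free.

```python
import csv
import io


def landmark_columns(region, count):
    """Build column names such as face_0_x, face_0_y, face_0_z, face_1_x, ..."""
    return [f"{region}_{i}_{axis}" for i in range(count) for axis in ("x", "y", "z")]


def landmark_row(points, count):
    """Flatten a list of (x, y, z) tuples into one flat list of values.

    When a region was not detected in a frame (points is None), the cells
    are left empty so every row keeps the same number of columns.
    """
    if points is None:
        return [""] * (count * 3)
    return [value for point in points for value in point]


# Hypothetical region with only two landmarks, to keep the example short.
header = ["frame"] + landmark_columns("hand", 2)
rows = [
    [0] + landmark_row([(0.1, 0.2, 0.0), (0.3, 0.4, -0.1)], 2),
    [1] + landmark_row(None, 2),  # region not detected in this frame
]

# Write to an in-memory buffer; a real run would open the CSV file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With real MediaPipe results, the `(x, y, z)` tuples would come from the `.x`, `.y` and `.z` attributes of each landmark, and a region that was not detected in a frame would map to `None`.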
Before running the script
- Open Anaconda Prompt
- conda create -n mediapipe python=3.8
- conda activate mediapipe
Install these dependencies (mediapipe, opencv, numpy, pandas):
pip install mediapipe
pip install opencv-python
pip install numpy
pip install pandas
python face_eye_hand_detection_excel.py
Note: this script processes only one video per run. To process several videos, wrap the processing in a loop.
VIDEO_PATH = "Yourvideo.mp4" (change this to the name of your video in the current directory)
path_image = "Yourvideo" (create a folder in the current directory to hold the output images)
outputfile_csv = "Yourvideo.csv" (the CSV file that stores the output values)
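
For several videos, the three settings above could be derived from each video's file name instead of being edited by hand. The following is a minimal sketch under that assumption. `plan_outputs`, `run_batch` and the `process_video` callable are hypothetical names introduced for this example. `process_video` stands in for the body of the main script and is not a function the script currently provides.

```python
from pathlib import Path


def plan_outputs(video_dir, pattern="*.mp4"):
    """Return (video, image folder, csv file) paths for every matching video."""
    jobs = []
    for video in sorted(Path(video_dir).glob(pattern)):
        image_dir = video.with_suffix("")      # Yourvideo.mp4 -> Yourvideo
        csv_path = video.with_suffix(".csv")   # Yourvideo.mp4 -> Yourvideo.csv
        jobs.append((video, image_dir, csv_path))
    return jobs


def run_batch(video_dir, process_video):
    """Create each output folder, then hand the three paths to process_video.

    process_video is a hypothetical callable taking
    (VIDEO_PATH, path_image, outputfile_csv) as strings.
    """
    for video, image_dir, csv_path in plan_outputs(video_dir):
        image_dir.mkdir(parents=True, exist_ok=True)
        process_video(str(video), str(image_dir), str(csv_path))
```

Calling `run_batch(".", process_video)` would then process every `.mp4` file in the current directory in alphabetical order.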
- A screenshot of the static picture with face, eye and hand tracking
- The top-left corner shows the FPS of the current frame
- If anything covers the face, the estimated face and eye tracking may be less accurate
- x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width and height respectively.
- z: Should be discarded as currently the model is not fully trained to predict depth, but this is something on the roadmap.
- visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being visible (present and not occluded) in the image.
- A list of 468 face landmarks. Each landmark consists of x, y and z. x and y are normalized to [0.0, 1.0] by the image width and height respectively.
- z represents the landmark depth with the depth at center of the head being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.
- A list of 21 hand landmarks on the left hand. Each landmark consists of x, y and z. x and y are normalized to [0.0, 1.0] by the image width and height respectively.
- z represents the landmark depth with the depth at the wrist being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.
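
Because x and y are normalized by the image width and height, they have to be scaled back before they can be used as pixel positions. The sketch below shows one way to do that. The function name `to_pixel` is introduced for this example, and clamping to the image bounds is an added assumption to guard against values slightly outside [0.0, 1.0].

```python
def to_pixel(x_norm, y_norm, image_width, image_height):
    """Convert normalized landmark coordinates to integer pixel coordinates.

    The result is clamped to the valid pixel range (an assumption of this
    sketch), so x is in [0, image_width - 1] and y in [0, image_height - 1].
    """
    x_px = min(int(x_norm * image_width), image_width - 1)
    y_px = min(int(y_norm * image_height), image_height - 1)
    return max(x_px, 0), max(y_px, 0)


# A landmark in the horizontal center, one quarter down a 640x480 frame.
center = to_pixel(0.5, 0.25, 640, 480)
```

The same scaling applies to the x and y values stored in the CSV file, provided the frame size of the source video is known.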