This project allows a user to control a Roku Stick through computer vision and machine learning. It uses MediaPipe to detect hand landmarks, which are then normalized and passed into a pre-trained Logistic Regression model to classify the current hand pose. Based on the predicted pose and its confidence score, a corresponding Roku remote command (e.g., volume, navigation, select) is triggered using HTTP requests.
This system captures live video through OpenCV and uses MediaPipe's hand-tracking module to extract 3D keypoint data from the user's hand. The landmarks are normalized by translating them so the wrist sits at the origin, flattened into a vector, and then fed into a trained scikit-learn model that predicts the current pose.
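For reference, the capture-and-landmark step with OpenCV and MediaPipe looks roughly like this (a minimal sketch, not the exact code in main.py; parameter values are illustrative):

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB frames; OpenCV captures BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        # 21 landmarks, each with normalized x, y and relative z
        keypoints = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
    cv2.imshow("hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```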
Each recognized pose corresponds to a Roku remote command (e.g., volume, navigation, select). To avoid accidental actions, a neutral pose is required before any command is triggered.
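A minimal sketch of how that neutral-pose gating can work (the threshold value and pose names below are illustrative, not necessarily the ones used in main.py):

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # illustrative value; the real threshold lives in main.py
ready = False               # becomes True once a neutral pose has been seen

def maybe_trigger(pose: str, confidence: float) -> Optional[str]:
    """Return the command to send, or None, enforcing the neutral-pose rule."""
    global ready
    if confidence < CONFIDENCE_THRESHOLD:
        return None               # ignore low-confidence predictions entirely
    if pose == "neutral":
        ready = True              # arm the system; the next confident pose may fire
        return None
    if ready:
        ready = False             # disarm until neutral is seen again
        return pose               # e.g. "volume_up", "swipe", "select"
    return None
```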
- Python 3.x
- OpenCV
- MediaPipe
- NumPy
- scikit-learn
- curl (for Roku control via HTTP)
- Change the model_path variable in main.py to the path to the model on your computer
- Change the ip_address variable inside make_action() in utils.py to your Roku's IP address (e.g., '100.100.100.100'); see the sketch below
- In your Roku Stick's settings, go to Settings > Network Access > Permissive (commands will not work if the device cannot be accessed over the network)
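For context, Roku's External Control Protocol accepts keypresses as HTTP POST requests on port 8060, so make_action() can be as simple as the sketch below (the actual implementation in utils.py may differ; key names like 'VolumeUp' and 'Select' are standard ECP keys):

```python
import subprocess

def make_action(key: str, ip_address: str = "100.100.100.100") -> None:
    """Send a single keypress (e.g. 'VolumeUp', 'VolumeDown', 'Up', 'Select') to the Roku."""
    url = f"http://{ip_address}:8060/keypress/{key}"
    # Roku's External Control Protocol expects an empty-bodied POST
    subprocess.run(["curl", "-X", "POST", url], check=False)
```

Using curl keeps the dependency list as-is; swapping in an HTTP library such as requests would work just as well.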
- Recorded 7 different videos of myself performing 7 different poses, 50 times with each hand (100 repetitions per pose in total).
- Created a data collection program that records a sample on keypress.
- Manually watched every video and pressed the data collection button twice at the end of every pose to try to eliminate bias. (This came out to about 200 labeled samples per class.)
- Normalized each keypoint (x, y, z) by subtracting that sample's wrist keypoint from every keypoint.
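A short sketch of that normalization step (in MediaPipe's hand model, landmark 0 is the wrist; the function name here is just illustrative):

```python
import numpy as np

def normalize_landmarks(keypoints):
    """keypoints: 21 (x, y, z) tuples from MediaPipe hand tracking.

    Returns a flat 63-element feature vector with the wrist at the origin.
    """
    pts = np.array(keypoints, dtype=np.float32)  # shape (21, 3)
    pts -= pts[0]                                # subtract the wrist keypoint from every keypoint
    return pts.flatten()                         # shape (63,), ready for the classifier
```

At inference time the same vector is what goes into the scikit-learn model, e.g. something like model.predict_proba([features]) to get the confidence score.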
(If someone wants to improve this, please do!)
- Add a feature that automatically discovers the user's Roku IP address (see the sketch after this list)
- Model struggles to detect hands from about 3 feet away. (Needs more training data for all poses.)
- Model is mostly consistent but has trouble differentiating between swipe and select. (Currently, the user must hold up 4 fingers and tuck the thumb to the palm before swiping; otherwise, it gets classified as "select.")
  - Possible fix: add more training data or tweak parameters in the main if logic.
- Add more gestures to support additional Roku commands (e.g., power on/off, home, selecting specific channels, etc.)
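For the automatic IP discovery idea above: Roku devices answer SSDP M-SEARCH queries with the search target roku:ecp, so a discovery helper could look something like this (a sketch under that assumption; the function name is just a suggestion and it is untested on every network setup):

```python
import socket

def discover_roku(timeout: float = 3.0):
    """Broadcast an SSDP M-SEARCH and return the first Roku IP found, or None."""
    msearch = (
        "M-SEARCH * HTTP/1.1\r\n"
        "HOST: 239.255.255.250:1900\r\n"
        'MAN: "ssdp:discover"\r\n'
        "ST: roku:ecp\r\n"
        "MX: 2\r\n\r\n"
    ).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.settimeout(timeout)
    sock.sendto(msearch, ("239.255.255.250", 1900))
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            if b"roku:ecp" in data.lower():
                return addr[0]  # the responder's address is the Roku's IP
    except socket.timeout:
        return None
    finally:
        sock.close()
```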
This project was created to get hands-on experience with creating and cleaning keypoint data gathered from pose estimation models like MediaPipe, and with training a low-cost AI model. It is something of a stepping stone, since it uses components I will need for a future project. It was a fun experience, and I plan to come back to it and further improve the model's accuracy so it can become a serious alternative to using a remote.