We are portraying a single user in front of a fixed camera, interacting with a computer by performing gestures to
- play a game,
- remotely control appliances or robots, or
- learn to perform gestures from educational software.
Kinect™ has revolutionized the field of gesture recognition because it is an affordable device (priced like a high-end webcam) providing both RGB and depth images. Depth images considerably facilitate image segmentation. We have collected a large dataset of 50,000 gestures with Kinect™. We provide Matlab™ code to browse through the data and process it to create a sample submission. The data can also be viewed with most video viewers; see the README file for details.
We provide both the RGB image and the depth image, as in the example below.
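Because the depth channel assigns every pixel a distance from the camera, separating the user from the background reduces to a simple range test. The sketch below (illustrative Python rather than the Matlab™ code we distribute; the function name and depth units are our own) shows the idea on a toy depth image:

```python
import numpy as np

def segment_by_depth(depth, near, far):
    """Return a boolean mask of pixels whose depth lies in [near, far].

    `depth` is a 2-D array of per-pixel depth values; `near`/`far` bound
    the slab of space the user is expected to occupy. Pixels outside the
    slab (background wall, floor) are masked out.
    """
    return (depth >= near) & (depth <= far)

# Toy 4x4 "depth image" (metres): the user stands at roughly 1.2 m,
# the background wall at 3 m.
depth = np.array([[3.0, 3.0, 1.2, 1.2],
                  [3.0, 1.1, 1.2, 3.0],
                  [3.0, 1.2, 1.3, 3.0],
                  [3.0, 3.0, 3.0, 3.0]])
mask = segment_by_depth(depth, near=0.8, far=1.5)
```

The same range test on an RGB image alone would be far harder, since it would have to cope with background clutter, clothing colors, and lighting.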
The data are organized in batches:
- final01 (final evaluation data for round 1)
- final21 (final evaluation data for round 2, not published yet)
Each batch includes 100 recorded gestures grouped in sequences of 1 to 5 gestures performed by the same user. The gestures are drawn from a small vocabulary of 8 to 15 unique gestures, which we call a "lexicon" (see a few examples of the lexicons we used).
We selected lexicons from nine categories corresponding to various settings or application domains; they include (1) body language gestures (like scratching your head or crossing your arms), (2) gesticulations performed to accompany speech, (3) illustrators (like Italian gestures), (4) emblems (like Indian Mudras), (5) signs (from sign languages for the deaf), (6) signals (like referee signals, diving signals, or marshalling signals to guide machinery or vehicles), (7) actions (like drinking or writing), (8) pantomimes (gestures made to mimic actions), and (9) dance postures.
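For concreteness, a batch can be modelled in memory as a lexicon plus a list of recorded sequences, each carrying one to five gesture labels. This is only an illustrative sketch (the class and field names are hypothetical; the actual file layout is described in the README):

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    video_id: str   # identifier of one recording (hypothetical naming)
    labels: list    # 1 to 5 gesture labels, in the order performed

@dataclass
class Batch:
    name: str                  # e.g. "final01"
    lexicon: list              # the 8 to 15 unique gesture labels
    sequences: list = field(default_factory=list)

    def total_gestures(self):
        # each batch contains 100 recorded gestures overall
        return sum(len(s.labels) for s in self.sequences)
```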
Goal of the challenge: one-shot-learning
For the develXX batches, we provide all the labels. For the validXX and finalXX batches, we provide labels only for one example of each class. The goal is to predict the gesture class labels for the remaining gesture sequences.
During the development period, performance feedback will be provided online on the validXX batches. The final evaluation will be carried out on the finalXX batches, and the final results will be revealed only when the challenge is over.
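With a single labeled example per class, a natural baseline is template matching: keep the one labeled sequence per class as a template and assign each unlabeled gesture to the class whose template is nearest under dynamic time warping. The sketch below (illustrative Python, not the distributed Matlab™ code; function names are our own) implements this one-shot nearest-neighbor baseline on generic feature sequences:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences of
    shape [T, d]; the standard O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify_one_shot(templates, query):
    """templates: {label: sequence}, with exactly ONE sequence per class
    (the single labeled example). Returns the nearest label under DTW."""
    return min(templates, key=lambda lbl: dtw_distance(templates[lbl], query))
```

DTW matters here because two performances of the same gesture rarely have the same speed or duration; the warping aligns them before distances are compared.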
What is easy about the data:
- Fixed camera
- Availability of depth data
- Single user within a batch
- Homogeneous recording conditions within a batch
- Small vocabulary within a batch
- Gestures separated by returning to a resting position
- Gestures performed mostly by arms and hands
- Camera framing mostly the upper body (some exceptions)
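The resting-position convention above makes temporal segmentation tractable: frames that closely resemble a reference resting frame mark the boundaries between gestures. A minimal sketch (illustrative Python; the threshold and per-frame features are our own assumptions, not part of the distributed code):

```python
import numpy as np

def segment_sequence(frames, rest_frame, thresh):
    """Split a recording into candidate gestures using returns to rest.

    `frames`: list of per-frame feature arrays (e.g. flattened depth).
    A frame counts as 'resting' when its mean absolute difference from
    `rest_frame` falls below `thresh`. Returns (start, end) index pairs
    for the non-resting runs, i.e. the candidate gestures.
    """
    moving = [np.abs(f - rest_frame).mean() > thresh for f in frames]
    bounds, start = [], None
    for t, m in enumerate(moving):
        if m and start is None:
            start = t                      # gesture begins
        elif not m and start is not None:
            bounds.append((start, t))      # gesture ends at rest
            start = None
    if start is not None:                  # recording ends mid-gesture
        bounds.append((start, len(frames)))
    return bounds
```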
What is hard about the data:
- Only one labeled example of each unique gesture
- Variations in recording conditions (various backgrounds, clothing, skin colors, lighting, temperature, resolution)
- Some parts of the body may be occluded
- Some users are less skilled than others
- Some users made errors or omissions in performing the gestures
We provide some data annotations, including temporal segmentation into isolated gestures and body part annotations (head, shoulders, elbows, and hands).
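One use of the body part annotations is to normalize features across users and camera framings, e.g. expressing hand positions in a body-centred frame anchored at the head and scaled by shoulder width. A sketch (illustrative Python; the function and the (x, y) layout are hypothetical, so adapt to the actual annotation files):

```python
import numpy as np

def normalize_hands(head, shoulders, hands):
    """Express hand positions in a body-centred frame: origin at the
    head, scaled by shoulder width, so that coordinates are comparable
    across users of different sizes and distances to the camera.

    head: (x, y); shoulders: 2x2 array of the two shoulder positions;
    hands: Nx2 array of hand positions.
    """
    width = np.linalg.norm(shoulders[0] - shoulders[1])
    return (hands - head) / width
```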