During a development period starting in June 2011 and ending in June 2012 (see the schedule), the participants will develop their systems using their own software and hardware and any data they want. The participants will have to interface their system with the scoring software provided by the organizers (a first version will be released in summer 2011, and registered participants will be notified whenever new versions are released). The competition will take place in June 2012 at a live event where the competitors will have to bring their systems for evaluation (tentatively at CVPR 2012).

The participants may optionally:

    • participate in a data exchange following strict specifications;
    • submit performance results obtained with the evaluation software;
    • participate in a milestone event to demonstrate their work-in-progress and exchange ideas (at ICCV 2011 in November);
    • contribute a paper to a special topic of JMLR on gesture recognition;
    • contribute a video demonstrating their system.

Active participation will be rewarded with travel awards to attend the final live competition.

The organizers will provide the scoring software, sample code, and sample evaluation data to registered participants. Registration is completed by signing up for the Google group gesturechallenge, which requires having a Google account.


The participants will be evaluated on "single signer", "small vocabulary" recognition tasks involving short continuous sequences of signs. The camera will frame the upper torso. A typical vocabulary consists of 8 to 15 signs or hand gestures; occasionally a gesture may include head movements or facial expressions. Emphasis will be put on learning from very few examples, and in particular on learning from just one example (one-shot learning).

A typical task will consist of recognizing short sequences of one to five signs or gestures. The data are organized in batches of approximately 50 gesture sequences from the same vocabulary, including a total of 100 gesture tokens. Each task always starts with the presentation of one "isolated" example of each gesture token together with its target label. Additional short sequences of gestures are then presented to the system. The system must recognize each gesture sequence and may then be provided with the correct labels for additional training (on-line learning). The performance of the system is measured by its accuracy in predicting the labels of new examples not yet used for training.
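The evaluation protocol above can be sketched as follows. This is a minimal illustration only: the batch layout and the recognizer interface (`fit_one_shot`, `predict`, `update`) are assumptions for the sake of the example, not the organizers' actual scoring software or API, and real scoring may tolerate insertions and deletions rather than comparing labels position by position.

```python
def evaluate_batch(recognizer, batch):
    """Score one batch under the one-shot / on-line learning protocol.

    batch["training"]: {label: sequence} -- one isolated example per gesture
                       (the one-shot training data).
    batch["test"]: list of (sequence, true_labels) pairs, where each
                   sequence contains one to five gestures.
    The recognizer interface here is hypothetical.
    """
    # One-shot learning: the system sees a single labeled example
    # of each gesture in the vocabulary.
    recognizer.fit_one_shot(batch["training"])

    correct = total = 0
    for sequence, true_labels in batch["test"]:
        # The system must recognize the gesture sequence first...
        predicted = recognizer.predict(sequence)
        correct += sum(p == t for p, t in zip(predicted, true_labels))
        total += len(true_labels)
        # ...and only then receives the correct labels for
        # additional training (on-line learning).
        recognizer.update(sequence, true_labels)

    # Accuracy on examples not yet used for training at prediction time.
    return correct / total
```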

At the site of the final live evaluation, the participants will be provided with pre-recorded tasks and evaluated quantitatively with the evaluation software. The best systems will then give a qualitative live demonstration in front of an audience. A panel of experts will declare the winner based on the quantitative and qualitative evaluations.