Please review the NEW RULES (updated because of Kaggle sponsorship).
You can download the development and validation data from here.
1. Challenge description and rules
The focus of the challenge is on “multiple instance, user independent learning” of gestures, which means learning to recognize gestures from several instances for each category performed by different users, drawn from a gesture vocabulary of 20 categories. A gesture vocabulary is a set of unique gestures, generally related to a particular task. In this challenge we will focus on the recognition of a vocabulary of 20 Italian cultural/anthropological signs.
We highly recommend that the participants take advantage of this opportunity and upload regularly updated versions of their code during the development period. Their last code submission before the deadline will be used for the verification.
What do you need to predict?
The development data contains the recording of multi-modal RGB-Depth-Audio data and user mask and skeleton information of 7,754 gesture instances from a vocabulary of 20 gesture categories of Italian signs. Next we list the types of gestures, represented by a numeric label (from 1 to 20), together with the number of training performances recorded for each gesture in brackets:
1. 'vattene' (389)
2. 'vieniqui' (390)
3. 'perfetto' (388)
4. 'furbo' (388)
5. 'cheduepalle' (389)
6. 'chevuoi' (387)
7. 'daccordo' (372)
8. 'seipazzo' (388)
9. 'combinato' (388)
10. 'freganiente' (387)
11. 'ok' (391)
12. 'cosatifarei' (388)
13. 'basta' (388)
14. 'prendere' (390)
15. 'noncenepiu' (383)
16. 'fame' (391)
17. 'tantotempo' (389)
18. 'buonissimo' (391)
19. 'messidaccordo' (389)
20. 'sonostufo' (388)
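For reference, the vocabulary above can be encoded as a simple lookup table (a convenience sketch, not part of the provided tools); as a sanity check, the per-gesture counts sum to the 7,754 development instances mentioned above:

```python
# Gesture label -> (name, number of development instances), as listed above.
GESTURES = {
    1: ("vattene", 389),        2: ("vieniqui", 390),
    3: ("perfetto", 388),       4: ("furbo", 388),
    5: ("cheduepalle", 389),    6: ("chevuoi", 387),
    7: ("daccordo", 372),       8: ("seipazzo", 388),
    9: ("combinato", 388),     10: ("freganiente", 387),
   11: ("ok", 391),            12: ("cosatifarei", 388),
   13: ("basta", 388),         14: ("prendere", 390),
   15: ("noncenepiu", 383),    16: ("fame", 391),
   17: ("tantotempo", 389),    18: ("buonissimo", 391),
   19: ("messidaccordo", 389), 20: ("sonostufo", 388),
}

# The counts add up to the 7,754 gesture instances in the development data.
assert sum(n for _, n in GESTURES.values()) == 7754
```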
These labels constitute the ground truth and are provided with the development data, for which a soft segmentation annotation has been performed.
Then, in the data used for evaluation (called validation data and final evaluation data, with 3,362 and 2,742 Italian gestures, respectively), you get several video clips with annotated gesture labels for training purposes. Multiple gesture instances from several users will be available. You must predict the labels of the gestures played in the remaining unlabeled videos.
For each video, participants should provide an ordered list R of labels corresponding to the recognized gestures, with exactly one label per recognized gesture. We will compare this list to the corresponding list T of labels in the prescribed list of gestures that the user had to play. These are the "true" gesture labels.
For evaluation, we consider the so-called Levenshtein distance L(R, T), that is, the minimum number of edit operations (substitutions, insertions, or deletions) that one has to perform to go from R to T (or vice versa). The Levenshtein distance is also known as the "edit distance". For example:
L([1 2 4], [3 2]) = 2
L([1], []) = 1
L([2 2 2], [2]) = 2
The overall score we compute is the sum of the Levenshtein distances for all the lines of the result file compared to the corresponding lines in the truth value file, divided by the total number of gestures in the truth value file. This score is analogous to an error rate. However, it can exceed one.
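The score can be reproduced with a standard dynamic-programming edit distance. The following sketch (function names are ours, not the official scoring code) matches the examples above:

```python
def levenshtein(r, t):
    """Minimum number of substitutions, insertions, and deletions to turn r into t."""
    prev = list(range(len(t) + 1))
    for i, ri in enumerate(r, 1):
        curr = [i]
        for j, tj in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete ri
                            curr[j - 1] + 1,             # insert tj
                            prev[j - 1] + (ri != tj)))   # substitute (free on match)
        prev = curr
    return prev[-1]

def challenge_score(predicted, truth):
    """Sum of per-video edit distances divided by the total number of true gestures.

    `predicted` and `truth` are lists of label lists, one entry per video.
    Like an error rate, but it can exceed one.
    """
    total_distance = sum(levenshtein(r, t) for r, t in zip(predicted, truth))
    total_gestures = sum(len(t) for t in truth)
    return total_distance / total_gestures
```

For instance, `levenshtein([1, 2, 4], [3, 2])` returns 2, as in the first example above.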
- Public score means the score that appears on the leaderboard during the development period and is based on the validation data.
- Final score means the score that will be computed on the final evaluation data released at the end of the development period, which will not be revealed until the challenge is over. The final score will be used to rank the participants and determine the prizes.
To verify that the participants complied with the rule that there should be no manual labeling of the test data, the top ranking participants eligible to win prizes will be asked to cooperate with the organizers to reproduce their results.
During the development period the participants can upload executable code reproducing their results together with their submissions. The organizers will evaluate requests to support particular platforms, but do not commit to supporting all platforms. The sooner a version of the code is uploaded, the higher the chances that the organizers will succeed in running it on their platform. The burden of proof will rest on the participants.
The code will be kept in confidence and used only for verification purposes after the challenge is over. The submitted code will need to be standalone; in particular, it will not be allowed to access the Internet. It will need to be capable of training models from the training examples of the final evaluation data, for each data batch, and of making label predictions on the test examples of that batch.
We split the recorded data (which can be downloaded from here) into:
- development data: fully labeled data that can be used for training and validation as desired.
- validation data: a dataset formatted in a similar way as the final evaluation data that can be used to practice making submissions on the Kaggle platform. The results on validation data will show immediately as the "public score" on the leaderboard. The validation data is slightly easier than the development data.
- final evaluation data: the dataset that will be used to compute the final score (will be released shortly before the end of the challenge).
2. Description of data structures
All the data and code can be downloaded from here.
Please, provide your feedback and questions to email@example.com
We provide the individual X_audio.ogg, X_audio.wav, X_color.mp4, X_depth.mp4, and X_user.mp4 files containing the audio, RGB, depth, and user mask videos for a given sequence X. All the sequences are recorded at 20 FPS.
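Given these naming conventions and the fixed 20 FPS rate, the per-sequence files and the frame/time alignment can be handled as follows (the helper names are hypothetical; only the file suffixes and the frame rate come from the description above):

```python
FPS = 20  # all sequences are recorded at 20 frames per second

def modality_files(sample_id):
    """Return the five per-sequence files for a given sequence identifier X."""
    suffixes = ["_audio.ogg", "_audio.wav", "_color.mp4", "_depth.mp4", "_user.mp4"]
    return [sample_id + s for s in suffixes]

def frame_to_seconds(frame_index):
    """Convert a 1-based video frame index to a timestamp in seconds at 20 FPS."""
    return (frame_index - 1) / FPS

def frame_to_audio_sample(frame_index, fs):
    """Map a video frame to the corresponding audio sample index at rate fs (Hz)."""
    return round(frame_to_seconds(frame_index) * fs)
```

For example, `modality_files("Sample00001")` yields the five file names for sequence Sample00001, and frame 21 falls exactly one second into the recording.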
We also provide a script interface, dataViewer, to visualize the samples and export the data to Matlab. If dataViewer is used, do not unzip the sample data. Just type dataViewer at the Matlab prompt.
The implemented options are:
1.- Load the sample data with the Load Data button to load one of the zip files [be patient, it takes some time to load].
2.- Visualize all the multimodal data stored in the sample.
3.- Play the audio file stored in the sample.
4.- Export the multimodal data in the following four Matlab structures:
• NumFrames: Total number of frames.
• FrameRate: Frame rate of a video.
• Audio: structure that contains audio data.
• y: audio data
• fs: sample rate for the data
• Labels: structure that contains the data about labels.
• Name: The name given to this gesture.
• Begin: It indicates the start frame of the gesture.
• End: It indicates the ending frame of the gesture.
• Label names are the 20 Italian gesture names listed in Section 1.
After exporting, an individual MAT file for each frame (Sample00001_X.mat, where X indicates the frame number) is generated, containing the following structures:
The detailed descriptions about each one of these structures are explained in the following section.
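Once exported, the Labels annotations can be turned into time segments for each gesture instance. A minimal sketch, assuming the Name, Begin, and End fields described above are accessed as plain records and that frame indices are 1-based:

```python
def gesture_segments(labels, frame_rate):
    """Return (name, begin_seconds, end_seconds) for each annotated gesture.

    `labels` mimics the exported Labels structure: a sequence of records with
    the Name, Begin, and End fields (Begin/End are 1-based frame indices).
    """
    segments = []
    for g in labels:
        begin_s = (g["Begin"] - 1) / frame_rate  # first frame starts at t = 0
        end_s = g["End"] / frame_rate            # segment ends when its last frame ends
        segments.append((g["Name"], begin_s, end_s))
    return segments
```

For example, a gesture spanning frames 1 to 20 at 20 FPS covers the first second of the sequence.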
On the generated MAT files
A generated MAT file stores the selected visual data in four structures named 'RGB', 'Depth', 'UserIndex', and 'Skeleton'. The data of each structure is aligned with the RGB data. The following subsections describe each of these structures.
• RGB Frame: This matrix represents the RGB color image.
• Depth Frame: The Depth matrix contains the pixel-wise z-component of the point cloud. The depth value is expressed in millimeters.
• User Index: The UserIndex matrix represents the player index of each pixel of the depth image. A non-zero pixel value signifies that a tracked subject occupies the pixel; a value of 0 denotes that no tracked subject occupies the pixel.
• Skeleton Frame: The Skeletons array contains one Skeleton structure per tracked subject, holding the joint positions and bone orientations that comprise a skeleton. The format of a Skeleton structure is given below.
• JointsType can be one of the 20 Kinect skeleton joints: HipCenter, Spine, ShoulderCenter, Head, ShoulderLeft, ElbowLeft, WristLeft, HandLeft, ShoulderRight, ElbowRight, WristRight, HandRight, HipLeft, KneeLeft, AnkleLeft, FootLeft, HipRight, KneeRight, AnkleRight, FootRight.
• WorldPosition: The world coordinates position structure represents the global position of a tracked joint.
• The X value represents the x-component of the subject’s global position (in millimeters).
• The Y value represents the y-component of the subject’s global position (in millimeters).
• The Z value represents the z-component of the subject’s global position (in millimeters).
• PixelPosition: The pixel coordinates position structure represents the position of a tracked joint. The format of the Position structure is given as follows.
• The X value represents the x-component of the joint location (in pixels).
• The Y value represents the y-component of the joint location (in pixels).
• The Z value represents the z-component of the joint location (in pixels).
• WorldRotation: The world rotation structure contains the orientations of skeletal bones in terms of absolute transformations, i.e., the orientation of each bone in the 3D camera space. It is a 20x4 matrix, where each row contains the W, X, Y, Z values of the quaternion related to the rotation.
• The X value represents the x-component of the quaternion.
• The Y value represents the y-component of the quaternion.
• The Z value represents the z-component of the quaternion.
• The W value represents the w-component of the quaternion.
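Each WorldRotation row (W, X, Y, Z) is a unit quaternion. To apply it, for example to rotate joint offsets into camera space, it can be converted to a 3x3 rotation matrix using the standard formula (a sketch, not part of the provided tools):

```python
def quat_to_matrix(w, x, y, z):
    """Convert a unit quaternion (W, X, Y, Z) to a 3x3 rotation matrix."""
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]
```

The identity quaternion (1, 0, 0, 0) maps to the identity matrix, and (0, 0, 0, 1) to a 180-degree rotation about the z-axis.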