Challenge and Data description

Please review the NEW RULES 2014.

We provide scripts in Python to access the data and evaluate your predictions. The requirements are:
  • OpenCV 2.4.8
  • Python Imaging Library (PIL) 1.1.7
  • NumPy 1.8.0
The following files are common to all the tracks and can be downloaded from http://sunai.uoc.edu/chalearnLAP, tab "Source Code":
  • Code with the classes and methods to access the data.
  • Code for the evaluation methods.
The information and specific source code for each track can be downloaded below, and the competition server is found at: http://www.codalab.org/




To verify that the participants complied with the rule that there should be no manual labeling of the test data, the top ranking participants eligible to win prizes will be asked to cooperate with the organizers to reproduce their results.

During the development period the participants can upload executable code reproducing their results together with their submissions. The organizers will evaluate requests to support particular platforms, but do not commit to supporting all platforms. The sooner a version of the code is uploaded, the higher the chances that the organizers will succeed in running it on their platform. The burden of proof will rest on the participants.

The code will be kept in confidence and used only for verification purposes after the challenge is over. The submitted code will need to be standalone; in particular, it will not be allowed to access the Internet. It will need to be capable of training models from the training examples of the final evaluation data, for each data batch, and of making label predictions on the test examples of that batch.




1. Challenge description and rules for Track 1

The focus of this track is on automatic human multi-limb pose recovery on RGB data. We provide 9 sequences with 14 labelled limbs per person and frame, including 124,761 labelled human limbs for more than 8,000 frames. For each frame, we provide the RGB image and 14 binary masks corresponding to the 14 labelled limbs (if visible): Head, Torso, R-L Upper-arm, R-L Lower-arm, R-L Hand, R-L Upper-leg, R-L Lower-leg, and R-L Foot. For each binary mask, 1-valued pixels indicate the region in which the limb is contained.



1.1 Track stages:
  • Development phase: Create a learning system capable of learning a body pose recovery problem from several annotated training examples of human limbs. Practice with development data (a large database of 4,000 manually labelled frames is available) and submit predictions on-line on validation data (2,000 labelled frames) to get immediate feedback on the leaderboard. Recommended: towards the end of the development phase, submit your code for verification purposes.

  • Final evaluation phase: Make predictions on the new final evaluation data (2,234 frames) revealed at the end of the development phase. The participants will have a few days to train their systems and upload their predictions.
We highly recommend that the participants take advantage of this opportunity and upload regularly updated versions of their code during the development period. Their last code submission before deadline will be used for the verification.


1.2 Evaluation Measurement:

The Jaccard Index (overlap) will be used. Thus, for each one of the n≤14 limbs labelled for each subject at frame i, the Jaccard Index is defined as follows:

J_{i,n} = |A_{i,n} ∩ B_{i,n}| / |A_{i,n} ∪ B_{i,n}|

where A_{i,n} is the ground truth of limb n, and B_{i,n} is the prediction for the same limb at image i. For the dataset in this challenge, both A_{i,n} and B_{i,n} are binary images where ‘1’ pixels denote the region in which the n-th limb is located (ground truth) or predicted (prediction). The positive region of the ground truth A_{i,n} does not need to be a square; however, in all cases it is a polygon defined by four points. Thus, the numerator is the number of ‘1’ pixels shared by both images A_{i,n} and B_{i,n}, and the denominator is the number of ‘1’ pixels in their union, obtained by applying the logical OR operator.

In the case of false positives (e.g., predicting a limb that is not in the ground truth because it is occluded), the prediction will not affect the mean hit rate calculation. In other words, n is computed as the intersection of the limb categories in the ground truth and in the predictions.

Participant methods will be evaluated upon hit rate (HR) detection of limbs. That is, for each limb n at each image i, a hit will be computed if J_{i,n}≥0.5. Then, the mean hit rate among all limbs for all images will be computed (where all limb detections will have the same weight) and the participant with the highest mean hit rate will be the winner.


In some images a limb may not be labelled in the ground truth because of occlusions. In that case, where n<14, participants must not provide any prediction for that particular limb. An example of the mean hit rate calculation for n=3 limbs and i=1 image is shown next.


This figure shows the mean hit rate and Jaccard Index calculation for a sample with n=3 limbs and i=1 image. In the top part of the image the Jaccard Index for the head limb is computed. Since it is greater than 0.5, it counts as a hit for image i and the head limb. Similarly, for the torso limb the Jaccard Index obtained is 0.72 (center part of the image), which also counts as a hit. The bottom part of the image shows the Jaccard Index obtained for the left thigh limb, which does not count as a hit since 0.04<0.5. Finally, the mean hit rate is obtained for those three limbs.
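The calculation described above can be sketched in a few lines of NumPy. This is only an illustration, not the official evaluation code: the function names `jaccard_index` and `mean_hit_rate` are hypothetical, the toy masks are made up, and the sketch assumes that any non-zero pixel counts as a ‘1’ pixel and that two empty masks give no overlap.

```python
import numpy as np


def jaccard_index(gt_mask, pred_mask):
    """Jaccard Index of two binary masks with the same shape."""
    # Assumption: any non-zero pixel counts as a '1' pixel.
    gt = np.asarray(gt_mask) > 0
    pred = np.asarray(pred_mask) > 0
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 0.0  # both masks empty: treated here as no overlap
    return float(np.logical_and(gt, pred).sum()) / float(union)


def mean_hit_rate(mask_pairs, threshold=0.5):
    """Fraction of (ground truth, prediction) pairs with J >= threshold."""
    hits = [jaccard_index(gt, pred) >= threshold for gt, pred in mask_pairs]
    return sum(hits) / float(len(hits)) if hits else 0.0


# Toy 4x4 masks (hypothetical data, not taken from the dataset).
gt = np.zeros((4, 4), dtype=np.uint8)
gt[0:2, :] = 1                      # 8 ground-truth pixels
good = np.zeros((4, 4), dtype=np.uint8)
good[0:2, 0:3] = 1                  # 6 pixels, all inside the ground truth
bad = np.zeros((4, 4), dtype=np.uint8)
bad[1:4, 3] = 1                     # 3 pixels, only 1 inside the ground truth

print(jaccard_index(gt, good))      # 6/8 = 0.75, a hit
print(jaccard_index(gt, bad))       # 1/10 = 0.1, not a hit
print(mean_hit_rate([(gt, good), (gt, bad)]))  # 1 hit out of 2 = 0.5
```

Only the limbs that are labelled in the ground truth should be passed as pairs, so that predictions for unlabelled limbs do not enter the mean.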


1.3 Data download and description:

Access to the data is password protected. Register and accept the terms and conditions from Codalab competition server to get the authentication information. Choose one of the mirrors and download the files at: http://sunai.uoc.edu/chalearnLAP, tab "Track Information"

The data is organized as a set of sequences, each one uniquely identified by a string SeqXX, where XX is a two-digit integer. Each sequence is provided as a single ZIP file named with its identifier (e.g. SeqXX.zip).

Each sample ZIP file contains the following files:

  • /imagesjpg: Set of RGB images composing the sequence. Each RGB file name denotes the sequence and the frame number of the image (XX_YYYY.jpg denotes frame YYYY of sequence XX).
  • /masksjpg: For each RGB image in the /imagesjpg folder we define 14 binary masks which denote the region in which a certain limb is positioned. Each binary mask file name follows the pattern XX_YYYY_W_Z.jpg, where XX denotes the sequence, YYYY denotes the frame, W denotes the actor in the sequence (1 if it is in the left part of the image, 2 if it is in the right part) and Z denotes the limb number (following the ordering defined in the figure above).
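As an illustration of the naming pattern, the following sketch splits a mask file name into its four fields. The helper `parse_mask_name` is hypothetical and not part of the provided scripts, and the digit widths in the regular expression (two digits for XX, four for YYYY, one for W, one or two for Z) are assumptions inferred from the pattern above.

```python
import re

# Assumed layout: XX_YYYY_W_Z.jpg (digit widths are an assumption).
MASK_NAME = re.compile(r"^(\d{2})_(\d{4})_(\d)_(\d{1,2})\.jpg$")


def parse_mask_name(file_name):
    """Return the sequence, frame, actor and limb encoded in a mask name."""
    match = MASK_NAME.match(file_name)
    if match is None:
        raise ValueError("Unexpected mask file name: %s" % file_name)
    sequence, frame, actor, limb = (int(g) for g in match.groups())
    return {"sequence": sequence, "frame": frame, "actor": actor, "limb": limb}


# Hypothetical file name used only as an example.
info = parse_mask_name("03_0127_2_14.jpg")
print(info["sequence"], info["frame"], info["actor"], info["limb"])
```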


1.4 Data access scripts:

In the file ChalearnLAPSample.py there is a class PoseSample that gives access to all the information of a sample. In order to open a sample file, use the constructor with the ZIP file you want to use:

>> from ChalearnLAPSample import PoseSample

>> poseSample = PoseSample("SeqXX.zip")

With the given object you can access the general information of the sample. For instance, to get the number of frames:

>> numFrames=poseSample.getNumFrames()

Additionally, we can access the information of any frame. For instance, to access the RGB information for the 10th frame, we use:

>> rgb=poseSample.getRGB(10)

To visualize information of a frame, you can use this code:

import cv2
from ChalearnLAPSample import PoseSample

# Open the sample and choose which actor and limb to display
poseSample = PoseSample("Seqxx.zip")
actorid=1
limbid=2

cv2.namedWindow("Seqxx",cv2.WINDOW_NORMAL)
cv2.namedWindow("Torso",cv2.WINDOW_NORMAL)

# Show each RGB frame together with the mask of the selected limb
for x in range(1, poseSample.getNumFrames()):
    img=poseSample.getRGB(x)
    torso=poseSample.getLimb(x,actorid,limbid)

    cv2.imshow("Seqxx",img)
    cv2.imshow("Torso",torso)

    cv2.waitKey(1)

del poseSample
cv2.destroyAllWindows()


1.5 Evaluation scripts:

In the file ChalearnLAPEvaluation.py there are some methods for evaluation. The first important method exports the labels of a set of frames into a ground truth folder, to be used to get the final overlap value. Let's assume that you use the sequences 1 to 3 for validation purposes, and have a folder valSamples with the files Seq01.zip to Seq03.zip as you downloaded them from the training data set. We can create a ground truth folder gtData using:

>> from ChalearnLAPEvaluation import exportGT_Pose

>> exportGT_Pose(valSamples,gtData)

This method exports the label files and data files for each sample in the valSamples folder to the gtData folder. This new ground truth folder will be used by the evaluation methods.

For each RGB image, we need to store the binary mask predictions in JPG files in the same format in which the ground truth is provided, that is, one binary JPG file for each limb category, for each RGB image and for each actor. This file must be named XX_YYYY_W_Z_prediction.jpg, where XX denotes the sequence, YYYY denotes the frame, W denotes the actor in the sequence (1 if it is in the left part of the image, 2 if it is in the right part) and Z denotes the limb number. To make it easy, the class PoseSample allows storing this information for a given sample. Following the example from the last section, we can store the predictions for a sample using:

>> from ChalearnLAPSample import PoseSample

>> poseSample = PoseSample("SeqXX.zip")

Now, if our prediction is that the head (limbid = 1) of the first actor in the scene has not been detected in the first frame of the sequence, and we want to store the predictions in a certain folder valPredict, we can use the following code (the arguments are, in order, the predicted mask, the frame number, the actor ID, the limb ID and the output folder):

>> import numpy

>> poseSample = PoseSample("SeqXX.zip")

>> im1=numpy.zeros((360,480))

>> poseSample.exportPredictions(im1,1,1,1,valPredict)

Assuming the previously defined paths and objects, to evaluate the overlap for a single labelled sample prediction, that is, a prediction for a sample from a set where labels are provided, we can use:

>> overlap=poseSample.evaluate(valPredict)

Finally, to obtain the final score for all the predictions, in the same way as it is computed on the Codalab platform, we use:

>> from ChalearnLAPEvaluation import evalPose

>> score=evalPose(valPredict,gtData)




2. Challenge description and rules for Track 2

The focus of this track is on action/interaction recognition on RGB data, providing for training a labeled database of 235 action performances from 17 users corresponding to 11 action categories: Wave, Point, Clap, Crouch, Jump, Walk, Run, Shake Hands, Hug, Kiss, and Fight.



2.1 Track stages:
  • Development phase: Create a learning system capable of learning a human action recognition problem from several annotated training action performances. Practice with development data (a database of 150 manually labelled action performances is available) and submit predictions on-line on validation data (90 labelled action performances) to get immediate feedback on the leaderboard. Recommended: towards the end of the development phase, submit your code for verification purposes.

  • Final evaluation phase: Make predictions on the new final evaluation data (95 performances) revealed at the end of the development phase. The participants will have a few days to train their systems and upload their predictions.
We highly recommend that the participants take advantage of this opportunity and upload regularly updated versions of their code during the development period. Their last code submission before deadline will be used for the verification.


2.2 Evaluation Measurement:

The metrics for the Chalearn LAP 2014 Track 2: Action Recognition on RGB challenge will follow the trend in Track 1, evaluating the recognition performance (action/interaction spotting evaluation) using the Jaccard Index. Thus, for each one of the n≤11 action categories labelled for each RGB sequence s, the Jaccard Index is defined as follows:

J_(s,n) = |A_(s,n) ∩ B_(s,n)| / |A_(s,n) ∪ B_(s,n)|

where A_(s,n) is the ground truth of action n at sequence s, and B_(s,n) is the prediction for such an action at sequence s. A_(s,n) and B_(s,n) are binary vectors where 1-value entries denote frames in which the n-th action is being performed. 

In the case of false positives (e.g., predicting an action not labelled in the ground truth), the Jaccard Index will automatically be 0 for that action prediction, and that action class will count in the mean Jaccard Index computation. In other words, n equals the intersection of the action categories appearing in the ground truth and in the predictions.

Participants will be evaluated based on the mean Jaccard Index among all action categories for all sequences, where all action categories are independent but not mutually exclusive (in a certain frame more than one action class can be active). In addition, when computing the mean Jaccard Index all action categories will have the same importance. Finally, the participant with the highest mean Jaccard Index will be the winner.

An example of the calculation for a single sequence and two action categories is shown next.
As shown in the top of the figure, in the case of action/interaction spotting, the ground truth annotations of different action categories can overlap (appear at the same time within the sequence). Also, note that if different actors appear within the sequence at the same time, actions are labelled in the corresponding periods of time (that may overlap) but without needing to identify the actors in the scene, just the 11 action/interaction categories.


This example shows the mean Jaccard Index calculation for different instances of action categories in a sequence (single red lines denote ground truth annotations and double red lines denote predictions). In the top part of the image one can see the ground truth annotations for the actions walk and fight at sequence s. In the center part of the image a prediction for the action walk is evaluated, obtaining a Jaccard Index of 0.72. In the bottom part of the image the same procedure is performed with the action fight, and the obtained Jaccard Index is 0.46. Finally, the mean Jaccard Index is computed, obtaining a value of 0.59.
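The frame-level overlap described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the official evaluation code: the helpers `frames_of` and `sequence_jaccard` are hypothetical, start and end frames are assumed to be inclusive, the intervals are made up, and a category that appears only in the predictions is scored 0 and kept in the mean, following the false positive rule above.

```python
def frames_of(intervals):
    """Set of frame numbers covered by (start, end) intervals, ends inclusive."""
    frames = set()
    for start, end in intervals:
        frames.update(range(start, end + 1))
    return frames


def sequence_jaccard(gt, pred):
    """Jaccard Index per action ID for one sequence.

    gt and pred map an action ID to a list of (start, end) intervals.
    """
    scores = {}
    for action in set(gt) | set(pred):
        a = frames_of(gt.get(action, []))
        b = frames_of(pred.get(action, []))
        union = a | b
        scores[action] = len(a & b) / float(len(union)) if union else 0.0
    return scores


# Hypothetical annotations and predictions for a single sequence.
gt = {1: [(101, 200)], 5: [(250, 349)]}
pred = {1: [(101, 175)], 5: [(300, 399)]}

scores = sequence_jaccard(gt, pred)
mean_jaccard = sum(scores.values()) / len(scores)
print(scores[1])        # 75 shared frames / 100 frames in the union = 0.75
print(scores[5])        # 50 shared frames / 150 frames in the union
print(mean_jaccard)
```

Working with sets of frame numbers is equivalent to comparing the binary vectors A_(s,n) and B_(s,n) entry by entry, and it handles several instances of the same action in one sequence without extra code.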


2.3 Data download and description:

Access to the data is password protected. Register and accept the terms and conditions from Codalab competition server to get the authentication information. Choose one of the mirrors and download the files at: http://sunai.uoc.edu/chalearnLAP, tab "Track Information"

The data is organized as a set of sequences, each one uniquely identified by a string SeqXX, where XX is a two-digit integer. Each sequence is provided as a single ZIP file named with its identifier (e.g. SeqXX.zip).

Each sample ZIP file contains the following files:

  • SeqXX_color.mp4: Video with the RGB data.
  • SeqXX_data.csv: CSV file with the number of frames of the video.
  • SeqXX_labels: CSV file with the ground truth for the sample (only for labelled data sets). Each line corresponds to an action instance. The information provided is the actionID, the start frame and the end frame of the action instance. The action identifiers are the ones provided in the table at the beginning of this page.
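For readers who prefer to parse the label files directly, a minimal sketch is given below. It is not part of the provided scripts: the helpers `read_labels` and `to_binary_vector` are hypothetical, and the sketch assumes that each line holds exactly "actionID,startFrame,endFrame" without a header row and that frames are numbered from 1 with inclusive end frames.

```python
import csv


def read_labels(lines):
    """Parse 'actionID,startFrame,endFrame' rows into integer tuples."""
    labels = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        action_id, start, end = (int(value) for value in row[:3])
        labels.append((action_id, start, end))
    return labels


def to_binary_vector(labels, action_id, num_frames):
    """Binary vector with 1 at every frame where action_id is performed."""
    vector = [0] * num_frames
    for label_id, start, end in labels:
        if label_id == action_id:
            for frame in range(start, min(end, num_frames) + 1):
                vector[frame - 1] = 1  # assumption: frames are 1-based
    return vector


# Hypothetical file content; with a real file, pass the open file object.
lines = ["1,102,203", "5,250,325"]
labels = read_labels(lines)
vector = to_binary_vector(labels, 1, 400)
print(labels)
print(sum(vector))  # number of frames labelled with action 1
```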


2.4 Data access scripts:

In the file ChalearnLAPSample.py there is a class ActionSample that gives access to all the information of a sample. In order to open a sample file, use the constructor with the ZIP file you want to use:

>> from ChalearnLAPSample import ActionSample

>> actionSample = ActionSample("SeqXX.zip")

With the given object you can access the general information of the sample. For instance, to get the number of frames:

>> numFrames=actionSample.getNumFrames()

Additionally, we can access the information of any frame. For instance, to access the RGB information for the 10th frame, we use:

>> rgb=actionSample.getRGB(10)

To visualize all the information of a sample, you can use this code:

import cv2
from ChalearnLAPSample import ActionSample

# Open the sample and show its RGB frames one by one
actionSample = ActionSample("Seqxx.zip")
cv2.namedWindow("Seqxx",cv2.WINDOW_NORMAL)
for x in range(1, actionSample.getNumFrames()-1):
    img=actionSample.getRGB(x)
    cv2.imshow("Seqxx",img)
    cv2.waitKey(1)
del actionSample
cv2.destroyAllWindows()


2.5 Evaluation scripts:

In the file ChalearnLAPEvaluation.py there are some methods for evaluation. The first important method exports the labels of a set of samples into a ground truth folder, to be used to get the final overlap value. Let's assume that you use the sequences 1 to 3 for validation purposes, and have a folder valSamples with the files Seq01.zip to Seq03.zip as you downloaded them from the training data set. We can create a ground truth folder gtData using:

>> from ChalearnLAPEvaluation import exportGT_Action

>> exportGT_Action(valSamples,gtData)

This method exports the label files and data files for each sample in the valSamples folder to the gtData folder. This new ground truth folder will be used by the evaluation methods.

For each sample, we need to store the action predictions in a CSV file in the same format in which the labels are provided, that is, one line for each action instance with the actionID, the initial frame and the final frame. This file must be named Seqxx_predictions.csv. To make it easy, the class ActionSample allows storing this information for a given sample. Following the example from the last section, we can store the predictions for a sample using:

>> from ChalearnLAPSample import ActionSample

>> actionSample = ActionSample("SeqXX.zip")

Now, if our predictions are that we have the action 1 from frame 102 to 203 and action 5 from frame 250 to 325, and we want to store predictions in a certain folder valPredict, we can use the following code:

>> actionSample = ActionSample("SeqXX.zip")

>> actionSample.exportPredictions(([1,102,203], [5,250,325]),valPredict)

Assuming the previously defined paths and objects, to evaluate the overlap for a single labelled sample prediction, that is, a prediction for a sample from a set where labels are provided, we can use:

>> overlap=actionSample.evaluate(gtData)

Finally, to obtain the final score for all the predictions, in the same way as it is computed on the Codalab platform, we use:

>> from ChalearnLAPEvaluation import evalAction

>> score=evalAction(valPredict,gtData)




3. Challenge description and rules for Track 3

The focus of this track is on “multiple instance, user independent spotting” of gestures, which means learning to recognize gestures from several instances for each category performed by different users, drawn from a gesture vocabulary of 20 categories. A gesture vocabulary is a set of unique gestures, generally related to a particular task. In this challenge we will focus on the recognition of a vocabulary of 20 Italian cultural/anthropological signs.



3.1 Track stages:
  • Development phase: Create a learning system capable of learning a gesture classification problem from several training examples. Practice with development data (a large database of 7,754 manually labelled gestures is available) and submit predictions on-line on validation data (3,362 labelled gestures) to get immediate feedback on the leaderboard. Recommended: towards the end of the development phase, submit your code for verification purposes.

  • Final evaluation phase: Make predictions on the new final evaluation data (2,742 gestures) revealed at the end of the development phase. The participants will have a few days to train their systems and upload their predictions.
We highly recommend that the participants take advantage of this opportunity and upload regularly updated versions of their code during the development period. Their last code submission before deadline will be used for the verification.

The development data contains the recording of RGB-Depth data and user mask and skeleton information of 7,754 gesture instances from a vocabulary of 20 gesture categories of Italian signs. Next we list the types of gestures, represented by a numeric label (from 1 to 20), together with the number of training performances recorded for each gesture in brackets:

1. 'vattene' (389)
2. 'vieniqui' (390)
3. 'perfetto' (388)
4. 'furbo' (388)
5. 'cheduepalle' (389)
6. 'chevuoi' (387)
7. 'daccordo' (372)
8. 'seipazzo' (388)
9. 'combinato' (388)
10. 'freganiente' (387)
11. 'ok' (391)
12. 'cosatifarei' (388)
13. 'basta' (388)
14. 'prendere' (390)
15. 'noncenepiu' (383)
16. 'fame' (391)
17. 'tantotempo' (389)
18. 'buonissimo' (391)
19. 'messidaccordo' (389)
20. 'sonostufo' (388)

These labels constitute the ground truth and are provided with the development data, for which a soft segmentation annotation has been performed.

Then, in the data used for evaluation (called validation data and final evaluation data with 3,362 and 2,742 Italian gestures, respectively), you get several video clips with annotated gesture labels for training purposes. Multiple gesture instances from several users will be available. You must predict the labels of the gestures played in the other unlabeled videos.


3.2 Evaluation Measurement:

The metrics for the Chalearn LAP 2014 Track 3: Multimodal Gesture Recognition will follow the trend in Track 1 and Track 2, evaluating the recognition performance (for gesture spotting evaluation) using the Jaccard Index in the same manner as in Track 2. Thus, for each one of the n≤20 gesture categories labelled for each RGBD sequence s, the Jaccard Index is defined as follows:

J_(s,n) = |A_(s,n) ∩ B_(s,n)| / |A_(s,n) ∪ B_(s,n)|

where A_(s,n) is the ground truth of gesture n at sequence s, and B_(s,n) is the prediction for such a gesture at sequence s. As in Track 2, A_(s,n) and B_(s,n) are binary vectors where 1-value entries denote frames in which the n-th gesture is being performed.

In the case of false positives (e.g., predicting a gesture that is not annotated in the ground truth), the Jaccard Index will automatically be 0 for that gesture prediction, and that gesture class will count in the mean Jaccard Index computation. In other words, n equals the intersection of the gesture categories appearing in the ground truth and in the predictions.

As in Track 2, participants will be evaluated based on the mean Jaccard Index among all gesture categories for all sequences, where all gesture categories are independent but not mutually exclusive (in a certain frame more than one gesture class can be active). In addition, when computing the mean Jaccard Index all gesture categories will have the same importance. Finally, the participant with the highest mean Jaccard Index will be the winner. Please see the example provided in the evaluation measurement section in Track 2.


3.3 Data download and description:

Access to the data is password protected. Register and accept the terms and conditions from Codalab competition server to get the authentication information. Choose one of the mirrors and download the files at: http://sunai.uoc.edu/chalearnLAP, tab "Track Information"

The data is organized as a set of samples, each one uniquely identified by a string SampleXXXX, where XXXX is a four-digit integer. Each sample is provided as a single ZIP file named with its identifier (e.g. SampleXXXX.zip). Each sample ZIP file contains the following files:

  • SampleXXXX_color.mp4: Video with the RGB data.
  • SampleXXXX_depth.mp4: Video with the Depth data.
  • SampleXXXX_user.mp4: Video with the user segmentation mask.
  • SampleXXXX_data.csv: CSV file with general information about the video (number of frames, fps, and the maximum depth value).
  • SampleXXXX_skeleton.csv: CSV file with the skeleton information for each frame of the videos. Each line corresponds to one frame. Skeletons are encoded as a sequence of joints, providing 9 values per joint [Wx, Wy, Wz, Rx, Ry, Rz, Rw, Px, Py] (W are the world coordinates, R the rotation values and P the pixel coordinates). The order of the joints in the sequence is: 1.HipCenter, 2.Spine, 3.ShoulderCenter, 4.Head, 5.ShoulderLeft, 6.ElbowLeft, 7.WristLeft, 8.HandLeft, 9.ShoulderRight, 10.ElbowRight, 11.WristRight, 12.HandRight, 13.HipLeft, 14.KneeLeft, 15.AnkleLeft, 16.FootLeft, 17.HipRight, 18.KneeRight, 19.AnkleRight, and 20.FootRight.
  • SampleXXXX_labels: CSV file with the ground truth for the sample (only for labelled data sets). Each line corresponds to a gesture. The information provided is the gestureID, the initial frame and the last frame. The gesture identifiers are the ones provided in the gesture table at the beginning of this page.
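To make the skeleton encoding concrete, the sketch below splits one line of the skeleton file into its 20 joints. The helper `parse_skeleton_row` is hypothetical and not part of the provided scripts, and the sketch assumes that every line is a flat list of 20 x 9 = 180 values stored in the joint order listed above; the row used in the example is synthetic.

```python
JOINT_NAMES = [
    "HipCenter", "Spine", "ShoulderCenter", "Head",
    "ShoulderLeft", "ElbowLeft", "WristLeft", "HandLeft",
    "ShoulderRight", "ElbowRight", "WristRight", "HandRight",
    "HipLeft", "KneeLeft", "AnkleLeft", "FootLeft",
    "HipRight", "KneeRight", "AnkleRight", "FootRight",
]
VALUES_PER_JOINT = 9  # Wx, Wy, Wz, Rx, Ry, Rz, Rw, Px, Py


def parse_skeleton_row(values):
    """Split one flat row into (world, rotation, pixel) triples per joint."""
    expected = len(JOINT_NAMES) * VALUES_PER_JOINT
    if len(values) != expected:
        raise ValueError("Expected %d values, got %d" % (expected, len(values)))
    joints = {}
    for index, name in enumerate(JOINT_NAMES):
        first = index * VALUES_PER_JOINT
        chunk = [float(v) for v in values[first:first + VALUES_PER_JOINT]]
        joints[name] = (chunk[0:3], chunk[3:7], chunk[7:9])
    return joints


# Synthetic row: the values 0..179 stand in for one line of the CSV file.
row = [str(v) for v in range(180)]
world, rotation, pixel = parse_skeleton_row(row)["Head"]
print(world)     # Head is the 4th joint, so its values start at position 27
print(rotation)
print(pixel)
```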

3.4 Data access scripts:

In the file ChalearnLAPSample.py there is a class GestureSample that gives access to all the information of a sample. In order to open a sample file, use the constructor with the ZIP file you want to use:

>> from ChalearnLAPSample import GestureSample

>> gestureSample = GestureSample("SampleXXXX.zip")

With the given object you can access the general information of the sample. For instance, to get the number of frames, the fps or the maximum depth value:

>> numFrames=gestureSample.getNumFrames()

>> fps=gestureSample.getFPS()

>> maxDepth=gestureSample.getMaxDepth()

Additionally, we can access the information of any frame. For instance, to access the RGB, depth, and user segmentation information for the 10th frame, we use:

>> rgb=gestureSample.getRGB(10)

>> depth=gestureSample.getDepth(10)

>> user=gestureSample.getUser(10)

Finally, we can access an object that encodes the skeleton information in the same way:

>> skeleton=gestureSample.getSkeleton(10)

Some functionality is provided to read the skeleton information. For each joint, the [Wx, Wy, Wz, Rx, Ry, Rz, Rw, Px, Py] description array is stored in a dictionary as three independent vectors. You can access each value for each joint (e.g. the head) as follows:

>> [Wx, Wy, Wz]=skeleton.getAllData()['Head'][0]

>> [Rx, Ry, Rz, Rw]=skeleton.getAllData()['Head'][1]

>> [Px, Py]=skeleton.getAllData()['Head'][2]

The same information can be retrieved using the specific methods:

>> [Wx, Wy, Wz]=skeleton.getWorldCoordinates()['Head']

>> [Rx, Ry, Rz, Rw]=skeleton.getJoinOrientations()['Head']

>> [Px, Py]=skeleton.getPixelCoordinates()['Head']

Additionally, some visualization functionalities are provided. You can get an image representation of the skeleton or a composition of all the information for a frame.

>> skelImg=gestureSample.getSkeletonImage(10)

>> frameData=gestureSample.getComposedFrame(10)

To visualize all the information of a sample, you can use this code:

import cv2
from ChalearnLAPSample import GestureSample

# Open the sample and show the composed view of each frame
gestureSample = GestureSample("Samplexxxx.zip")
cv2.namedWindow("Samplexxxx",cv2.WINDOW_NORMAL)
for x in range(1, gestureSample.getNumFrames()):
    img=gestureSample.getComposedFrame(x)
    cv2.imshow("Samplexxxx",img)
    cv2.waitKey(1)
del gestureSample
cv2.destroyAllWindows()


3.5 Evaluation scripts:

In the file ChalearnLAPEvaluation.py there are some methods for evaluation. The first important method exports the labels of a set of samples into a ground truth folder, to be used to get the final overlap value. Let's assume that you use the samples 1 to 10 for validation purposes, and have a folder valSamples with the files Sample0001.zip to Sample0010.zip as you downloaded them from the training data set. We can create a ground truth folder gtData using:

>> from ChalearnLAPEvaluation import exportGT_Gesture

>> exportGT_Gesture(valSamples,gtData)

This method exports the label files and data files for each sample in the valSamples folder to the gtData folder. This new ground truth folder will be used by the evaluation methods.

For each sample, we need to store the gesture predictions in a CSV file in the same format that labels are provided, that is, a line for each gesture with the gestureID, the initial frame and the final frame. This file must be named as Samplexxxx_predictions.csv. To make it easy, the class GestureSample allows to store this information for a given sample. Following the example from last section, we can store the predictions for sample using:

>> from ChalearnLAPSample import GestureSample

>> gestureSample = GestureSample("SampleXXXX.zip")

Now, if our predictions are that we have the gesture 1 from frame 102 to 203 and gesture 5 from frame 250 to 325, and we want to store predictions in a certain folder valPredict, we can use the following code:

>> gestureSample = GestureSample("SampleXXXX.zip")

>> gestureSample.exportPredictions(([1,102,203], [5,250,325]),valPredict)

Assuming the previously defined paths and objects, to evaluate the overlap for a single labelled sample prediction, that is, a prediction for a sample from a set where labels are provided, we can use:

>> overlap=gestureSample.evaluate(([1,102,203], [5,250,325]))

Finally, to obtain the final score for all the predictions, in the same way as it is computed on the Codalab platform, we use:

>> from ChalearnLAPEvaluation import evalGesture

>> score=evalGesture(valPredict,gtData)