Please review the NEW RULES (updated because of Kaggle sponsorship).
You can download the development and validation data from here.
1. Challenge description and rules
The focus of the challenge is on “multiple instance, user independent learning” of gestures, which means learning to recognize gestures from several instances for each category performed by different users, drawn from a gesture vocabulary of 20 categories. A gesture vocabulary is a set of unique gestures, generally related to a particular task. In this challenge we will focus on the recognition of a vocabulary of 20 Italian cultural/anthropological signs.
We highly recommend that the participants take advantage of this opportunity and upload regularly updated versions of their code during the development period. Their last code submission before the deadline will be used for the verification.
What do you need to predict?
The development data contains the recording of multi-modal RGB-Depth-Audio data and user mask and skeleton information of 7,754 gesture instances from a vocabulary of 20 gesture categories of Italian signs. Next we list the types of gestures, represented by a numeric label (from 1 to 20), together with the number of training performances recorded for each gesture in brackets:
1. 'vattene' (389)
2. 'vieniqui' (390)
3. 'perfetto' (388)
4. 'furbo' (388)
5. 'cheduepalle' (389)
6. 'chevuoi' (387)
7. 'daccordo' (372)
8. 'seipazzo' (388)
9. 'combinato' (388)
10. 'freganiente' (387)
11. 'ok' (391)
12. 'cosatifarei' (388)
13. 'basta' (388)
14. 'prendere' (390)
15. 'noncenepiu' (383)
16. 'fame' (391)
17. 'tantotempo' (389)
18. 'buonissimo' (391)
19. 'messidaccordo' (389)
20. 'sonostufo' (388)
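For reference, the vocabulary above can be encoded as a simple lookup table (a convenience sketch, not part of the provided tools); as a sanity check, the per-gesture counts sum to the 7,754 development instances mentioned above:

```python
# Gesture label -> (name, number of development instances), as listed above.
GESTURES = {
    1: ("vattene", 389),        2: ("vieniqui", 390),
    3: ("perfetto", 388),       4: ("furbo", 388),
    5: ("cheduepalle", 389),    6: ("chevuoi", 387),
    7: ("daccordo", 372),       8: ("seipazzo", 388),
    9: ("combinato", 388),     10: ("freganiente", 387),
   11: ("ok", 391),            12: ("cosatifarei", 388),
   13: ("basta", 388),         14: ("prendere", 390),
   15: ("noncenepiu", 383),    16: ("fame", 391),
   17: ("tantotempo", 389),    18: ("buonissimo", 391),
   19: ("messidaccordo", 389), 20: ("sonostufo", 388),
}

# The counts add up to the 7,754 gesture instances in the development data.
assert sum(n for _, n in GESTURES.values()) == 7754
```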
These labels constitute the ground truth and are provided with the development data, for which a soft segmentation annotation has been performed.
Then, in the data used for evaluation (called validation data and final evaluation data, with 3,362 and 2,742 Italian gestures, respectively), you get several video clips with annotated gesture labels for training purposes. Multiple gesture instances from several users will be available. You must predict the labels of the gestures played in the remaining unlabeled videos.
For each video, participants should provide an ordered list R of labels corresponding to the recognized gestures, with exactly one label per recognized gesture. We will compare this list to the corresponding list T of labels in the prescribed list of gestures that the user had to play. These are the "true" gesture labels.
For evaluation, we consider the so-called Levenshtein distance L(R, T), that is, the minimum number of edit operations (substitutions, insertions, or deletions) that one has to perform to go from R to T (or vice versa). The Levenshtein distance is also known as the "edit distance". For example:
L([1 2 4], [3 2]) = 2
L([1], []) = 1
L([2 2 2], [2]) = 2
The overall score we compute is the sum of the Levenshtein distances for all the lines of the result file compared to the corresponding lines in the truth value file, divided by the total number of gestures in the truth value file. This score is analogous to an error rate. However, it can exceed one.
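The score can be reproduced with a standard dynamic-programming edit distance. The following sketch (function names are ours, not the official scoring code) matches the examples above:

```python
def levenshtein(r, t):
    """Minimum number of substitutions, insertions, and deletions to turn r into t."""
    prev = list(range(len(t) + 1))
    for i, ri in enumerate(r, 1):
        curr = [i]
        for j, tj in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete ri
                            curr[j - 1] + 1,             # insert tj
                            prev[j - 1] + (ri != tj)))   # substitute (free on match)
        prev = curr
    return prev[-1]

def challenge_score(predicted, truth):
    """Sum of per-video edit distances divided by the total number of true gestures.

    `predicted` and `truth` are lists of label lists, one entry per video.
    Like an error rate, but it can exceed one.
    """
    total_distance = sum(levenshtein(r, t) for r, t in zip(predicted, truth))
    total_gestures = sum(len(t) for t in truth)
    return total_distance / total_gestures
```

For instance, `levenshtein([1, 2, 4], [3, 2])` returns 2, as in the first example above.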
- Public score means the score that appears on the leaderboard during the development period and is based on the validation data.
- Final score means the score that will be computed on the final evaluation data released at the end of the development period, which will not be revealed until the challenge is over. The final score will be used to rank the participants and determine the prizes.
To verify that the participants complied with the rule that there should be no manual labeling of the test data, the top ranking participants eligible to win prizes will be asked to cooperate with the organizers to reproduce their results.
During the development period the participants can upload executable code reproducing their results together with their submissions. The organizers will evaluate requests to support particular platforms, but do not commit to supporting all platforms. The sooner a version of the code is uploaded, the higher the chances that the organizers will succeed in running it on their platform. The burden of proof will rest on the participants.
The code will be kept in confidence and used only for verification purposes after the challenge is over. The submitted code will need to be standalone; in particular, it will not be allowed to access the Internet. It will need to be capable of training models from the training examples of the final evaluation data, for each data batch, and of making label predictions on the test examples of that batch.
We split the recorded data (which can be downloaded from here) into:
- development data: fully labeled data that can be used for training and validation as desired.
- validation data: a dataset formatted in a similar way as the final evaluation data that can be used to practice making submissions on the Kaggle platform. The results on validation data will show immediately as the "public score" on the leaderboard. The validation data is slightly easier than the development data.
- final evaluation data: the dataset that will be used to compute the final score (will be released shortly before the end of the challenge).
2. Description of data structures
All the data and code can be downloaded from here.
Please, provide your feedback and questions to email@example.com
We provide the individual X_audio.ogg, X_audio.wav, X_color.mp4, X_depth.mp4, and X_user.mp4 files containing the audio, RGB, depth, and user mask videos for a given sequence X. All the sequences are recorded at 20 FPS.
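Given these naming conventions and the fixed 20 FPS rate, the per-sequence files and the frame/time alignment can be handled as follows (the helper names are hypothetical; only the file suffixes and the frame rate come from the description above):

```python
FPS = 20  # all sequences are recorded at 20 frames per second

def modality_files(sample_id):
    """Return the five per-sequence files for a given sequence identifier X."""
    suffixes = ["_audio.ogg", "_audio.wav", "_color.mp4", "_depth.mp4", "_user.mp4"]
    return [sample_id + s for s in suffixes]

def frame_to_seconds(frame_index):
    """Convert a 1-based video frame index to a timestamp in seconds at 20 FPS."""
    return (frame_index - 1) / FPS

def frame_to_audio_sample(frame_index, fs):
    """Map a video frame to the corresponding audio sample index at rate fs (Hz)."""
    return round(frame_to_seconds(frame_index) * fs)
```

For example, `modality_files("Sample00001")` yields the five file names for sequence Sample00001, and frame 21 falls exactly one second into the recording.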
We also provide a script interface, dataViewer, to visualize the samples and export the data to Matlab. If dataViewer is used, do not unzip the sample data. Just type dataViewer at the Matlab prompt.
The implemented options are:
1.- Load the sample data with the Load Data button to load one of the zip files [be patient, it takes some time to load].
2.- Visualize all the multimodal data stored in the sample.
3.- Play the audio file stored in the sample.
4.- Export the multimodal data in the following four Matlab structures:
• NumFrames: Total number of frames.
• FrameRate: Frame rate of a video.
• Audio: structure that contains audio data.
• y: audio data
• fs: sample rate for the data
• Labels: structure that contains the data about labels.
• Name: The name given to this gesture.
• Begin: It indicates the start frame of the gesture.
• End: It indicates the ending frame of the gesture.
• Label names are the 20 Italian gesture names listed in Section 1.
After exporting, an individual MAT file for each frame (Sample00001_X.mat, where X indicates the frame number) is generated, containing the following structures:
The detailed descriptions about each one of these structures are explained in the following section.
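Once exported, the Labels annotations can be turned into time segments for each gesture instance. A minimal sketch, assuming the Name, Begin, and End fields described above are accessed as plain records and that frame indices are 1-based:

```python
def gesture_segments(labels, frame_rate):
    """Return (name, begin_seconds, end_seconds) for each annotated gesture.

    `labels` mimics the exported Labels structure: a sequence of records with
    the Name, Begin, and End fields (Begin/End are 1-based frame indices).
    """
    segments = []
    for g in labels:
        begin_s = (g["Begin"] - 1) / frame_rate  # first frame starts at t = 0
        end_s = g["End"] / frame_rate            # segment ends when its last frame ends
        segments.append((g["Name"], begin_s, end_s))
    return segments
```

For example, a gesture spanning frames 1 to 20 at 20 FPS covers the first second of the sequence.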
On the generated MAT files
A generated MAT file stores the selected visual data in four structures named 'RGB', 'Depth', 'UserIndex', and 'Skeleton'. The data of each structure is aligned with the RGB data. The following subsections describe each of these structures.
• RGB Frame: This matrix represents the RGB color image.
• Depth Frame: The Depth matrix contains the pixel-wise z-component of the point cloud. The depth value is expressed in millimeters.
• User Index: The UserIndex matrix represents the player index of each pixel of the depth image. A non-zero pixel value signifies that a tracked subject occupies the pixel; a value of 0 denotes that no tracked subject occupies the pixel.
• Skeleton Frame: The Skeletons array contains one Skeleton structure per tracked subject, holding the joint positions and bone orientations that comprise a skeleton. The format of a Skeleton structure is given below.
• JointsType can be one of the 20 Kinect skeleton joints: HipCenter, Spine, ShoulderCenter, Head, ShoulderLeft, ElbowLeft, WristLeft, HandLeft, ShoulderRight, ElbowRight, WristRight, HandRight, HipLeft, KneeLeft, AnkleLeft, FootLeft, HipRight, KneeRight, AnkleRight, FootRight.
• WorldPosition: The world coordinates position structure represents the global position of a tracked joint.
• The X value represents the x-component of the subject’s global position (in millimeters).
• The Y value represents the y-component of the subject’s global position (in millimeters).
• The Z value represents the z-component of the subject’s global position (in millimeters).
• PixelPosition: The pixel coordinates position structure represents the position of a tracked joint. The format of the Position structure is given as follows.
• The X value represents the x-component of the joint location (in pixels).
• The Y value represents the y-component of the joint location (in pixels).
• The Z value represents the z-component of the joint location (in pixels).
• WorldRotation: The world rotation structure contains the orientations of skeletal bones in terms of absolute transformations, i.e., the orientation of each bone in the 3D camera space. It is a 20x4 matrix, where each row contains the W, X, Y, Z values of the quaternion related to the rotation.
• The X value represents the x-component of the quaternion.
• The Y value represents the y-component of the quaternion.
• The Z value represents the z-component of the quaternion.
• The W value represents the w-component of the quaternion.
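Each WorldRotation row (W, X, Y, Z) is a unit quaternion. To apply it, for example to rotate joint offsets into camera space, it can be converted to a 3x3 rotation matrix using the standard formula (a sketch, not part of the provided tools):

```python
def quat_to_matrix(w, x, y, z):
    """Convert a unit quaternion (W, X, Y, Z) to a 3x3 rotation matrix."""
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]
```

The identity quaternion (1, 0, 0, 0) maps to the identity matrix, and (0, 0, 0, 1) to a 180-degree rotation about the z-axis.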