sample-code

Baseline training

(single frame per video, without audio)


Take the first frame of each video and run it through a deep convolutional network, BVLC's CaffeNet (link), originally trained for image category classification. Then take the second-to-last layer output, a 1000-dimensional feature vector for each video frame, and perform random forest regression on it. Finally, use Mean Square Error, R^2, and mean accuracy as error metrics.
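The regression step described above can be sketched as follows. This is a minimal illustration, not the challenge's actual pipeline: random vectors stand in for the extracted CNN features, the scikit-learn estimator and its parameters are assumptions, and the challenge metric is assumed to be the mean accuracy 1 - mean(|prediction - ground truth|).

```python
# Sketch of the baseline regression step on stand-in data.
# Assumes 1000-d CNN features were already extracted per frame.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.RandomState(0)
X_train = rng.rand(200, 1000)   # one CNN feature vector per training frame
y_train = rng.rand(200)         # ground-truth trait score in [0, 1]
X_val = rng.rand(50, 1000)
y_val = rng.rand(50)

# Random forest regression over the CNN features
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_val)

# Error metrics: MSE, R^2, and (assumed) challenge mean accuracy
mse = mean_squared_error(y_val, pred)
r2 = r2_score(y_val, pred)
mean_acc = 1.0 - np.mean(np.abs(y_val - pred))
print(mse, r2, mean_acc)
```

On random features the scores are meaningless; the point is only the shape of the pipeline (features in, one regressor and three metrics per trait out).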


Detailed instructions for reproducing the baseline results are available on GitHub in this [repo]


The baseline results achieved are as follows:


Trait | Mean Square Error | R^2    | Challenge Metric (meanAccs)
------|-------------------|--------|----------------------------
1     | 0.0192            | 0.2124 | 0.8711
2     | 0.0214            | 0.2843 | 0.8933
3     | 0.0199            | 0.2255 | 0.8623
4     | 0.0180            | 0.1321 | 0.8717
5     | 0.0213            | 0.2088 | 0.8760



Multi-modal baseline training

(including additional frames per video and audio features)


The multi-modal baseline adds features from other channels to the previous baseline. By incorporating additional video frames and MFCC audio features, we capture both the small temporal variability of the faces and the audio modality of each video clip.


To do so, consider, for each video, several (5) consecutive image frames and aggregate them with the MFCC features for that video using a fusion strategy. Since the BVLC CaffeNet does not allow computing the CNN filters over an early fusion of the image and audio modalities, one can perform late fusion simply by concatenating the audio features with the feature vector of the network's last layer output (just before performing the regression). This multi-modal baseline allows, first, measuring the improvement obtained when considering multimodal data and, second, showing participants a simple way audio features can be combined with the features obtained from the deep network.
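The late-fusion step can be sketched as below. The aggregation by averaging over time is an assumption (the text does not fix a specific fusion strategy), and the dimensions (1000-d CNN features, 13 MFCC coefficients per audio frame) are illustrative stand-ins.

```python
# Sketch of late fusion for one video: pool each modality over time,
# then concatenate into a single vector for the regressor.
import numpy as np

rng = np.random.RandomState(0)
frame_feats = rng.rand(5, 1000)   # 5 consecutive CNN frame features (1000-d each)
mfcc_feats = rng.rand(320, 13)    # variable-length MFCC matrix for the same video

video_vec = frame_feats.mean(axis=0)   # (1000,) - average over the 5 frames
audio_vec = mfcc_feats.mean(axis=0)    # (13,)   - average over audio frames
fused = np.concatenate([video_vec, audio_vec])
print(fused.shape)  # (1013,)
```

The fused vector then replaces the per-frame CNN vector as the input to the random forest regression.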


MFCC features (link) consist of one multidimensional vector per audio frame of each video. Here is the sample code used (Python):



###############################################################

import os
from features import mfcc
from features import logfbank
import scipy.io.wavfile as wav
from subprocess import call

# Read the training video files
os.chdir(<pathToTrainingVideos>)
fnames = [f for f in os.listdir(".") if f.endswith('.mp4')]
print(len(fnames))

# Extract the audio track of each video into a .wav file with ffmpeg
for f in fnames:
    f_out = f[:-4] + ".wav"
    if not os.path.isfile(f_out):
        command = "ffmpeg -i " + f + " -ab 160k -ac 2 -ar 44100 -vn " + f_out
        print("Building file " + f_out + " ...")
        call(command, shell=True)

# Compute the MFCC and log filter-bank features of each audio file and
# append them to a list: one matrix of shape (num_frames, num_coeffs)
# per video
os.chdir(<pathToWavFiles>)
fnames = [f for f in os.listdir(".") if f.endswith('.wav')]
print(len(fnames))
audiofeatures = []
for f in fnames:
    (rate, sig) = wav.read(f)
    mfcc_feat = mfcc(sig, rate)
    fbank_feat = logfbank(sig, rate)
    audiofeatures.append(fbank_feat)

############################################################
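Since each video yields a different number of audio frames, the matrices collected in `audiofeatures` have varying heights and must be summarized into fixed-size vectors before regression. One simple option (an assumption here, not prescribed by the baseline) is mean-and-std pooling over time; the 26-coefficient width below is also only illustrative.

```python
# Pool each variable-length feature matrix into one fixed-size vector.
import numpy as np

rng = np.random.RandomState(0)
# Stand-in for the 'audiofeatures' list built above: one
# (num_frames, 26) log filter-bank matrix per video.
audiofeatures = [rng.rand(n, 26) for n in (300, 450, 280)]

# Mean and standard deviation over time, concatenated per video
pooled = np.array([np.concatenate([a.mean(axis=0), a.std(axis=0)])
                   for a in audiofeatures])
print(pooled.shape)  # (3, 52)
```

The resulting rows all have the same length and can be concatenated with the CNN features in the late-fusion step.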


Usually, logarithmic filter-bank energies are computed over these features (the logfbank output above); they contain more coefficients per frame than the MFCCs, so the size of the feature vector slightly increases.


< costs when aggregating these additional features for training / regression? >


< brief description of final results obtained with the validation set, including additional frames and audio features >



