Deep Local Video Feature for Action Recognition

Zhenzhong Lan, Carnegie Mellon University, [email protected]
Yi Zhu, University of California, Merced, [email protected]
Alexander G. Hauptmann, Carnegie Mellon University, [email protected]

Abstract

We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNN/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features, such as how to aggregate them into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling on the sparsely sampled features leads to significant performance improvement.

1. Introduction

Given the success of deep neural networks on image classification, we have been hoping that they can achieve similar improvement on video classification. However, after several years' effort, that hope remains elusive. The major obstacles, we believe, lie in two major differences between images and videos. Compared to images, videos are often much larger in size, and video labels are often much more expensive to obtain. The large data size means that it is difficult to feed a whole video into modern deep CNN/RNN architectures, which often have large memory demands. Expensive video labeling brings difficulties in getting enough labeled data to train a robust network. Recent approaches [11, 17, 18] circumvent these problems by learning on sampled frames or very short video clips (local inputs) with video-level (global) labels.

However, video-level label information can be incomplete or even missing at the frame/clip level. This information mismatch leads to the problem of false label assignment. In other words, the imprecise frame/clip-level labels populated from video labels are too noisy to guide a precise mapping from videos to labels. To deal with this problem, a common practice is to sample multiple frames/clips from a video at testing time and aggregate the prediction scores of these sampled frames/clips to get the final prediction for that video. However, simply averaging the prediction scores, without another level of mapping, is not enough to recover the damage caused by false label assignment.

To further correct the damage caused by false label assignment, we propose to treat the deep networks trained on local inputs as feature extractors. After extracting local features using these pre-trained networks, we aggregate them into global features and train another mapping function (a shallow network) on the same training data to map the global features into global labels.
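The two-stage scheme can be summarized in a few lines of code. The sketch below is only illustrative: `extract_local_features` is a hypothetical stand-in for the frozen two-stream CNN (it returns random vectors here), and the shallow mapping function is a linear SVM; the actual networks, feature dimensions, and classifier settings are those described in Sections 3 and 4.

```python
import numpy as np
from sklearn.svm import SVC

def extract_local_features(video, num_samples=25):
    """Stage 1 stand-in: the CNN trained on frames/clips with video-level labels
    is frozen and returns one feature vector per sampled frame/clip."""
    return np.random.rand(num_samples, 1024)  # placeholder for real CNN outputs

def aggregate(local_features):
    """Local-to-global aggregation, here simple max pooling over time."""
    return local_features.max(axis=0)

# Toy training set: 100 videos with video-level labels from 101 classes.
videos = [f"video_{i}" for i in range(100)]
labels = np.random.randint(0, 101, size=100)

# Stage 2: a shallow model maps the aggregated global features to video labels.
global_features = np.stack([aggregate(extract_local_features(v)) for v in videos])
classifier = SVC(kernel="linear", C=100).fit(global_features, labels)
print(classifier.score(global_features, labels))
```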
Our method is very similar to the fine-tuning practices that have become popular in image classification. The major difference is that the data we use to train our feature extraction networks are local and the labels are noisy due to the false label assignment. Therefore, we rely heavily on the shallow network to correct the mistakes made during local feature learning.

Our method is also similar to the practice of using ImageNet pre-trained networks to extract frame-level (local) features for video classification [20, 6]. The only difference here is that our local feature extractors (deep networks) are trained on the same training data as the classifiers (shallow networks). However, this simple fine-tuning has a great impact on the local features. First, fine-tuning on the video data narrows the domain gap between the feature extractors and the target data. Second, since we use the same training data twice, the local features we extract can be severely overfitted to the training data.

A common practice to deal with the overfitting problem is to use cross-validation. However, in this particular case, we cannot use this method. This is largely because we already lack training data and cannot afford to lose any more of it. It is also because of the difficulties in calibrating features generated from different models. Therefore, we choose to use the whole training set to train the local feature extractors, in the hope that the layers that are far away from the probability layers capture general video information and hence are generalizable enough for further classifier training.

We call this new type of local video feature Deep lOcal Video Features (DOVF).

In summary, DOVF is a kind of local video feature that is extracted from deep neural networks trained on local video clips and global video labels. The major problems we would like to investigate about DOVF are:

• Which layer(s) of features should we extract? Without further investigation, the only thing we know is that we cannot utilize the probability layer, as it severely overfits to the noisy labels and would have a large distribution difference between features of the training and testing sets.

• How should we aggregate the local features into global features? We will test various feature aggregation methods such as mean pooling and Fisher Vectors (FV).

• How densely should we extract the local features? In practice, we would prefer sparse temporal sampling, as it would be much more efficient.

• How complementary is DOVF to traditional local features such as IDT [15]? The more complementary they are, the more room we have to improve by incorporating methodologies that were developed for traditional local features.

In the remainder of this paper, we first provide more background information about video features, with an emphasis on recent attempts at learning them with deep neural networks. We then describe our experimental settings in detail. After that, we evaluate our methods on the HMDB51 and UCF101 datasets. Further discussions, including potential improvements, are given at the end.

2. Related works

New video representation methods are the major sources of breakthroughs for video classification.

In traditional video representations, trajectory-based approaches [15, 4], especially the Dense Trajectory (DT) and IDT [14, 15], are the basis of current state-of-the-art hand-crafted algorithms. These trajectory-based methods are designed to address the flaws of image-extended video features. Their superior performance validates the need for a unique representation of motion features. There have been many studies attempting to improve IDT due to its popularity. Peng et al. [8] enhanced the performance of IDT by increasing codebook sizes and fusing multiple coding methods. Sapienza et al. [10] explored ways to sub-sample and generate vocabularies for DT features. Hoai & Zisserman [3] achieved superior performance on several action recognition datasets by using three techniques: data augmentation, modeling the score distribution over video subsequences, and capturing the relationship among action classes. Fernando et al. [2] modeled the evolution of appearance in the video and achieved state-of-the-art results on the Hollywood2 dataset. [7] proposed to extract features from videos at multiple playback speeds to achieve speed invariance. However, with the rise of deep neural network methods, these traditional methods have gradually been forgotten.

Motivated by this success of CNNs, researchers are working intensely towards developing CNN equivalents for learning video features. Several accomplishments have been reported from using CNNs for action recognition in videos [21, 19, 12]. Karpathy et al. [5] trained deep CNNs on one million weakly labelled YouTube videos and reported moderate success while using them as a feature extractor. Simonyan & Zisserman [11] demonstrated results competitive with IDT [15] by training deep CNNs on both sampled frames and stacked optical flows. Wang et al. [16, 17, 18] present multiple insightful analyses of how to improve two-stream frameworks and report several useful observations, including pre-training two-stream ConvNets, using smaller learning rates, and using deeper networks. With these observations, they finally outperform IDT [15] by a large margin on the UCF101 dataset.
However, all these approaches rely on shot/clip-level predictions to get the final video scores, without using global features.

At the time we wrote this paper, two similar works [1, 9] had been published on arXiv. Both of them propose a new feature aggregation method to pool the local neural network features into global video features. Diba et al. [1] propose a bilinear model to pool the outputs of the last convolutional layers of the pre-trained networks and achieve state-of-the-art results on both the HMDB51 and UCF101 datasets. Qiu et al. [9] propose a new quantization method that is similar to FV and achieves similar performance to [1]. However, neither of them provides a detailed analysis of the local neural network features they use. In this paper, we perform a more extensive analysis and show that a simple max pooling can achieve similar or better results compared to those much more complex feature aggregation methods in [1, 9].

3. Experimental settings

For local feature extraction, we utilize both the VGG16 and Inception-BN networks trained by Wang et al. [17, 18]. We extract the outputs of the last five layers of both networks as our features. Table 1 shows the layer names of both networks and the corresponding feature dimensions. We classify these layers into two classes: fully-connected (FC) layers and convolution (Conv) layers (pooling layers are treated as Conv layers). FC layers have many more parameters and hence are much easier to overfit to the noisy labels than Conv layers. As shown, VGG16 has three FC layers while Inception-BN only has one.

Table 1. Layer names and dimensions of the layers that we examine.

            VGG16                          Inception-BN
  ID    Name      Dimensions  Type     Name          Dimension  Type
  L-1   fc8       101         FC       fc-action     101        FC
  L-2   fc7       4096        FC       global_pool   1024       Conv
  L-3   fc6       4096        FC       inception_5b  50176      Conv
  L-4   pool5     25088       Conv     inception_5a  50176      Conv
  L-5   conv5_3   100352      Conv     inception_4e  51744      Conv

Following the testing scheme of [11, 18], we evenly sample 25 frames and flow clips for each video. For each frame/clip, we do 10x data augmentation by cropping the 4 corners and the center and performing horizontal flipping of the cropped frames. After getting the features of the augmented data for each frame/clip, we average them to get the local feature for that frame/clip. In the end, for each video, we get a total of 25 local features. The dimensions of each local feature are shown in Table 1.
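As a concrete illustration of this sampling and augmentation step, the sketch below evenly samples 25 frames, builds the 10 views (4 corners and the center crop, plus their horizontal flips), and averages the per-view CNN outputs into one local feature per frame. The `cnn` callable, the crop size, and the frame resolution are placeholders, not the exact preprocessing of the released models.

```python
import numpy as np

def sample_indices(num_frames, num_samples=25):
    """Evenly spaced frame indices over the whole video."""
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def ten_crop(frame, size=224):
    """4 corner crops + center crop and their horizontal flips (10 views)."""
    h, w, _ = frame.shape
    corners = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [frame[y:y + size, x:x + size] for y, x in corners]
    return crops + [c[:, ::-1] for c in crops]  # add horizontal flips

def local_feature(frame, cnn):
    """One local feature per frame: average the CNN outputs of the 10 views."""
    return np.mean([cnn(view) for view in ten_crop(frame)], axis=0)

# Placeholder CNN producing a 1024-d global_pool-like vector per view.
cnn = lambda view: np.random.rand(1024)
video = np.random.rand(185, 256, 340, 3)            # fake decoded frames
feats = np.stack([local_feature(video[i], cnn) for i in sample_indices(len(video))])
print(feats.shape)                                  # (25, 1024)
```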
We use several local feature aggregation methods, ranging from simple mean and maximum pooling to more complex feature encoding methods such as Bag of Words (BoW), Vector of Locally Aggregated Descriptors (VLAD), and Fisher Vector (FV) encoding. To incorporate global temporal information, we divide each video into three parts and perform the aggregation for each part separately. For example, to aggregate the 25 local features of one video, we aggregate the first 8, the middle 9, and the last 8 local features separately and concatenate the aggregated features to get the global feature.
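A minimal sketch of this three-part aggregation with max pooling is given below; the 8/9/8 split follows the example above, and the pooling function can be swapped for mean pooling or the Mean_Std variant described in Section 4.2.

```python
import numpy as np

def max_pool(part):
    """Pool one temporal part: element-wise maximum over its local features."""
    return part.max(axis=0)

def temporal_aggregate(local_features, pool=max_pool):
    """Split the 25 local features into the first 8, middle 9, and last 8,
    pool each part separately, and concatenate into one global feature."""
    first, middle, last = np.split(local_features, [8, 17])
    return np.concatenate([pool(first), pool(middle), pool(last)])

local_features = np.random.rand(25, 1024)   # one row per sampled frame/clip
global_feature = temporal_aggregate(local_features)
print(global_feature.shape)                 # (3072,) = 3 parts x 1024 dims
```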
To map the global features into global labels, we use an SVM with a Chi2 kernel and a fixed C=100 as in [7], except for FV and VLAD, where we use a linear kernel as suggested in [13]. To get the two-stream results, we fuse the prediction scores of the spatial net and the temporal net with weights of 1 and 1.5, respectively, as in [18].
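The sketch below shows one way to set up this classification and fusion step with scikit-learn: a precomputed Chi2-kernel SVM with C=100 per stream, and a weighted sum of the two streams' per-class scores. Decision values stand in for the prediction scores, and the features, labels, and dimensions are toy placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_chi2_svm(X_train, y_train):
    """SVM with a chi-square kernel and fixed C=100 (features must be non-negative)."""
    clf = SVC(kernel="precomputed", C=100, decision_function_shape="ovr")
    clf.fit(chi2_kernel(X_train, X_train), y_train)
    return clf

def stream_scores(clf, X_train, X_test):
    """Per-class decision scores for the test videos of one stream."""
    return clf.decision_function(chi2_kernel(X_test, X_train))

# Toy global features for the two streams (non-negative, as after ReLU + pooling).
rng = np.random.default_rng(0)
y_train = rng.integers(0, 5, 60)
spa_train, spa_test = rng.random((60, 128)), rng.random((20, 128))
tem_train, tem_test = rng.random((60, 128)), rng.random((20, 128))

spa_clf = train_chi2_svm(spa_train, y_train)
tem_clf = train_chi2_svm(tem_train, y_train)
# Late fusion of the two streams with weights 1 (spatial) and 1.5 (temporal).
fused = 1.0 * stream_scores(spa_clf, spa_train, spa_test) + \
        1.5 * stream_scores(tem_clf, tem_train, tem_test)
print(fused.argmax(axis=1))  # fused class predictions
```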
4. Evaluation

In this section, we experimentally answer the questions raised in the introduction using results on both the UCF101 and HMDB51 datasets. By default, we use the outputs of the global_pool layer of the Inception-BN network as our features and use maximum pooling to aggregate these local features into global features.

4.1. Which layer(s) of features should we extract?

To find out which (type of) layers we need to extract, we conduct experiments on both VGG16 and Inception-BN and show the results on split 1 of UCF101 in Table 2.

Table 2. Layer-wise comparison of the VGG-16 and Inception-BN networks on split 1 of UCF101.

            Spatial ConvNets (%)     Temporal ConvNets (%)    Two-stream (%)
  Layers    VGG-16   Inception-BN    VGG-16   Inception-BN    VGG-16   Inception-BN
  L-1       77.8     83.9            82.6     83.7            89.6     91.7
  L-2       79.5     88.3            85.1     88.8            91.4     94.2
  L-3       80.1     88.3            86.6     88.7            91.8     93.9
  L-4       83.7     85.6            86.5     85.3            92.4     91.4
  L-5       83.5     83.6            87.0     83.6            92.3     89.8
  TSN [18]  79.8     85.7            85.7     87.9            90.9     93.5

Table 2 shows that the L-2 layer for Inception-BN and the L-4 layer for VGG16 give the best performance. One common characteristic of these two layers is that they are the last convolution layers in their respective networks. There are three potential reasons for the superior performance of the last convolution layers. First, compared to the fully-connected layers, the convolution layers have far fewer parameters and hence are much less likely to overfit to training data that suffers from the false label assignment problem. Second, the fully-connected layers do not preserve spatial information, while the convolution layers do. Third, compared to other convolution layers that are further away from the probability layers, these layers contain more global information. Therefore, in terms of which layer(s) to extract, we suggest extracting the last convolution layer and avoiding the fully-connected layers. These results may explain why recent works [20, 1, 9] choose the outputs of the last convolution layers of the networks for further processing.

Compared to the results of Wang et al. [18], from which we get the pre-trained networks, we can see that our approach does improve the performance of both the spatial net and the temporal net. However, the improvements for the spatial networks are much larger. This larger improvement may be because, in training the local feature extractors, the inputs for the spatial net are single frames while the inputs for the temporal net are video clips with 10 stacked frames. Smaller inputs lead to a larger chance of false label assignment and hence a larger performance gap compared to our global feature approach.

Previous works [21, 6, 20] on using local features from ImageNet pre-trained networks show that combining features from multiple layers helps to improve the overall performance significantly. We performed a similar analysis but found no improvement. This difference shows that fine-tuning brings some new characteristics to the local features.

In the following experiments, we only use the output of the global_pool layer of the Inception-BN network, which has better performance than the other layers in both networks.

4.2. How to aggregate the local features into global features?

To determine which aggregation method is better, we test six aggregation methods on split 1 of both the UCF101 and HMDB51 datasets.

Assuming that we have n local features, each of which has a dimension of d, the six different local feature aggregation methods are summarized as follows:

• Mean takes the mean of the n local features and produces a 1×d dimensional global feature.

• Max takes the maximum of the n local features along each dimension.

• Mean_Std, inspired by the Fisher Vector encoding, records both the mean and the standard deviation of each dimension of the n local features.

• BoW models the distribution of local features using k-means and quantizes them into the k centroids.

• VLAD models the distribution of local features using k-means and measures the mean of the differences between each local feature and the k centroids.

• FV models the distribution using GMMs with k Gaussians and measures the mean and standard deviation of the weighted differences between each local feature and the k Gaussians.

For those feature aggregation methods that require clustering, we project each local feature into 256 dimensions using PCA, and the number of clusters for all encoding methods is 256, as suggested in [20].
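For concreteness, the sketch below implements two of these variants, Mean_Std and a standard sum-of-residuals form of VLAD, on PCA-reduced local features; the codebook size and feature dimension are kept small for illustration and do not match the 256 clusters used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def mean_std(local_features):
    """Mean_Std: concatenate per-dimension mean and standard deviation (2*d dims)."""
    return np.concatenate([local_features.mean(axis=0), local_features.std(axis=0)])

def vlad(local_features, kmeans):
    """VLAD: accumulate residuals between each local feature and its nearest
    centroid, then L2-normalize and flatten (k*d dims)."""
    k, d = kmeans.cluster_centers_.shape
    assignments = kmeans.predict(local_features)
    encoding = np.zeros((k, d))
    for feature, centroid_id in zip(local_features, assignments):
        encoding[centroid_id] += feature - kmeans.cluster_centers_[centroid_id]
    return (encoding / (np.linalg.norm(encoding) + 1e-12)).ravel()

local_features = np.random.rand(25, 64)                    # PCA-reduced toy features
codebook = KMeans(n_clusters=8, n_init=10).fit(np.random.rand(1000, 64))
print(mean_std(local_features).shape, vlad(local_features, codebook).shape)  # (128,) (512,)
```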
As can be seen in Table 3, maximum pooling (Max) has similar or better performance compared to the other methods. This observation again differs from [6], where mean pooling (Mean) performs better than maximum pooling (Max). It is also interesting to find that Mean_Std is consistently better than Mean. However, the more complicated encoding methods such as BoW, FV and VLAD are all much worse than simple mean pooling. We conjecture that extracting more local features for each video and breaking each local feature into lower dimensions, as in [20], would improve the results of these encoding methods. However, it would incur excessive computational cost, which limits their usefulness in practice.

Table 3. Comparison of different local feature aggregation methods on split 1 of UCF101 and HMDB51.

              Spatial ConvNets (%)    Temporal ConvNets (%)    Two-stream (%)
              HMDB51   UCF101         HMDB51   UCF101          HMDB51   UCF101
  Mean        56.0     87.5           63.7     88.3            71.1     93.8
  Mean_Std    58.1     88.1           65.2     88.5            72.0     94.2
  Max         57.7     88.3           64.8     88.8            72.5     94.2
  BoW         36.9     71.9           47.9     80.0            53.4     85.3
  FV          39.1     69.8           55.6     81.3            58.5     83.8
  VLAD        45.3     77.3           57.4     84.7            64.7     89.2

4.3. How densely do we need to extract features?

In studying the number of samples needed for each video, we again use maximum pooling (Max) as our feature aggregation method. The number of samples for each video ranges from 3 to 25. We also report the results of using the maximum number of samples (Max), where we extract features for every frame/clip (for optical flow, we use a sliding window with a step size of 1). On average, there are 92 frames per video in HMDB51 and 185 frames per video in UCF101.

Table 4. Number of samples versus accuracy.

  # of       Spatial ConvNets (%)    Temporal ConvNets (%)    Two-stream (%)
  samples    HMDB51   UCF101         HMDB51   UCF101          HMDB51   UCF101
  3          52.5     85.6           54.9     82.4            64.6     91.6
  9          56.1     87.4           62.2     87.7            70.9     93.5
  15         56.9     88.2           64.4     88.5            72.3     93.8
  21         57.1     88.1           64.8     88.6            71.8     94.1
  25         57.7     88.3           64.8     88.8            72.5     94.2
  Max        57.6     88.4           65.3     88.9            72.4     94.3

From the results shown in Table 4, we can see that, after a certain threshold (15 in this case), the number of sampled frames/clips does not have a great impact on the overall performance. A sample number of 25 is enough to achieve performance similar to densely sampling every frame/clip. This is consistent with the observation in [6] and is largely due to the information redundancy among frames.

4.4. Comparison to the state of the art

In Table 5, we compare our best performance to the state of the art. Compared to TSN [18], upon which we improve, we get around 3% and 1% improvements on the HMDB51 and UCF101 datasets, respectively. These results are much better than traditional IDT-based methods [7] and the original two-stream CNNs [11]. Compared to TLE [1] and Deep Quantization [9], our maximum pooling achieves results similar to their more complex bilinear models. We also show the results of fusing with MIFS using late fusion (we downloaded the prediction scores of MIFS from here). For HMDB51, the improvement from fusing MIFS is very significant: we get more than 3% improvement. The improvement for UCF101 is much smaller, as its accuracy is already at a level that is difficult to improve.

Table 5. Comparison to the state of the art.

                                  HMDB51   UCF101
  IDT [15]                        57.2     85.9
  MIFS [7]                        65.1     89.1
  Two-stream [11]                 59.4     88.0
  TSN [18]                        68.5     94.0
  Deep Quantization [9]           -        94.2
  Deep Quantization [9] (w/ IDT)  -        95.2
  TLE [1]                         71.1     95.6
  DOVF (ours)                     71.7     94.9
  DOVF+MIFS (ours)                75.0     95.3

5. Conclusions

In this paper, we propose a method to obtain in-domain global video features by aggregating local neural network features. We study a set of problems, including which features we should extract, how to aggregate these local features into global features, and how densely we should extract the local features. After a set of experiments on the UCF101 and HMDB51 datasets, we conclude that: 1) it is better to extract the outputs of the last convolution layer as features; 2) maximum pooling generally works better than other feature aggregation methods, including those that need further encoding; and 3) a sparse sampling of more than 15 frames/clips per video is enough for maximum pooling. Although we present some observations about this new local feature, DOVF, the reasons behind these observations need further investigation. Also, the current two-stage approach only corrects the mistakes after they happen; we believe that a better way would be to directly map a whole video to the video label, i.e., end-to-end learning. Our future work will focus on these two directions.

References

[1] A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. arXiv preprint arXiv:1611.06678, 2016.
[2] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
[3] M. Hoai and A. Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
[4] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[6] Z. Lan, L. Jiang, S.-I. Yu, S. Rawat, Y. Cai, C. Gao, S. Xu, H. Shen, X. Li, Y. Wang, et al. CMU-Informedia at TRECVID 2013 multimedia event detection. In TRECVID 2013 Workshop, volume 1, page 5, 2013.
[7] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR, 2015.
[8] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506, 2014.
[9] Z. Qiu, T. Yao, and T. Mei. Deep quantization: Encoding convolutional activations with deep generative model. arXiv preprint arXiv:1611.09502, 2016.
[10] M. Sapienza, F. Cuzzolin, and P. H. Torr. Feature sampling and partitioning for visual vocabulary generation on large action classification datasets. arXiv preprint arXiv:1405.7545, 2014.
[11] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[12] B. Varadarajan, G. Toderici, S. Vijayanarasimhan, and A. Natsev. Efficient large scale video classification. arXiv preprint arXiv:1505.06250, 2015.
[13] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
[14] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[15] H. Wang, C. Schmid, et al. Action recognition with improved trajectories. In ICCV, 2013.
[16] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[17] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159, 2015.
[18] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[19] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. arXiv preprint arXiv:1504.01561, 2015.
[20] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015.
[21] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144, 2015.