Attention-Based Multimodal Fusion for Video Description

Chiori Hori    Takaaki Hori    Teng-Yok Lee    Kazuhiro Sumi∗    John R. Hershey    Tim K. Marks
Mitsubishi Electric Research Laboratories (MERL)
{chori, thori, tlee, sumi, hershey, tmarks}@merl.com
∗On sabbatical from Aoyama Gakuin University, [email protected]

Abstract

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has shown the advantage of integrating temporal and/or spatial attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames (temporal attention) or to features from specific spatial regions (spatial attention). In this paper, we propose to expand the attention model to selectively attend not just to specific times or spatial regions, but to specific modalities of input such as image features, motion features, and audio features. Our new modality-dependent attention mechanism, which we call multimodal attention, provides a natural way to fuse multimodal information for video description. We evaluate our method on the Youtube2Text dataset, achieving results that are competitive with the current state of the art. More importantly, we demonstrate that our model incorporating multimodal attention as well as temporal attention significantly outperforms the model that uses temporal attention alone.

1. Introduction and Related Work

Automatic video description, also known as video captioning, refers to the automatic generation of a natural language description (e.g., a sentence) that summarizes an input video. Video description has widespread applications including video retrieval, automatic description of home movies or online uploaded video clips, and video descriptions for the visually impaired. Moreover, developing systems that can describe videos may help us to elucidate some key components of general machine intelligence. Video description research depends on the availability of videos labeled with descriptive text. A large amount of such data is becoming available in the form of audio description prepared for visually impaired users. Thus there is an opportunity to make significant progress in this area. We propose a video description method that uses an attention-based encoder-decoder network to generate sentences from input video.

Sentence generation using an encoder-decoder architecture was originally used for neural machine translation (NMT), in which sentences in a source language are converted into sentences in a target language [25, 5]. In this paradigm, the encoder takes an input sentence in the source language and maps it to a fixed-length feature vector in an embedding space. The decoder uses this feature vector as input to generate a sentence in the target language. However, the fixed length of the feature vector limited performance, particularly on long input sentences, so [1] proposed to encode the input sentence as a sequence of feature vectors, employing a recurrent neural network (RNN)-based soft attention model to enable the decoder to pay attention to features derived from specific words of the input sentence when generating each output word.

The encoder-decoder based sequence-to-sequence framework has been applied not only to machine translation but also to other application areas including speech recognition [2], image captioning [25], and dialog management [16].

In image captioning, the input is a single image, and the output is a natural-language description. Recent work on RNN-based image captioning includes [17, 25]. To improve performance, [27] added an attention mechanism to enable focusing on specific parts of the image when generating each word of the description.

Encoder-decoder networks have also been applied to the task of video description [24].
In this task, the inputs to the encoder network are video information features that may include static image features extracted using convolutional neural networks (CNNs), temporal dynamics of videos extracted using spatiotemporal 3D CNNs [22], dense trajectories [26], optical flow, and audio features [12]. The decoder network takes the encoder outputs and generates word sequences based on language models using recurrent neural networks (RNNs) built from long short-term memory (LSTM) units [9] or gated recurrent units (GRUs) [4]. Such systems can be trained end-to-end using videos labeled with text descriptions.

One inherent problem in video description is that the sequence of video features and the sequence of words in the description are not synchronized. In fact, objects and actions may appear in the video in a different order than they appear in the sentence. When choosing the right words to describe something, only the features that directly correspond to that object or action are relevant, and the other features are a source of clutter. It may be possible for an LSTM to learn to selectively encode different objects into its latent features and remember them until they are retrieved. However, attention mechanisms have been used to boost the network's ability to retrieve the relevant features from the corresponding parts of the input, in applications such as machine translation [1], speech recognition [2], image captioning [27], and dialog management [10]. In recent work, these attention mechanisms have been applied to video description [28, 29]. Whereas in image captioning the attention is spatial (attending to specific regions of the image), in video description the attention may be temporal (attending to specific time frames of the video) in addition to (or instead of) spatial.

In this work, we propose a new use of attention: to fuse information across different modalities. Here we use modality loosely to refer to different types of features derived from the video, such as appearance, motion, or depth, as well as features from different sensors such as video and audio features. Video descriptions can include a variety of descriptive styles, including abstract descriptions of the scene, descriptions focused on objects and their relations, and descriptions of action and motion, including both motion in the scene and camera motion. The soundtrack also contains audio events that provide additional information about the described scene and its context. Depending on what is being described, different modalities of input may be important for selecting appropriate words in the description. For example, the description "A boy is standing on a hill" refers to objects and their relations. In contrast, "A boy is jumping on a hill" may rely on motion features to determine the action. "A boy is listening to airplanes flying overhead" may require audio features to recognize the airplanes, if they do not appear in the video. Not only do the relevant modalities change from sentence to sentence, but also from word to word, as we move from action words that describe motion to nouns that define object types. Attention to the appropriate modalities, as a function of the context, may help with choosing the right words for the video description.

Often features from different modalities can be complementary, in that either can provide reliable cues at different times for some aspect of a scene. Multimodal fusion is thus an important longstanding strategy for robustness. However, optimally combining information requires estimating the reliability of each modality, which remains a challenging problem. In this work, we propose that this estimation be performed by the neural network, by means of an attention mechanism operating across different modalities (in addition to any spatio-temporal attention). By training the system end-to-end to perform the desired description of the semantic content of the video, the system can learn to use attention to fuse the modalities in a context-sensitive way. We present experiments showing that incorporating multimodal attention, in addition to temporal attention, significantly outperforms a corresponding model that uses temporal attention alone.
2. Encoder-decoder-based sentence generator

One basic approach to video description is based on sequence-to-sequence learning. The input sequence, i.e., the image sequence, is first encoded to a fixed-dimensional semantic vector. Then the output sequence, i.e., the word sequence, is generated from the semantic vector. In this case, both the encoder and the decoder (or generator) are usually modeled as Long Short-Term Memory (LSTM) networks. Figure 1 shows an example of the LSTM-based encoder-decoder architecture.

Figure 1. An encoder-decoder based video description generator.

Given a sequence of images, X = x_1, x_2, ..., x_L, each image is first fed to a feature extractor, which can be a pretrained CNN for an image or video classification task such as GoogLeNet [15], VGGNet [20], or C3D [22]. The sequence of image features, X' = x'_1, x'_2, ..., x'_L, is obtained by extracting the activation vector of a fully-connected layer of the CNN for each input image. (In the case of C3D, multiple images are fed to the network at once to capture dynamic features in the video.) The sequence of feature vectors is then fed to the LSTM encoder, and the hidden state of the LSTM is given by

    h_t = \mathrm{LSTM}(h_{t-1}, x'_t; \lambda_E),   (1)

where the LSTM function of the encoder network \lambda_E is computed as

    \mathrm{LSTM}(h_{t-1}, x_t; \lambda) = o_t \odot \tanh(c_t),   (2)

where

    o_t = \sigma(W^{(\lambda)}_{xo} x_t + W^{(\lambda)}_{ho} h_{t-1} + b^{(\lambda)}_o),   (3)
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W^{(\lambda)}_{xc} x_t + W^{(\lambda)}_{hc} h_{t-1} + b^{(\lambda)}_c),   (4)
    f_t = \sigma(W^{(\lambda)}_{xf} x_t + W^{(\lambda)}_{hf} h_{t-1} + b^{(\lambda)}_f),   (5)
    i_t = \sigma(W^{(\lambda)}_{xi} x_t + W^{(\lambda)}_{hi} h_{t-1} + b^{(\lambda)}_i),   (6)

where \sigma(\cdot) is the element-wise sigmoid function, and i_t, f_t, o_t, and c_t are, respectively, the input gate, forget gate, output gate, and cell activation vectors for the t-th input vector. The weight matrices W^{(\lambda)}_{zz} and the bias vectors b^{(\lambda)}_z are identified by the subscript z \in \{x, h, i, f, o, c\}. For example, W_{hi} is the hidden-input gate matrix and W_{xo} is the input-output gate matrix. We did not use peephole connections in this work.
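For concreteness, the following is a minimal NumPy sketch of the encoder LSTM update in Eqs. (2)-(6). The weight shapes, the random initialization, and the 1024-/512-dimensional sizes are illustrative assumptions, not the exact training configuration used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One encoder LSTM step following Eqs. (2)-(6) (no peephole connections).

    x_t:    input feature vector x'_t
    h_prev: previous hidden state h_{t-1}
    c_prev: previous cell state c_{t-1}
    p:      dict of weight matrices W_* and bias vectors b_*
    """
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate,  Eq. (6)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate, Eq. (5)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                         # Eq. (2)
    return h_t, c_t

# Toy usage with illustrative sizes (e.g., 1024-dim frame features, 512 hidden units).
rng = np.random.default_rng(0)
n_in, n_hid = 1024, 512
p = {}
for z in ["i", "f", "o", "c"]:
    p[f"W_x{z}"] = 0.01 * rng.standard_normal((n_hid, n_in))
    p[f"W_h{z}"] = 0.01 * rng.standard_normal((n_hid, n_hid))
    p[f"b_{z}"] = np.zeros(n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):   # a short feature sequence X'
    h, c = lstm_step(x_t, h, c, p)
```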
The decoder predicts the next word iteratively, beginning with the start-of-sentence token "<sos>", until it predicts the end-of-sentence token "<eos>". Given decoder state s_{i-1}, the decoder network \lambda_D infers the next word probability distribution as

    P(y \mid s_{i-1}) = \mathrm{softmax}(W^{(\lambda_D)}_s s_{i-1} + b^{(\lambda_D)}_s),   (7)

and generates word y_i, which has the highest probability, according to

    y_i = \mathop{\mathrm{argmax}}_{y \in V} P(y \mid s_{i-1}),   (8)

where V denotes the vocabulary. The decoder state is updated using the LSTM network of the decoder as

    s_i = \mathrm{LSTM}(s_{i-1}, y'_i; \lambda_D),   (9)

where y'_i is a word-embedding vector of y_i, and the initial state s_0 is obtained from the final encoder state h_L and y'_0 = \mathrm{Embed}(\mathrm{<sos>}), as in Figure 1.

In the training phase, Y = y_1, ..., y_M is given as the reference. However, in the test phase, the best word sequence needs to be found based on

    \hat{Y} = \mathop{\mathrm{argmax}}_{Y \in V^*} P(Y \mid X)   (10)
            = \mathop{\mathrm{argmax}}_{y_1,\dots,y_M \in V^*} P(y_1 \mid s_0) P(y_2 \mid s_1) \cdots P(y_M \mid s_{M-1}) P(\mathrm{<eos>} \mid s_M).   (11)

Accordingly, we use a beam search in the test phase to keep multiple states and hypotheses with the highest cumulative probabilities at each m-th step, and select the best hypothesis from those having reached the end-of-sentence token.
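As a concrete illustration of the decoding procedure of Eqs. (10)-(11), here is a minimal beam-search sketch. The `step` function, the beam width, and the maximum length are hypothetical placeholders: `step` stands in for one decoder update, returning the new decoder state and the log word probabilities of Eq. (7).

```python
import numpy as np

def beam_search(s0, step, eos_id, beam_width=5, max_len=20):
    """Approximate argmax_Y P(Y|X) of Eqs. (10)-(11) with a beam search.

    s0:   initial decoder state (derived from the final encoder state)
    step: function (state, word_id or None for <sos>) -> (new_state, log_probs over vocabulary)
    """
    # Each hypothesis: (cumulative log probability, word-id sequence, decoder state).
    beams = [(0.0, [], s0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq, state in beams:
            last = seq[-1] if seq else None          # None stands for <sos>
            new_state, log_probs = step(state, last)
            # Extend this hypothesis with its top-scoring next words.
            for w in np.argsort(log_probs)[::-1][:beam_width]:
                candidates.append((logp + log_probs[w], seq + [int(w)], new_state))
        # Prune to the best `beam_width` hypotheses overall.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = []
        for logp, seq, state in candidates[:beam_width]:
            if seq[-1] == eos_id:
                finished.append((logp, seq))         # hypothesis reached <eos>
            else:
                beams.append((logp, seq, state))
        if not beams:
            break
    if not finished:                                  # fall back if nothing reached <eos>
        finished = [(logp, seq) for logp, seq, _ in beams]
    return max(finished, key=lambda h: h[0])[1]
```

In practice the beam width trades off search quality against decoding time; Eq. (8) corresponds to the special case of a beam width of 1 (greedy search).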
3. Attention-based sentence generator

Another approach to video description is an attention-based sequence generator [6], which enables the network to emphasize features from specific times or spatial regions depending on the current context, enabling the next word to be predicted more accurately. Compared to the basic approach described in Section 2, the attention-based generator can exploit input features selectively according to the input and output contexts. The efficacy of attention models has been shown in many tasks such as machine translation [1].

Figure 2 shows an example of the attention-based sentence generator from video, which has a temporal attention mechanism over the input image sequence.

Figure 2. An encoder-decoder based sentence generator with temporal attention mechanism.

The input sequence of feature vectors is obtained using one or more feature extractors. Generally, attention-based generators employ an encoder based on a bidirectional LSTM (BLSTM) or Gated Recurrent Units (GRUs) to further convert the feature vector sequence so that each vector contains its contextual information. In video description tasks, however, CNN-based features are often used directly, or one more feed-forward layer is added to reduce the dimensionality.

If we use a BLSTM encoder following the feature extraction, then the activation vectors (i.e., encoder states) are obtained as

    h_t = [h^{(f)}_t; h^{(b)}_t],   (12)

where h^{(f)}_t and h^{(b)}_t are the forward and backward hidden activation vectors:

    h^{(f)}_t = \mathrm{LSTM}(h^{(f)}_{t-1}, x'_t; \lambda^{(f)}_E),   (13)
    h^{(b)}_t = \mathrm{LSTM}(h^{(b)}_{t+1}, x'_t; \lambda^{(b)}_E).   (14)

If we use a feed-forward layer, then the activation vector is calculated as

    h_t = \tanh(W_p x'_t + b_p),   (15)

where W_p is a weight matrix and b_p is a bias vector. If we use the CNN features directly, then we assume h_t = x'_t.

The attention mechanism is realized by applying attention weights to the hidden activation vectors throughout the input sequence. These weights enable the network to emphasize features from those time steps that are most important for predicting the next output word.

Let \alpha_{i,t} be the attention weight between the i-th output word and the t-th input feature vector. For the i-th output, the vector representing the relevant content of the input sequence is obtained as a weighted sum of hidden unit activation vectors:

    c_i = \sum_{t=1}^{L} \alpha_{i,t} h_t.   (16)

The decoder network is an Attention-based Recurrent Sequence Generator (ARSG) [1, 6] that generates an output label sequence with content vectors c_i. The network also has an LSTM decoder network, where the decoder state can be updated in the same way as Equation (9). Then, the output label probability is computed as

    P(y \mid s_{i-1}, c_i) = \mathrm{softmax}(W^{(\lambda_D)}_s s_{i-1} + W^{(\lambda_D)}_c c_i + b^{(\lambda_D)}_s),   (17)

and word y_i is generated according to

    y_i = \mathop{\mathrm{argmax}}_{y \in V} P(y \mid s_{i-1}, c_i).   (18)

In contrast to Equations (7) and (8) of the basic encoder-decoder, the probability distribution is conditioned on the content vector c_i, which emphasizes specific features that are most relevant to predicting each subsequent word. One more feed-forward layer can be inserted before the softmax layer. In this case, the probabilities are computed as follows:

    g_i = \tanh(W^{(\lambda_D)}_s s_{i-1} + W^{(\lambda_D)}_c c_i + b^{(\lambda_D)}_s),   (19)

and

    P(y \mid s_{i-1}, c_i) = \mathrm{softmax}(W^{(\lambda_D)}_g g_i + b^{(\lambda_D)}_g).   (20)

The attention weights are computed in the same manner as in [1]:

    \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{\tau=1}^{L} \exp(e_{i,\tau})},   (21)

and

    e_{i,t} = w_A^{\top} \tanh(W_A s_{i-1} + V_A h_t + b_A),   (22)

where W_A and V_A are matrices, w_A and b_A are vectors, and e_{i,t} is a scalar.
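The temporal attention computation of Eqs. (16), (21), and (22) is small enough to show in full; below is a minimal NumPy sketch. The 128-dimensional attention space, the parameter scales, and the random inputs are illustrative assumptions.

```python
import numpy as np

def temporal_attention(s_prev, H, W_A, V_A, w_A, b_A):
    """Compute attention weights alpha_{i,t} (Eqs. (21)-(22)) and the
    content vector c_i (Eq. (16)) for one output step.

    s_prev: previous decoder state s_{i-1}
    H:      encoder activations, shape (L, hidden_dim), rows h_1..h_L
    """
    # e_{i,t} = w_A^T tanh(W_A s_{i-1} + V_A h_t + b_A)            Eq. (22)
    e = np.tanh(W_A @ s_prev + H @ V_A.T + b_A) @ w_A              # shape (L,)
    # alpha_{i,t}: softmax over the L time steps                    Eq. (21)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # c_i = sum_t alpha_{i,t} h_t                                   Eq. (16)
    c_i = alpha @ H
    return alpha, c_i

# Illustrative sizes: L=5 frames, 512-dim encoder states and decoder state,
# 128-dim attention space.
rng = np.random.default_rng(0)
L, d_h, d_s, d_a = 5, 512, 512, 128
alpha, c_i = temporal_attention(
    s_prev=rng.standard_normal(d_s),
    H=rng.standard_normal((L, d_h)),
    W_A=0.01 * rng.standard_normal((d_a, d_s)),
    V_A=0.01 * rng.standard_normal((d_a, d_h)),
    w_A=0.01 * rng.standard_normal(d_a),
    b_A=np.zeros(d_a),
)
```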
4. Attention-based multimodal fusion

This section proposes an attention model to handle fusion of multiple modalities, where each modality has its own sequence of feature vectors. For video description, multimodal inputs such as image features, motion features, and audio features are available. Furthermore, combinations of multiple features from different feature extraction methods are often effective for improving description accuracy.

In [29], content vectors from VGGNet (image features) and C3D (spatiotemporal motion features) are combined into one vector, which is used to predict the next word. This is performed in the fusion layer, in which the following activation vector is computed instead of Eq. (19):

    g_i = \tanh(W^{(\lambda_D)}_s s_{i-1} + d_i + b^{(\lambda_D)}_s),   (23)

where

    d_i = W^{(\lambda_D)}_{c_1} c_{1,i} + W^{(\lambda_D)}_{c_2} c_{2,i},   (24)

and c_{1,i} and c_{2,i} are two different content vectors obtained using different feature extractors and/or different input modalities.

Figure 3. Simple feature fusion.

Figure 3 shows the simple feature fusion approach, in which content vectors are obtained with attention weights for the individual input sequences x_{11}, ..., x_{1L} and x_{21}, ..., x_{2L'}, respectively. However, these content vectors are combined with weight matrices W_{c_1} and W_{c_2}, which are shared across the sentence generation step. Consequently, the content vectors from each feature type (or modality) are always fused using the same weights, independent of the decoder state. This architecture lacks the ability to exploit multiple types of features effectively, because it does not allow the relative weights of each feature type (of each modality) to change based on the context.

This paper extends the attention mechanism to multimodal fusion. Using this multimodal attention mechanism, based on the current decoder state, the decoder network can selectively attend to specific modalities of input (or specific feature types) to predict the next word. Let K be the number of modalities, i.e., the number of sequences of input feature vectors. Our attention-based feature fusion is performed using

    g_i = \tanh\Big(W^{(\lambda_D)}_s s_{i-1} + \sum_{k=1}^{K} \beta_{k,i} d_{k,i} + b^{(\lambda_D)}_s\Big),   (25)

where

    d_{k,i} = W^{(\lambda_D)}_{c_k} c_{k,i} + b^{(\lambda_D)}_{c_k}.   (26)

The multimodal attention weights \beta_{k,i} are obtained in a similar way to the temporal attention mechanism:

    \beta_{k,i} = \frac{\exp(v_{k,i})}{\sum_{\kappa=1}^{K} \exp(v_{\kappa,i})},   (27)

where

    v_{k,i} = w_B^{\top} \tanh(W_B s_{i-1} + V_{Bk} c_{k,i} + b_{Bk}),   (28)

where W_B and V_{Bk} are matrices, w_B and b_{Bk} are vectors, and v_{k,i} is a scalar.

Figure 4 shows the architecture of our sentence generator, including the multimodal attention mechanism. Unlike the simple multimodal fusion method in Figure 3, in Figure 4 the feature-level attention weights can change according to the decoder state and the content vectors, which enables the decoder network to pay attention to a different set of features and/or modalities when predicting each subsequent word in the description.

Figure 4. Our multimodal attention mechanism.
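To make the modality-level attention concrete, here is a minimal NumPy sketch of Eqs. (25)-(28) for K modalities. The parameter dictionary layout and all shapes are illustrative; the per-modality content vectors c_{k,i} are assumed to have been computed with the temporal attention of Section 3.

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def multimodal_attention_fusion(s_prev, C, params):
    """Fuse K per-modality content vectors c_{k,i} into g_i, Eqs. (25)-(28).

    s_prev: previous decoder state s_{i-1}
    C:      list of K content vectors c_{k,i} (one per modality; sizes may differ)
    params: dict with W_s, b_s, w_B, W_B and per-modality lists W_c, b_c, V_B, b_B
    """
    K = len(C)
    # d_{k,i} = W_ck c_{k,i} + b_ck                                   Eq. (26)
    d = [params["W_c"][k] @ C[k] + params["b_c"][k] for k in range(K)]
    # v_{k,i} = w_B^T tanh(W_B s_{i-1} + V_Bk c_{k,i} + b_Bk)          Eq. (28)
    v = np.array([
        params["w_B"] @ np.tanh(params["W_B"] @ s_prev
                                + params["V_B"][k] @ C[k] + params["b_B"][k])
        for k in range(K)
    ])
    # beta_{k,i}: attention distribution over the K modalities         Eq. (27)
    beta = softmax(v)
    # g_i = tanh(W_s s_{i-1} + sum_k beta_{k,i} d_{k,i} + b_s)         Eq. (25)
    fused = sum(beta[k] * d[k] for k in range(K))
    g_i = np.tanh(params["W_s"] @ s_prev + fused + params["b_s"])
    return beta, g_i
```

Inspecting the resulting \beta_{k,i} for each generated word gives a direct view of which modality the decoder relied on at that step.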
5. Experiments

5.1. Dataset

We evaluated our proposed feature fusion using the Youtube2Text video corpus [8]. This corpus is well suited for training and evaluating automatic video description generation models. The dataset has 1,970 video clips with multiple natural language descriptions. Each video clip is annotated with multiple parallel sentences provided by different Mechanical Turkers. There are 80,839 sentences in total, with about 41 annotated sentences per clip. Each sentence contains about 8 words on average. The words contained in all the sentences constitute a vocabulary of 13,010 unique lexical entries. The dataset is open-domain and covers a wide range of topics including sports, animals, and music. Following [28], we split the dataset into a training set of 1,200 video clips, a validation set of 100 clips, and a test set consisting of the remaining 670 clips.

5.2. Video Preprocessing

The image data are extracted from each video clip, which consists of 24 frames per second, and rescaled to 224×224-pixel images. To extract image features, a pretrained GoogLeNet [15] CNN is used to obtain a fixed-length representation, with the help of the popular implementation in Caffe [11]. Features are extracted from the hidden layer pool5/7x7_s1. We select one frame out of every 16 frames from each video clip and feed them into the CNN to obtain 1024-dimensional frame-wise feature vectors.

We also use a VGGNet [20] that was pretrained on the ImageNet dataset [14]. The hidden activation vectors of fully connected layer fc7 are used as the image features, which produces a sequence of 4096-dimensional feature vectors. Furthermore, to model motion and short-term spatiotemporal activity, we use the pretrained C3D [22] (which was trained on the Sports-1M dataset [13]). The C3D network reads sequential frames in the video and outputs a fixed-length feature vector every 16 frames. We extracted activation vectors from fully-connected layer fc6-1, which has 4096-dimensional features.

5.3. Audio Processing

Unlike previous methods that use the YouTube2Text dataset [28, 18, 29], we also incorporate audio features, to use in our attention-based feature fusion method. Since the YouTube2Text corpus does not contain audio tracks, we extracted the audio data via the original video URLs. Although a subset of the videos were no longer available on YouTube, we were able to collect the audio data for 1,649 video clips, which covers 84% of the corpus. The 44 kHz-sampled audio data are down-sampled to 16 kHz, and Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from each 50 ms time window with 25 ms shift. The sequence of 13-dimensional MFCC features is then concatenated into one vector for every group of 20 consecutive frames, which results in a sequence of 260-dimensional vectors. The MFCC features are normalized so that the mean and variance vectors are 0 and 1 in the training set. The validation and test sets are also adjusted with the original mean and variance vectors of the training set. Unlike with the image features, we apply a BLSTM encoder network for the MFCC features, which is trained jointly with the decoder network. If audio data are missing for a video clip, then we feed in a sequence of dummy MFCC features, which is simply a sequence of zero vectors.
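The audio front end described above amounts to frame stacking, mean-variance normalization, and zero-vector padding for missing clips. A sketch is given below, assuming the 13-dimensional MFCCs have already been extracted with an external tool; whether normalization is applied before or after stacking is not specified in the text, so the sketch normalizes the stacked vectors.

```python
import numpy as np

def stack_mfcc(mfcc, group=20):
    """Concatenate each group of 20 consecutive 13-dim MFCC frames into one
    260-dimensional vector, as described in Section 5.3.

    mfcc: array of shape (num_frames, 13)
    """
    usable = (len(mfcc) // group) * group
    return mfcc[:usable].reshape(-1, group * mfcc.shape[1])   # (num_frames // 20, 260)

def normalize(feats, mean, std):
    """Mean-variance normalization with training-set statistics."""
    return (feats - mean) / std

def dummy_audio(num_vectors, dim=260):
    """Zero vectors used when a clip's audio track is unavailable."""
    return np.zeros((num_vectors, dim))

# Illustrative usage: 400 MFCC frames -> 20 stacked 260-dim vectors.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((400, 13))
stacked = stack_mfcc(mfcc)                                     # shape (20, 260)
mean = stacked.mean(axis=0)                                    # would come from the training set
std = stacked.std(axis=0) + 1e-8
audio_feats = normalize(stacked, mean, std)
```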
Table 1. Evaluation results on the YouTube2Text test set. The last three rows of the table present previous state-of-the-art methods, which use only temporal attention. The rest of the table shows results from our own implementations. The first three rows of the table use temporal attention but only one modality (one feature type). The next two rows do multimodal fusion of two modalities (image and spatiotemporal) using either Simple Multimodal fusion (see Figure 3) or our proposed Multimodal Attention mechanism (see Figure 4). The next two rows also perform multimodal fusion, this time of three modalities (image, spatiotemporal, and audio features).

| Fusion method | Attention | Image | Spatiotemporal | Audio | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | CIDEr |
| Unimodal (RMSprop) | Temporal | GoogLeNet | | | 0.766 | 0.643 | 0.547 | 0.440 | 0.295 | 0.568 |
| Unimodal (RMSprop) | Temporal | VGGNet | | | 0.800 | 0.677 | 0.574 | 0.464 | 0.309 | 0.654 |
| Unimodal (RMSprop) | Temporal | | C3D | | 0.785 | 0.664 | 0.569 | 0.464 | 0.304 | 0.578 |
| Simple Multimodal (RMSprop) | Temporal | VGGNet | C3D | | 0.824 | 0.708 | 0.606 | 0.498 | 0.322 | 0.665 |
| Multimodal Attention (AdaDelta) | Temporal & Multimodal | VGGNet | C3D | | 0.801 | 0.691 | 0.601 | 0.507 | 0.318 | 0.699 |
| Simple Multimodal (RMSprop) | Temporal | VGGNet | C3D | MFCC | 0.819 | 0.709 | 0.614 | 0.510 | 0.321 | 0.679 |
| Multimodal Attention (AdaDelta) | Temporal & Multimodal | VGGNet | C3D | MFCC | 0.795 | 0.691 | 0.608 | 0.517 | 0.317 | 0.695 |
| TA [28] | Temporal | GoogLeNet | 3D CNN | | 0.800 | 0.647 | 0.526 | 0.419 | 0.296 | 0.517 |
| LSTM-E [18] | | VGGNet | C3D | | 0.788 | 0.660 | 0.554 | 0.453 | 0.310 | - |
| h-RNN [29] (RMSprop) | Temporal | VGGNet | C3D | | 0.815 | 0.704 | 0.604 | 0.499 | 0.326 | 0.658 |

5.4. Experimental Setup

The caption generation model, i.e., the decoder network, is trained to minimize the cross-entropy criterion using the training set. Image features are fed to the decoder network through one projection layer of 512 units, while audio features, i.e., MFCCs, are fed to the BLSTM encoder followed by the decoder network. The encoder network has one projection layer of 512 units and bidirectional LSTM layers of 512 cells. The decoder network has one LSTM layer with 512 cells. Each word is embedded to a 256-dimensional vector when it is fed to the LSTM layer. We compared the AdaDelta optimizer [30] and RMSprop, which are widely used for optimizing attention models, to update the parameters. The LSTM and attention models were implemented using Chainer [21].

The similarity between the ground truth and the automatic video description results is evaluated using machine-translation-motivated metrics: BLEU [19], METEOR [7], and the newly proposed metric for image description, CIDEr [23]. We used the publicly available evaluation script prepared for the image captioning challenge [3]. Each video in YouTube2Text has multiple "ground-truth" descriptions, but some "ground-truth" answers are incorrect. Since BLEU and METEOR scores for a video do not consider the frequency of words in the ground truth, they can be strongly affected by one incorrect ground-truth description. METEOR is even more susceptible since it also accepts paraphrases of incorrect ground-truth words. In contrast, CIDEr is a voting-based metric that is robust to errors in the ground truth.

5.5. Results and Discussion

Table 1 shows the evaluation results on the Youtube2Text data set. We compared the performance of our multimodal attention model (Multimodal Attention), which integrates temporal and multimodal attention mechanisms, with a simple additive multimodal fusion model (Simple Multimodal), unimodal models with temporal attention (Unimodal), and baseline systems that use temporal attention.

The Simple Multimodal model performed better than the Unimodal models. The proposed Multimodal Attention model consistently outperformed Simple Multimodal. The audio features helped the performance of the baseline. Combining the audio features using our multimodal attention method achieved the best BLEU4 score. However, the multimodal attention method without the audio features achieved the best CIDEr score. The audio features did not always help. This is because some YouTube data includes noise such as background music, which is unrelated to the video content. We need to analyze the contribution of the audio features in detail.

In contrast to the existing systems, our temporal attention system that used only static image features (Unimodal) outperformed TA, which uses a combination of static image and dynamic video features [28]. Our proposed attention mechanisms outperformed LSTM-E [18], which does not use attention mechanisms. Our Simple Multimodal system using temporal attention has the same basic structure used by h-RNN, as well as the same features extracted from VGGNet [20] and C3D [22]. While h-RNN used L2 regularization and RMSprop, we used L2 regularization for all experimental conditions and compared RMSprop and AdaDelta. Although RMSprop performed better for Unimodal and Simple Multimodal, AdaDelta performed better for Multimodal Attention.

6. Conclusion

We proposed a new modality-dependent attention mechanism, which we call multimodal attention, for video description based on encoder-decoder sentence generation using recurrent neural networks (RNNs). In this approach, the attention model selectively attends not just to specific times, but to specific modalities of input such as image features, spatiotemporal motion features, and audio features.
This approach provides a natural way to fuse multimodal information for video description. We evaluate our method on the Youtube2Text dataset, achieving results that are competitive with current state-of-the-art methods that employ temporal attention models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. More importantly, we demonstrate that our model incorporating multimodal attention as well as temporal attention outperforms the model that uses temporal attention alone.

References

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. pages 4945–4949, 2016.
[3] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.
[4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.
[5] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734, 2014.
[6] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.
[7] M. J. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, pages 376–380, 2014.
[8] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2712–2719, 2013.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] T. Hori, H. Wang, C. Hori, S. Watanabe, B. Harsham, J. L. Roux, J. Hershey, Y. Koji, Y. Jing, Z. Zhu, and T. Aikawa. Dialog state tracking with attention-based sequence-to-sequence learning. In 2016 IEEE Spoken Language Technology Workshop, SLT 2016, San Diego, CA, USA, December 13-16, 2016.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] Q. Jin, J. Liang, and X. Lin. Generating natural video descriptions via multimodal processing. In Interspeech, 2016.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[15] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[16] R. Lowe, N. Pow, I. Serban, and J. Pineau. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2-4 September 2015, Prague, Czech Republic, pages 285–294, 2015.
[17] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR, abs/1412.6632, 2014.
[18] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. CoRR, abs/1505.01861, 2015.
[19] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318, 2002.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[21] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
[22] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489–4497, 2015.
[23] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575, 2015.
[24] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31-June 5, 2015, pages 1494–1504, 2015.
[25] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3156–3164, 2015.
[26] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3169–3176, Colorado Springs, United States, June 2011.
[27] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2048–2057, 2015.
[28] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle, and A. C. Courville. Describing videos by exploiting temporal structure. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4507–4515, 2015.
[29] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. CoRR, abs/1510.07712, 2015.
[30] M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.