Human Activity Analysis: A Review

J. K. AGGARWAL, University of Texas at Austin
M. S. RYOO, Electronics and Telecommunications Research Institute, Daejeon, and University of Texas at Austin

Human activity recognition is an important area of computer vision research. Its applications include surveillance systems, patient monitoring systems, and a variety of systems that involve interactions between persons and electronic devices such as human-computer interfaces. Most of these applications require an automated recognition of high-level activities, composed of multiple simple (or atomic) actions of persons. This article provides a detailed overview of various state-of-the-art research papers on human activity recognition. We discuss both the methodologies developed for simple human actions and those for high-level activities. An approach-based taxonomy is chosen that compares the advantages and limitations of each approach. Recognition methodologies for an analysis of the simple actions of a single person are first presented in the article. Space-time volume approaches and sequential approaches that represent and recognize activities directly from input images are discussed. Next, hierarchical recognition methodologies for high-level activities are presented and compared. Statistical approaches, syntactic approaches, and description-based approaches for hierarchical recognition are discussed in the article. In addition, we further discuss the papers on the recognition of human-object interactions and group activities. Public datasets designed for the evaluation of the recognition methodologies are illustrated in our article as well, comparing the methodologies' performances. This review will provide the impetus for future research in more productive areas.

Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Motion; I.4.8 [Image Processing and Computer Vision]: Scene Analysis; I.5.4 [Pattern Recognition]: Applications—Computer vision

General Terms: Algorithms

Additional Key Words and Phrases: Computer vision, human activity recognition, event detection, activity analysis, video recognition

ACM Reference Format: Aggarwal, J. K. and Ryoo, M. S. 2011. Human activity analysis: A review. ACM Comput. Surv. 43, 3, Article 16 (April 2011), 43 pages. DOI = 10.1145/1922649.1922653 http://doi.acm.org/10.1145/1922649.1922653

This work was supported in part by the Texas Higher Education Coordinating Board under award 003658-0140-2007. Authors' addresses: J. K. Aggarwal, Computer and Vision Research Center, Department of Electrical and Computer Engineering, the University of Texas at Austin, Austin, TX 78705; M. S. Ryoo, Robot Research Department, Electronics and Telecommunications Research Institute, Daejeon 305-700, Korea.

1. INTRODUCTION

Human activity recognition is an important area of computer vision research today. The goal of human activity recognition is to automatically analyze ongoing activities from an unknown video (i.e., a sequence of image frames). In a simple case where a video is segmented to contain only one execution of a human activity, the objective of the system is to correctly classify the video into its activity category. In more general
cases, the continuous recognition of human activities must be performed by detecting starting and ending times of all occurring activities from an input video.

The ability to recognize complex human activities from videos enables the construction of several important applications. Automated surveillance systems in public places like airports and subway stations require detection of abnormal and suspicious activities, as opposed to normal activities. For instance, an airport surveillance system must be able to automatically recognize suspicious activities like "a person leaving a bag" or "a person placing his/her bag in a trash bin." Recognition of human activities also enables the real-time monitoring of patients, children, and elderly persons. The construction of gesture-based human computer interfaces and vision-based intelligent environments becomes possible with an activity recognition system as well.

There are various types of human activities. Depending on their complexity, we conceptually categorize human activities into four different levels: gestures, actions, interactions, and group activities. Gestures are elementary movements of a person's body part, and are the atomic components describing the meaningful motion of a person. "Stretching an arm" and "raising a leg" are good examples of gestures. Actions are single-person activities that may be composed of multiple gestures organized temporally, such as "walking," "waving," and "punching." Interactions are human activities that involve two or more persons and/or objects. For example, "two persons fighting" is an interaction between two humans, and "a person stealing a suitcase from another" is a human-object interaction involving two humans and one object. Finally, group activities are the activities performed by conceptual groups composed of multiple persons and/or objects: "a group of persons marching," "a group having a meeting," and "two groups fighting" are typical examples.

The objective of this article is to provide a complete overview of state-of-the-art human activity recognition methodologies. We discuss various types of approaches designed for the recognition of different levels of activities. The previous review written by Aggarwal and Cai [1999] covered several essential low-level components for the understanding of human motion, such as tracking and body posture analysis. However, the motion analysis methodologies themselves were insufficient to describe and annotate ongoing human activities with complex structures, and most approaches in the 1990s focused on the recognition of gestures and simple actions. In this new review, we concentrate on high-level activity recognition methodologies designed for the analysis of human actions, interactions, and group activities, discussing recent research trends in activity recognition.

[Fig. 1. The hierarchical approach-based taxonomy of this review.]

Figure 1 illustrates an overview of the tree-structured taxonomy that our review follows. We have chosen an approach-based taxonomy. All activity recognition methodologies are first classified into two categories: single-layered approaches and hierarchical approaches. Single-layered approaches are those that represent and recognize human activities directly based on sequences of images.
Due to their nature, single-layered approaches are suitable for the recognition of gestures and actions with sequential characteristics. On the other hand, hierarchical approaches represent high-level human activities by describing them in terms of other simpler activities, which are generally called subevents. Recognition systems composed of multiple layers are constructed, thus making them suitable for the analysis of complex activities.

Single-layered approaches are again classified into two types depending on how they model human activities: that is, space-time approaches and sequential approaches. Space-time approaches view an input video as a 3-D (XYT) volume, while sequential approaches interpret it as a sequence of observations. Space-time approaches are further divided into three categories based on the features they use from the 3-D space-time volumes: volumes themselves, trajectories, or local interest point descriptors. Sequential approaches are classified depending on whether they use exemplar-based recognition methodologies or model-based recognition methodologies. Figure 2 shows a detailed taxonomy used for single-layered approaches covered in the review, together with a number of publications corresponding to each category.

[Fig. 2. Detailed taxonomy for single-layered approaches and the lists of selected publications corresponding to each category.]

Hierarchical approaches are classified on the basis of the recognition methodologies they use: statistical approaches, syntactic approaches, and description-based approaches. Statistical approaches construct statistical state-based models concatenated hierarchically (e.g., layered hidden Markov models) to represent and recognize high-level human activities. Similarly, syntactic approaches use a grammar syntax such as a stochastic context-free grammar (SCFG) to model sequential activities. Essentially, they model a high-level activity as a string of atomic-level activities. Description-based approaches represent human activities by describing subevents of the activities and their temporal, spatial, and logical structures. Figure 3 presents lists of representative publications corresponding to the categories.

[Fig. 3. Detailed taxonomy for hierarchical approaches and the lists of publications corresponding to each category.]

In addition, in Figures 2 and 3, we point to previous work that recognizes human-object interactions and group activities by using different colors and by attaching "O" (object) and "G" (group) tags to the right-hand side. The recognition of human-object interactions requires the analysis of interplays between object recognition and activity analysis. This article provides a survey on the methodologies focusing on the analysis of such interplays for the improved recognition of human activities. Similarly, the recognition of groups and the analysis of their structures is necessary for group activity detection, and in this review we cover them as well.

This review is organized as follows: Section 2 covers single-layered approaches. In Section 3 we review hierarchical recognition approaches for the analysis of high-level activities. Section 4.1 discusses recognition methodologies for interactions between humans and objects, while concentrating especially on how previous work handled interplays between object recognition and motion analysis. Section 4.2 presents work on group activity recognition.
In Section 5.1 we review available public datasets and compare the systems tested on them. In addition, Section 5.2 covers real-time systems for human activity recognition. Section 6 concludes the article.

1.1. Comparisons with Previous Reviews

There have been other related surveys on human activity recognition. Several previous reviews on human motion analysis [Cedras and Shah 1995; Gavrila 1999; Aggarwal and Cai 1999] discussed human action recognition approaches as a part of their review. Kruger et al. [2007] reviewed human action recognition approaches while classifying them on the basis of the complexity of features involved in the action recognition process. Their review focused especially on the planning aspect of human action recognition, considering its potential application to robotics. The Turaga et al. [2008] survey covered human activity recognition approaches, similar to ours. In their paper, approaches are first categorized based on the complexity of the activities that they want to recognize, and are then classified in terms of the recognition methodologies they use.

However, most of the previous reviews have focused on the introduction and summarization of activity recognition methodologies, and do not provide a means to compare different types of human activity recognition approaches. In this review, we present interclass and intraclass comparisons between approaches, while providing an overview of human activity recognition approaches which are categorized on the approach-based taxonomy presented above. To be able to compare the abilities of recognition methodologies is essential for us to take advantage of them. Our goal is to enable a reader (even one from a different field) to understand the context of the development of human activity recognition and comprehend the advantages and disadvantages of the different approach categories.

We use a more elaborate taxonomy and compare and contrast each approach category in detail. For example, differences between single-layered approaches and hierarchical approaches are discussed at the highest level of our review, while space-time approaches are compared with sequential approaches at an intermediate level. We compare the abilities of previous systems within each class as well, pointing out what they are able to recognize and what they are not. Furthermore, our review covers recognition methodologies for complex human activities, including human-object interactions and group activities, which previous reviews have not focused on. Finally, we discuss the public datasets used by the systems, and compare the performance of the recognition methodologies on the datasets.

2. SINGLE-LAYERED APPROACHES

Single-layered approaches recognize human activities directly from video data. Such approaches consider an activity as a particular class of image sequences, and recognize the activity from an unknown image sequence (i.e., an input) by categorizing it into its class. Various representation methodologies and matching algorithms have been developed to enable the recognition system to make an accurate decision as to whether an image sequence belongs to a certain activity class or not. For recognition from continuous videos, most single-layered approaches have adopted a sliding windows technique that classifies all possible subsequences. Single-layered approaches are most effective when a particular sequential pattern that describes an activity can be captured from training sequences.
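Since the sliding-window technique is mentioned only briefly, a minimal Python sketch may help: it scans a continuous frame sequence with windows of several lengths and classifies every candidate subsequence. The classifier `score_fn`, the window lengths, and the threshold are illustrative placeholders, not details from the article.

```python
def sliding_window_detections(frames, score_fn, window_lengths=(30, 60, 90),
                              stride=5, threshold=0.8):
    """Scan a continuous video with fixed-length windows and report
    (start, end, label, score) for every window whose best class score
    exceeds the threshold. `frames` is a sequence of image frames;
    `score_fn` maps a subsequence to a dict of {activity_label: score}."""
    detections = []
    num_frames = len(frames)
    for length in window_lengths:
        for start in range(0, num_frames - length + 1, stride):
            window = frames[start:start + length]
            scores = score_fn(window)
            label = max(scores, key=scores.get)   # best-scoring activity class
            if scores[label] >= threshold:
                detections.append((start, start + length, label, scores[label]))
    return detections
```

Classifying every subsequence at every scale is exactly why the article later notes that sliding windows require a large number of computations for accurate localization.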
Due to their nature, the main objective of the single-layered approaches has been to analyze relatively simple (and short) sequential movements of humans, such as walking, jumping, and waving.

In this review, we categorize single-layered approaches into two classes: space-time approaches and sequential approaches. Space-time approaches model a human activity as a particular 3-D volume in a space-time dimension or as a set of features extracted from the volume. The video volumes are constructed by concatenating image frames along a time axis, and are compared in order to measure their similarities. On the other hand, sequential approaches treat a human activity as a sequence of particular observations. More specifically, they represent a human activity as a sequence of feature vectors extracted from images, and they recognize activities by searching for such a sequence. We discuss space-time approaches in Section 2.1 and compare sequential approaches in Section 2.2.

2.1. Space-Time Approaches

An image is 2-dimensional data formulated by projecting a 3-D real-world scene, and it contains spatial configurations (e.g., shapes and appearances) of humans and objects. A video is a sequence of those 2-D images placed in chronological order. Therefore, a video input containing an execution of an activity can be represented as a particular 3-D XYT space-time volume constructed by concatenating 2-D (XY) images along time (T).

Space-time approaches are those that recognize human activities by analyzing the space-time volumes of activity videos. A typical space-time approach for human activity recognition is as follows. Based on the training videos, the system constructs a model 3-D XYT space-time volume representing each activity. When an unlabeled video is provided, the system constructs a 3-D space-time volume corresponding to the new video. The new 3-D volume is compared with each activity model (i.e., template volume) to measure the similarity in shape and appearance between the two volumes. The system finally deduces that the new video corresponds to the activity that has the highest similarity. This example can be viewed as a typical space-time methodology using the 3-D space-time volume representation and the template-matching algorithm for recognition. Figure 4 shows example 3-D XYT volumes corresponding to the human action of punching.

[Fig. 4. Example XYT volumes constructed by concatenating (a) entire images and (b) foreground blob images obtained from a punching sequence.]
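To make the template-matching pipeline just described concrete, here is a minimal Python sketch: it stacks grayscale frames into an XYT volume and scores an unlabeled volume against per-activity template volumes with normalized correlation. The correlation score and the function names are illustrative assumptions, not the specific method of any system surveyed here.

```python
import numpy as np

def build_xyt_volume(frames):
    """Stack 2-D (XY) grayscale frames along the time axis into a 3-D XYT volume."""
    return np.stack(frames, axis=-1).astype(np.float64)   # shape (H, W, T)

def volume_similarity(a, b):
    """Normalized correlation between two equally sized XYT volumes."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def classify_video(volume, templates):
    """Return the activity label whose template volume matches best.
    `templates` maps label -> model XYT volume of the same shape."""
    return max(templates, key=lambda label: volume_similarity(volume, templates[label]))
```

In practice the two volumes must first be brought to a common size, for example by resampling, which is one reason Table I later records that some volume-based approaches require templates or scaling.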
In addition to the pure 3-D volume representation, there are several variations of the space-time representation. First, the system may represent an activity as a trajectory (instead of a volume) in a space-time dimension or other dimensions. If the system is able to track feature points such as estimated joint positions of a human, the movements of the person performing an activity can be represented more explicitly as a set of trajectories. Second, instead of representing an activity with a volume or a trajectory, the system may represent an action as a set of features extracted from the volume or the trajectory. 3-D volumes can be viewed as rigid objects, and extracting common patterns from them enables their representation.

Researchers have also focused on developing various recognition algorithms using space-time representations to correctly match volumes, trajectories, or their features. We have already seen a typical example of an approach that uses template-matching, which constructs a representative model (i.e., a volume) per action using training data. Activity recognition is done by matching the model with the volume constructed from inputs. Neighbor-based matching algorithms (i.e., discriminative methods) have also been applied widely. In the case of neighbor-based matching, the system maintains a set of sample volumes (or trajectories) to describe an activity. The recognition is performed by matching the input with all (or a portion) of them. Finally, statistical modeling algorithms have been developed that match videos by explicitly modeling a probability distribution of an activity.

Accordingly, we have classified space-time approaches into several categories. A representation-based taxonomy and a recognition-based taxonomy have been jointly applied for the classification. That is, each of the activity recognition publications with space-time approaches is assigned to a slot corresponding to a specific (representation, recognition) pair. The left part of Figure 2 shows a detailed hierarchy tree of space-time approaches.

2.1.1. Action Recognition with Space-Time Volumes. The core of the recognition using space-time volumes is in the similarity measurement between two volumes. The system must be able to compute how similar human movements described in the two volumes are. In order to calculate the correct similarities, various types of space-time volume representations and recognition methodologies have been developed. Instead of concatenating entire images along time, some approaches only stack the foreground regions of a person (i.e., silhouettes) to track shape changes explicitly [Bobick and Davis 2001]. An approach to compare volumes in terms of their patches has been proposed as well [Shechtman and Irani 2005]. Ke et al. [2007] used over-segmented volumes, automatically calculating a set of 3-D XYT volume segments that corresponds to a moving human. Rodriguez et al. [2008] generated filters capturing characteristics of volumes, in order to match volumes more reliably and efficiently. In this section, we cover each of these approaches while focusing on our taxonomy of "what types of space-time volume they use" and "how they match volumes to recognize activities."

[Fig. 5. Examples of space-time action representation: motion-history images from Bobick and Davis [2001] (© 2001 IEEE). This representation can be viewed as a weighted projection of a 3-D XYT volume into a 2-D XY dimension.]

Bobick and Davis [2001] constructed a real-time action recognition system using template matching. Instead of maintaining the 3-dimensional space-time volume of each action, they represented each action with a template composed of two 2-dimensional images: a 2-dimensional binary motion-energy image (MEI) and a scalar-valued motion-history image (MHI). The two images are constructed from a sequence of foreground images, which essentially are weighted 2-D (XY) projections of the original 3-D XYT space-time volume. By applying a traditional template-matching technique to a pair of (MEI, MHI), their system was able to recognize simple actions like sitting, arm waving, and crouching. Further, their real-time system has been applied to the interactive play environment of children called the KidsRoom. Figure 5 shows example MHIs.
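The MEI/MHI construction can be summarized in a few lines. The sketch below follows the standard motion-history recursion (a pixel where motion occurs is set to the maximum duration tau; all other pixels decay by one per frame); the binary foreground masks are assumed to come from some background-subtraction step, and the variable names are illustrative.

```python
import numpy as np

def motion_templates(foreground_masks, tau=30):
    """Compute the motion-energy image (MEI) and motion-history image (MHI)
    from a sequence of binary foreground masks, using the recursive update
    H(x, y, t) = tau where motion occurs, else max(0, H(x, y, t-1) - 1)."""
    h, w = foreground_masks[0].shape
    mhi = np.zeros((h, w), dtype=np.float32)
    for mask in foreground_masks:
        mhi = np.where(mask > 0, float(tau), np.maximum(0.0, mhi - 1.0))
    mei = (mhi > 0).astype(np.uint8)   # binary union of all recent motion
    return mei, mhi / tau              # MHI scaled to [0, 1]
```

Matching a pair of (MEI, MHI) against stored action templates can then proceed with translation- and scale-invariant shape moments, as Bobick and Davis did with Hu moments, which is part of what makes the method real-time.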
Shechtman and Irani [2005] have estimated motion flows from a 3-D space-time volume to recognize human actions. They have computed a 3-D space-time video-template correlation, measuring the similarity between an observed video volume and maintained template volumes. Their similarity measurement can be viewed as a hierarchical space-time volume correlation. At every location of the volume (i.e., (x, y, t)), they extracted a small space-time patch around the location. Each volume patch captures the flow of a particular local motion, and the correlation between a patch in a template and a patch in video at the same location gives a local match score to the system. By aggregating these scores, the overall correlation between the template volume and a video volume is computed. When an unknown video is given, the system searches for all possible 3-D volume segments centered at every (x, y, t) that best match the template (i.e., sliding windows). Their system was able to recognize various types of human actions, including ballet movements, pool dives, and waving.

Ke et al. [2007] used segmented spatio-temporal volumes to model human activities. Their system applies a hierarchical meanshift to cluster similarly colored voxels, and obtains several segmented volumes. The motivation is to find the actor volume segments automatically and to measure their similarity to the action model. Recognition is done by searching for a subset of over-segmented spatio-temporal volumes that best matches the shape of the action model. Support vector machines (SVMs) have been applied to recognize human actions while considering both shapes and flows of the volumes. As a result, their system recognized simple actions such as hand waving and boxing from the KTH action database [Schuldt et al. 2004], as well as tennis plays in TV broadcast videos with more complex backgrounds.

Rodriguez et al. [2008] have analyzed 3-D space-time volumes by synthesizing filters: they adopted the maximum average correlation height (MACH) filters that have been used for an analysis of images (e.g., object recognition) to solve the action recognition problem. That is, they have generalized the traditional 2-D MACH filter for 3-D XYT volumes. For each action class, one synthesized filter that fits the observed volume is generated, and the action classification is performed by applying the synthesized action MACH filter and analyzing its response on the new observation. They have further extended the MACH filters to analyze vector-valued data using the Clifford Fourier transform. They have not only tested their system on the existing KTH dataset and the Weizmann dataset [Blank et al. 2005], but also on their own dataset constructed by gathering clips from movie scenes. Actions such as kissing and hitting have been recognized.

Table I compares the abilities of the space-time volume-based action recognition approaches. The major disadvantage of space-time volume approaches is the difficulty in recognizing actions when multiple persons are present in the scene. Most of the approaches apply the traditional sliding-window algorithm to solve this problem. However, this requires a large number of computations for the accurate localization of actions. Furthermore, they have difficulty recognizing actions that cannot be spatially segmented.

2.1.2. Action Recognition with Space-Time Trajectories. Trajectory-based approaches are recognition approaches that interpret an activity as a set of space-time trajectories. In trajectory-based approaches, a person is generally represented as a set of 2-dimensional (XY) or 3-dimensional (XYZ) points corresponding to his/her joint positions. Human body part estimation methodologies, especially stick figure modeling, have been widely used to extract the joint positions of a person at each image frame. As a human performs an action, changes in his/her joint positions are recorded as space-time trajectories, constructing 3-D XYT or 4-D XYZT representations of the action. Figure 6 shows example trajectories.
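A minimal sketch of the trajectory representation just described: each action becomes an array of per-frame joint coordinates, and two actions can be compared after resampling the trajectories to a common length and normalizing away position and scale. This resampling-plus-Euclidean comparison is a deliberately simple stand-in, under stated assumptions, for the view-invariant matching used in the papers discussed below.

```python
import numpy as np

def joint_trajectories(joints_per_frame):
    """Convert per-frame joint positions into space-time trajectories.
    `joints_per_frame` is a (T, J, D) array (D = 2 for XY, 3 for XYZ);
    the result is a (J, T, D) array: one trajectory per joint."""
    return np.transpose(np.asarray(joints_per_frame, dtype=np.float64), (1, 0, 2))

def normalize(traj, samples=50):
    """Resample a (T, D) trajectory to a fixed length and remove translation
    and scale, so trajectories of different durations become comparable."""
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, samples)
    resampled = np.stack([np.interp(t_new, t_old, traj[:, d])
                          for d in range(traj.shape[1])], axis=1)
    centered = resampled - resampled.mean(axis=0)
    return centered / (np.linalg.norm(centered) + 1e-8)

def trajectory_distance(action_a, action_b):
    """Distance between two actions represented as sets of joint trajectories."""
    return sum(np.linalg.norm(normalize(ta) - normalize(tb))
               for ta, tb in zip(action_a, action_b))
```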
The early work done by Johansson [1975] suggested that the tracking of joint positions is itself sufficient for humans to distinguish actions, and this paradigm has been studied in depth for the recognition of activities [Webb and Aggarwal 1982; Niyogi and Adelson 1994].

[Fig. 6. An example of trajectories of human joint positions when performing the human action of walking [Sheikh et al. 2005] (© 2005 IEEE). Figure (a) shows trajectories in XYZ space, and (b) shows those in XYT space.]

Several approaches used the trajectories themselves (i.e., sets of 3-D points) to represent and recognize actions directly [Sheikh et al. 2005; Yilmaz and Shah 2005b]. Sheikh et al. [2005] represented an action as a set of 13 joint trajectories in a 4-D XYZT space. They have used an affine projection to obtain normalized XYT trajectories of an action in order to measure the view-invariant similarity between two sets of trajectories. Yilmaz and Shah [2005b] presented a methodology to compare action videos obtained from moving cameras, also using a set of 4-D XYZT joint trajectories.

[Table I. A Comparison of the Abilities of Important Space-Time Approaches. The approach types, authors, and required low-level components are: space-time volume approaches: Bobick and Davis '01 (background), Shechtman and Irani '05 (none), Ke et al. '07 (none), Rodriguez et al. '08 (none); space-time trajectory approaches: Campbell and Bobick '95 (body-part estimation), Rao and Shah '01 (skin detection), Sheikh et al. '05 (body-part estimation); space-time feature approaches: Chomat and Crowley '99, Zelnik-Manor and Irani '01, Laptev and Lindeberg '03, Schuldt et al. '04, and Dollar et al. '05 (none), Yilmaz and Shah '05a and Blank et al. '05 (background), Niebles et al. '06, Wong et al. '07, Savarese et al. '08, Liu and Shah '08, Laptev et al. '08, and Ryoo and Aggarwal '09b (none). The remaining columns mark each approach's structural consideration (e.g., templates needed, scaling required, ordering only, proximity-based, co-occurrence only, grid-based) and whether it is scale invariant, view invariant, supports localization, and handles multiple activities; these per-approach marks are not recoverable in this copy. "Required low-levels" specifies the low-level components necessary for the approach to be applicable. "Structural consideration" shows temporal patterns that the approach is able to capture. "Scale invariant" and "view invariant" describe whether the approach is invariant to scale and view changes in videos; "localization" indicates the ability to correctly locate where the activity is occurring spatially and temporally. "Multiple activities" indicates that the system is designed to consider multiple activities in the same scene.]

Campbell and Bobick [1995] recognized human actions by representing them as curves in low-dimensional phase spaces. In order to track joint positions, they took advantage of the 3-D body-part models of a person. Based on the 3-D XYZ models estimated for each frame, they have defined the body phase space as a space where each axis represents an independent parameter of the body (e.g., ankle-angle or knee-angle) or its first derivative. In their phase space, a person's static state at each frame corresponds to a point, and an action corresponds to a set of points (i.e., a curve). The authors have projected the curve in the phase space into multiple 2-D subspaces, and maintained the projected curves to represent the action. Each curve is modeled in a cubic polynomial form, indicating that the authors assume the actions to be relatively simple in the projected subspace.
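A sketch of the phase-space idea, under simplifying assumptions: the body state at each frame is a vector of joint angles and their first derivatives, an action is the resulting curve, and each 2-D projection of the curve is summarized by a cubic polynomial fit, as the article describes. The choice of projections and the residual-based membership test are my illustrative assumptions, not Campbell and Bobick's exact procedure.

```python
import numpy as np
from itertools import combinations

def phase_space_points(joint_angles, dt=1.0):
    """Build phase-space points from per-frame joint angles (T, K): each
    point concatenates the angles with their first derivatives."""
    velocity = np.gradient(joint_angles, dt, axis=0)
    return np.hstack([joint_angles, velocity])       # shape (T, 2K)

def fit_projected_curves(points):
    """Fit a cubic polynomial y = p(x) in every 2-D subspace (pair of axes);
    a real system would keep only the most stable and reliable fits."""
    curves = {}
    for i, j in combinations(range(points.shape[1]), 2):
        curves[(i, j)] = np.polyfit(points[:, i], points[:, j], deg=3)
    return curves

def matches(curves, points, tol=0.1):
    """Verify whether new phase-space points lie near the stored curves."""
    residuals = [np.abs(np.polyval(p, points[:, i]) - points[:, j]).mean()
                 for (i, j), p in curves.items()]
    return float(np.mean(residuals)) < tol
```

Selecting only a few stable subspace curves, as described next, is what keeps this tractable when the number of axis pairs grows.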
Among all possible curves of 2-D subspaces, their system automatically selects the top k stable and reliable ones to be used for the recognition process.

Once an action representation, that is, a set of projected curves, has been constructed, Campbell and Bobick recognized the action by also converting an unseen video into a set of points in the phase space. Without explicitly analyzing the dynamics of the points from the unseen video, their system simply verifies whether the points are on the maintained curves (i.e., trajectories in the subspaces) when projected. Various types of basic ballet movements have been successfully recognized with markers attached to a subject to track joint positions.

Instead of maintaining trajectories to represent human actions, Rao and Shah's [2001] methodology extracts meaningful curvature patterns from the trajectories. They have tracked the position of a hand in 2-D image space using skin pixel detection, obtaining a 3-D XYT space-time curve. Their system extracts the positions of peaks of trajectory curves, representing an action as a set of peaks and intervals between them. They verified that these peak features are view-invariant. Automated learning of human actions is possible in their system by incrementally constructing several action prototypes as representations of human actions. These prototypes can be considered action templates, and the overall recognition process can be regarded as a template-matching process. As a result, by analyzing the peaks of trajectories, their system was able to recognize human actions in an office environment such as "opening a cabinet" and "picking up an object."

Again, Table I compares the trajectory-based approaches. The major advantage of such approaches is their ability to analyze the details of human movements. Furthermore, most of these methods are view-invariant. However, in order to do so, such methods generally require a strong low-level component that is able to correctly estimate the 3-D XYZ joint locations of persons appearing in a scene. The problem of 3-D body-part detection and tracking is still an unsolved problem, and researchers are actively working in this area.

2.1.3. Action Recognition Using Space-Time Local Features. The approaches discussed in this section use local features extracted from 3-D space-time volumes to represent and recognize human activities.
