The SUN Action Database: Collecting and Analyzing Typical Actions for Visual Scene Types

by

Catherine Anne White Olsson

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2013.

© Massachusetts Institute of Technology 2013. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 24, 2013

Certified by: Aude Oliva, Principal Research Scientist, Thesis Supervisor

Certified by: Antonio Torralba, Associate Professor, Thesis Supervisor

Accepted by: Prof. Dennis M. Freeman, Chairman, Master of Engineering Thesis Committee

Abstract

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2013, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science.

Recent work in human and machine vision has increasingly focused on the problem of scene recognition. Scene types are largely defined by the actions one might typically do there: an office, for example, is a place where someone would typically "work". I introduce the SUN Action database (short for "Scene UNderstanding - Action"): the first effort to collect and analyze free-response data from human subjects about the typical actions associated with different scene types. Responses were gathered on Mechanical Turk for twenty images per category, each depicting a characteristic view of one of 397 different scene types. The distribution of phrases is shown to be heavy-tailed and Zipf-like, whereas the distribution of semantic roots is not Zipf-like. Categories strongly associated with particular tasks or actions are shown to have lower overall diversity of responses. A hierarchical clustering analysis reveals a heterogeneous clustering structure, with some categories readily grouping together and others remaining apart even at coarse clustering levels. Finally, two simple classifiers are introduced for predicting scene types from associated actions: a nearest centroid classifier and an empirical maximum likelihood classifier. Both classifiers demonstrate greater than 50% classification performance in a 397-way classification task.

Thesis Supervisor: Aude Oliva
Title: Principal Research Scientist

Thesis Supervisor: Antonio Torralba
Title: Associate Professor

Acknowledgments

This work would not have been possible without the support of a great many people who have played a role in my education up to this point - advisors, mentors, teachers, colleagues, family, and friends. The following people are just a small subset of the vast number of people to whom I am extremely grateful:

- Aude, for being a constant source of enthusiasm and encouragement. I could not have asked for a more understanding, empathetic, and motivating advisor. Aude has been a role model as someone unafraid to "dream big" and to push the frontiers of knowledge at the intersection of academic disciplines. It has been a joy to work alongside one of my academic heroes, sharing in her vision and energy.
" Antonio, for inspiring me to be playful and to stay true to the engineering spirit: when the breadth of the bigger picture gets overwhelming, just try something and see what works! " My parents, for neverending support despite a distance of thousands of miles between us. Words cannot express my gratitude for their encouragement and pride. "Lots of Love!" " The many supportive and inspirational teachers, advisors, and mentors I have had over the years: Laura Schulz, Patrick Winston, Josh Tenenbaum, Rebecca Saxe, and Noah Goodman here at MIT, not to mention a great many teachers at Lakeside and in the PRISM program throughout my grade school experience. I am immensely indebted to the teachers throughout my life who have gone beyond simply imparting knowledge; who have taken the time to get to know me personally, and placed enormous trust in me and my abilities. " The hundreds of workers on Mechanical Turk without whom this work would quite literally not be possible, for their endless patience, and for their delightful sense of humor which kept me afloat during many tedious hours. " Michelle Greene, for sharing her LabelMe object analyses with me, which helped me immensely in figuring out how to wrap my head around this data and get an initial foothold. " The Writing and Communication Center, and to anyone who has ever asked me to write anything: every essay, report, or paper I've ever written has gone into preparing me for this. " Last but certainly not least, the friends and communities which imbue my life with meaning, context, stability, fulfillment, and overwhelming joy. You give me a reason to keep smiling, always. 5 6 Contents 1 Introduction 15 1.1 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.1.1 Perceiving and understanding visual scenes . . . . . . . . . . . 17 1.1.2 Associating scenes with actions . . . . . . . . . . . . . . . . . 18 1.1.3 Defining fine-grained scene categories . . . . . . . . . . . . . . 18 1.1.4 Datasets of actions: people in videos . . . . . . . . . . . . . . 19 1.1.5 Crowdsourcing attribute information for scenes with Mechani- cal Turk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.2 Q uestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.3 Structure and contributions of this work . . . . . . . . . . . . . . . . 21 2 Building a Dataset of Typical Actions for Scene Types 23 2.1 Stimuli: 397 scene types . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Interface: Mechanical Turk . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 A 33-category subset is easier to visualize . . . . . . . . . . . . . . . . 27 3 Describing the Distribution of Responses 29 3.1 Phrase statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.1 Phrase occurrence distribution is Zipf-like and heavy-tailed . . 30 3.1.2 Response diversity varies by category . . . . . . . . . . . . . . 31 3.2 Morphological stem statistics . . . . . . . . . . . . . . . . . . . . . . 33 7 3.2.1 Semantic content can be extracted by reducing constituent words to morphological stems . . . . . . . . . . . . . . . . . . . . . . 34 3.2.2 Responses were short . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.3 Morphological stem distribution is not Zipf-like . . . . . . . . 36 3.2.4 Morphological diversity varies by category . . . . . . . . . . . 
4 Visualizing the Action-Similarity Space of Scenes
  4.1 Normalized histogram similarity is a symmetric distance measure for frequency counts
  4.2 Similarity heatmaps
  4.3 Hierarchical clustering shows heterogeneous grouping structure

5 Predicting Scene Type
  5.1 Nearest-centroid classification
    5.1.1 Text classification strategies can inform scene-action classification
    5.1.2 Nearest-centroid classification is a simple approach that fulfills our constraints
    5.1.3 Results: Nearest centroid shows 51% classification accuracy over 397 classes
  5.2 Empirical maximum likelihood classification
    5.2.1 Bayes' Rule enables reasoning about hidden factors
    5.2.2 Simplifying assumptions enable us to estimate the likelihood
    5.2.3 Results: Maximum likelihood shows 67% classification accuracy over 397 classes
  5.3 Discussion

6 Future Directions
  6.1 Examine the effects of category name on actions
  6.2 Correlate objects, materials, and spatial properties with actions
  6.3 Compare the similarity space of actions with other similarity spaces
  6.4 Analyze image-to-image variability
  6.5 Richly model the joint distribution
  6.6 Incorporate semantic properties of natural language
  6.7 Relate action words directly to constituent objects
  6.8 Gather responses for scene type names without images

7 Contributions

A Additional tables

B Mechanical Turk Best Practices

References