ebook img

T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects PDF

3.1 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects

T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects Toma´sˇ Hodanˇ1,PavelHaluza1,Sˇteˇpa´nObdrzˇa´lek1,Jiˇr´ıMatas1,ManolisLourakis2,XenophonZabulis2 1CenterforMachinePerception,CzechTechnicalUniversityinPrague,CzechRepublic 2InstituteofComputerScience,FoundationforResearchandTechnology–Hellas,Heraklion,Greece 7 Abstract 1 0 We introduce T-LESS, a new public dataset for estimat- 2 ingthe6Dpose,i.e.translationandrotation,oftexture-less n rigid objects. The dataset features thirty industry-relevant a J objects with no significant texture and no discriminative 9 color or reflectance properties. The objects exhibit sym- 1 metriesandmutualsimilaritiesinshapeand/orsize. Com- pared to other datasets, a unique property is that some of ] theobjectsarepartsofothers.Thedatasetincludestraining V andtestimagesthatwerecapturedwiththreesynchronized C sensors, specifically a structured-light and a time-of-flight . s RGB-D sensor and a high-resolution RGB camera. There c are approximately 39K training and 10K test images from [ eachsensor. Additionally,twotypesof3Dmodelsarepro- 1 vided for each object, i.e. a manually created CAD model v and a semi-automatically reconstructed one. Training im- 8 agesdepictindividualobjectsagainstablackbackground. 9 4 Testimagesoriginatefromtwentytestsceneshavingvary- 5 ing complexity, which increases from simple scenes with 0 severalisolatedobjectstoverychallengingoneswithmul- . 1 tipleinstancesofseveralobjectsandwithahighamountof 0 clutterandocclusion.Theimageswerecapturedfromasys- 7 tematically sampled view sphere around the object/scene, 1 Figure 1. Examples of T-LESS test images (left) overlaid andareannotatedwithaccurategroundtruth6Dposesof : withcolored3Dobjectmodelsatthegroundtruth6Dposes v allmodeledobjects. Initialevaluationresultsindicatethat (right). Instances of the same object have the same color. i X thestateoftheartin6Dobjectposeestimationhasample The goal is to find instances of the modeled objects and roomforimprovement,especiallyindifficultcaseswithsig- r estimatetheir6Dposes. a nificant occlusion. The T-LESS dataset is available online atcmp.felk.cvut.cz/t-less. jectposecanbeusedtoenhanceone’sperceptionofreality byaugmentingobjectswithextrainformationsuchashints 1.Introduction forassemblyguidance. Texture-lessrigidobjectsarecommoninhumanenviron- The visual appearance of a texture-less object is domi- mentsandtheneedtolearn, detectandaccuratelylocalize natedbyitsglobalshape,color,reflectanceproperties,and them from images arises in a variety of applications. The the configuration of light sources. The lack of texture im- poseofarigidobjecthassixdegreesoffreedom,i.e. three pliesthattheobjectcannotbereliablyrecognizedwithtra- in translation and three in rotation, and its full knowledge ditional techniques relying on photometric local patch de- is often required. In robotics, for example, the 6D object tectors and descriptors [9, 31]. Instead, recent approaches posefacilitatesspatialreasoningandallowsanend-effector that can deal with texture-less objects have focused on lo- toactuponanobject. Inanaugmentedrealityscenario,ob- cal3Dfeaturedescription[33,51,19], andsemi-globalor globaldescriptionrelyingprimarilyonintensityedgesand 1 2 3 4 5 6 depth cues [20, 24, 54, 5, 14, 21, 27]. Therefore, RGB- D data consisting of aligned color and depth images, ob- tainedwithwidelyavailableKinect-likesensors,havecome 7 8 9 10 11 12 toplayanimportantrole. In this paper, we introduce a new public dataset for 6D pose estimation of texture-less rigid objects. An overview 13 14 15 16 17 18 oftheincludedobjectsandtestscenesisprovidedinFig.2. Thedatasetfeaturesthirtycommodityelectricalpartswhich havenosignificanttexture, discriminativecolorordistinc- 19 20 21 22 23 24 tive reflectance properties, and often bear similarities in shape and/or size. Furthermore, a unique characteristic of theobjectsisthatsomeofthemarepartsofothers. Forex- ample, objects7and8arebuiltupfromobject6, object9 25 26 27 28 29 30 is made of three copies of object 10 stacked next to each other, whilst the center part of objects 17 and 18 is nearly identicaltoobject13. Objectsexhibitingsimilarproperties arecommoninindustrialenvironments. 1 2 3 4 The dataset includes training and test images captured withatripletofsensors,i.e. astructuredlightRGB-Dsen- sorPrimesenseCarmine1.09,atime-of-flightRGB-Dsen- 5 6 7 8 sorMicrosoftKinectv2,andanRGBcameraCanonIXUS 950 IS. The sensors were time-synchronized and had sim- ilar perspectives. All images were obtained with an auto- maticprocedurethatsystematicallysampledimagesfroma 9 10 11 12 view sphere, resulting in ~39K training and ~10K test im- ages from each sensor. The training images depict objects inisolationwithablackbackground,whilethetestimages 13 14 15 16 originate from twenty table-top scenes with arbitrarily ar- ranged objects. Complexity of the test scenes varies from thosewithseveralisolatedobjectsandacleanbackground toverychallengingoneswithmultipleinstancesofseveral 17 18 19 20 objectsandwithahighamountofocclusionandclutter.Ad- ditionally,thedatasetcontainstwotypesof3Dmeshmod- elsforeachobject;onemanuallycreatedinCADsoftware andonesemi-automaticallyreconstructedfromthetraining Figure 2. T-LESS includes training images and 3D mod- RGB-D images. All occurrences of the modeled objects els of 30 objects (top) and test images of 20 scenes (bot- in the training and test images are annotated with accurate tom) – shown overlaid with colored 3D object models at ground truth 6D poses; see Fig. 1 for their qualitative and the ground truth poses. The images were captured from a Sec.4.1fortheirquantitativeevaluation. systematicallysampledviewspherearoundanobject/scene Thedatasetisintendedforevaluatingvariousflavorsof and are annotated with accurate ground truth 6D poses of the 6D object pose estimation problem [23] and other re- allmodeledobjects. lated problems, such as 2D object detection [50, 22] and objectsegmentation[49,17]. Sinceimagesfromthreesen- sorsareavailable,onemayalsostudytheimportanceofdif- ferentinputmodalitiesforagivenproblem. Anotheroption in complexity, so that it would provide different levels of istousethetrainingimagesforevaluating3Dobjectrecon- difficultyandbereasonablyfuture-proof,i.e. solvable,but struction methods [44], where the provided CAD models notsolvedbythecurrentstate-of-the-artmethods.Thediffi- canserveasthegroundtruth. cultyofthedatasetfor6Dobjectposeestimationisdemon- Our objectives in designing T-LESS were to provide a stratedbytherelativelylowperformanceofthemethodby datasetofasubstantialbutmanageablesize,witharigorous Hodanˇ etal.[24]. Thismethodotherwiseachievesaperfor- andcompletegroundtruthannotationthatisaccuratetothe mance close to the state of the art on the well-established levelofsensorresolution,andwithasignificantvariability datasetofHinterstoisseretal.[20]. The remainder of the paper is organized as follows. models of another 6 textured objects and 170 test images Sec. 2 reviews related datasets, Sec. 3 describes technical depictingtheobjectsplacedonakitchentable. detailsoftheacquisitionandpost-processingoftheT-LESS The Challenge and Willow datasets [58], which were dataset, Sec. 4 assesses the accuracy of the ground truth collectedforthe2011ICRASolutionsinPerceptionChal- posesandprovidesinitialevaluationresults,andSec.5con- lenge, shareasetof35texturedhouseholdobjects. Train- cludesthepaper. ingdataforeachobjectisgivenintheformof37RGB-D training images that show the object from different views, 2.RelatedDatasets plus a color point cloud obtained by merging the training images. The Challenge and Willow datasets respectively First we review datasets for estimating the 6D pose of contain176and353testRGB-Dimagesofseveralobjects specificrigidobjects, groupedbythetypeofprovidedim- insingleinstancesplacedontopofaturntable. TheWillow ages, then we mention a few datasets designed for simi- datasetalsofeaturesdistractorobjectsandobjectocclusion. larproblems. Ifnotstatedotherwise,thesedatasetssupply SimilaristheTUWdataset[1]thatpresents17texturedand groundtruthannotationsintheformof6Dobjectposes. texture-less objects appearing in 224 test RGB-D images. Insteadofaturntablesetup,imageswereobtainedbymov- 2.1.RGB-DDatasets ingarobotaroundastaticclutteredenvironmentwithsome objectsappearinginmultipleinstances. Only a few public RGB-D datasets, from over one hundred reported by Firman in [15], enable the evalua- The Rutgers dataset [37] is focused on perception for tion of 6D object pose estimation methods. Most of the roboticmanipulationduringpick-and-placetasksandcom- datasets reviewed in this section were captured with Mi- prises of images from a cluttered warehouse environment. crosoft Kinectv1 or Primesense Carmine1.09, which rep- It includes color 3D mesh models for 24 mostly textured resent the first generation of consumer-grade RGB-D sen- objectsfromtheAmazonPickingChallenge2015[11],that sorsoperatingonthestructured-lightprinciple. Thedataset were captured in more than 10K test RGB-D images with introduced in [17] was captured with Microsoft Kinectv2, variousamountsofocclusion. whichisbasedonthetime-of-flightprinciple. Aldomaetal.[2]provide3Dmeshmodelswithoutcolor For texture-less objects, the dataset of Hinterstoisser et information of 35 household objects that are both textured al.[20]hasbecomeastandardbenchmarkusedinmostof andtexture-lessandareoftensymmetricandmutuallysim- the recent work, e.g. [38, 4, 47, 24, 54]. It contains 15 ilarinshapeandsize. Thereare50testRGB-Dimagesof texture-lessobjectsrepresentedbyacolor3Dmeshmodel. table-top scenes with multiple objects in single instances, Eachobjectisassociatedwithatestsequenceconsistingof withnoclutterandvariouslevelsofocclusion. ~1200RGB-Dimages,eachofwhichincludesexactlyone The BigBIRD dataset [42] includes images of 125 instanceoftheobject.Thetestsequencesfeaturesignificant mostly textured objects that were captured in isolation on 2D and 3D clutter, but only mild occlusion, and since the aturntablewithmultiplecalibratedRGB-DandDSLRsen- objects have discriminative color, shape and/or size, their sors.Foreachobject,thedatasetprovides600RGB-Dpoint recognition is relatively easy. In the 6D localization prob- clouds, 600 high-resolution RGB images, and a color 3D lem (where information about the number and identity of meshmodelreconstructedfromthepointclouds.SinceBig- objectspresentintheimagesisprovidedbeforehand[23]), BIRD was acquired under very controlled conditions, it is state-of-the-art methods achieve recognition rates that ex- not concerned with occlusions, clutter, lighting changes or ceed95%formostoftheobjects. Brachmannetal.[4]pro- varying object-sensor distance. Georgakis et al. [17] pro- videdadditionalgroundtruthposesforallmodeledobjects vide6735testRGB-Dimagesfromkitchenscenesinclud- inoneofthetestsequencesfrom[20]. Thisextendedanno- ingasubsetoftheBigBIRDobjects. Groundtruthforob- tation introduces challenging test cases with various levels jects in the test images is provided only in the form of 2D ofocclusionandallowstheevaluationofmultipleobjectlo- boundingboxesand3Dpointlabeling. calization,witheachobjectappearinginasingleinstance. Laietal.[29]createdanextensivedatasetwith300com- Tejani et al. [47] presented a dataset with 2 texture-less mon household objects captured on a turntable from three and 4 textured objects. For each object, a color 3D mesh elevations.Itcontains250KsegmentedRGB-Dimagesand modelisprovidedtogetherwithatestsequenceofover700 22 annotated video sequences with a few hundred RGB-D RGB-Dimages. Theimagesshowseveralobjectinstances framesineach.Groundtruthisprovidedonlyintheformof with no to moderate occlusion, and with 2D and 3D clut- approximate rotation angles for training images and in the ter. Doumanoglouetal.[14]provideadatasetwith183test formof3Dpointlabelingfortestimages. imagesof2texturedobjectsfrom[47]that appearinmul- Schletteetal.[40]synthesizedRGB-Dimagesfromsim- tiple instances in a challenging bin-picking scenario with ulated object manipulation scenarios involving 4 texture- heavyocclusion. Furthermore,theyprovidecolor3Dmesh less objects from the Cranfield assembly benchmark [10]. Several small datasets that were used for evaluation of the SHOT descriptor are provided by Salti et al. [39]. These (cid:126) datasetsincludesyntheticdataaswellasdataacquiredwith 3 aspacetime-stereomethodandanRGB-Dsensor. 2.2.Depth-onlyandRGB-onlyDatasets (cid:126) The depth-only dataset of Mian et al. [34] includes 3D 2 meshmodelsof5objectsand50testdepthimagesacquired with an industrial range scanner. The test scenes contain only the modeled objects that occlude each other. A sim- ilar dataset is provided by Taati et al. [46]. The Desk3D (cid:126) 1 dataset [3] comprises of 3D mesh models for 6 objects which are captured in over 850 test depth images with oc- clusion,clutterandsimilarlylookingdistractorobjects.The datasetwasobtainedwithanRGB-Dsensor,howeveronly thedepthimagesarepubliclyavailable. Figure3. Acquisitionsetup: 1)turntablewithmarkerfield, TheIKEAdatasetbyLimetal.[30]providesRGBim- 2)screenensuringablackbackgroundfortrainingimages, ageswithobjectsbeingalignedwiththeirexactlymatched removed when capturing test images, 3) triplet of sensors 3D models. Crivellaro et al. [12] supply 3D CAD mod- attachedtoajigwithadjustabletilt. els and annotated RGB sequences with 3 highly occluded and texture-less objects. Mun˜oz et al. [36] provide RGB sequences of 6 texture-less objects that are each imaged inisolationagainstacleanbackgroundandwithoutocclu- sion. Furthertotheabove,thereexistRGBdatasetssuchas [13,50,38,25],forwhichthegroundtruthisprovidedonly intheformof2Dboundingboxes. 2.3.DatasetsforSimilarProblems The RGB-D dataset of Michel et al. [35] is focused on articulated objects, where the goal is to estimate the 6D pose of each object part, subject to the constraints intro- ducedbytheirjoints. Therearealsodatasetsforcategorical poseestimation.Forexample,the3DNet[55]andtheUoB- Figure 4. Sample training (top) and test (bottom) im- HOOC[53]containgeneric3DmodelsandRGB-Dimages ages. Left: RGB-DimagesfromPrimesenseCarmine1.09. annotated with 6D object poses. The UBC VRS [32], the Middle: RGB-DimagesfromMicrosoftKinectv2. Right: RMRC (a subset of NYU Depth v2 [41] with annotations High-resolutionRGBimagesfromCanonIXUS950IS.For derived from [18]), the B3DO [26], and the SUN RGB- theRGB-Dimages,bottom-lefthalvesshowtheRGBcom- D[43]provideno3Dmodelsandgroundtruthonlyinthe ponentswhereasthetop-rightshowthedepthcomponents. form of bounding boxes. The PASCAL3D+ [57] and the ObjectNet3D [56] provide generic 3D models and ground truth6Dposes,butonlyRGBimages. The rest of the section describes the process of dataset 3.TheT-LESSDataset preparation,whichincludesimageacquisition,cameracal- ibration,depthcorrection,3Dobjectmodelgenerationand Comparedtotherevieweddatasets,T-LESSisuniquein thegroundtruthposeannotation. itscombinationofthefollowingcharacteristics. Itcontains 1) alarger numberof industry-relevant objects, 2) training 3.1.AcquisitionSetup imagescapturedundercontrolledconditions,3)testimages withlargeviewpointchanges,objectsinmultipleinstances, Thetrainingandtestimageswerecapturedwiththeaid affected by clutter and occlusion; including test cases that ofthesetupshowninFig.3.Itconsistsofaturntable,where arechallengingevenforthestate-of-the-artmethods,4)im- the imaged objects were placed, and a jig with adjustable agescapturedwithasynchronizedandcalibratedtripletof tilt,towhichthesensorswereattached.Amarkerfieldused sensors,5)accurategroundtruth6Dposesforallmodeled forcameraposeestimationwasaffixedtotheturntable.The objects,and6)twotypesof3Dmodelsforeachobject. fieldwasextendedverticallytothesidesoftheturntableto facilitate pose estimation at lower elevations. To capture images that are acquired with a dense sampling of view- training images, the objects were placed in the middle of points, e.g. [13, 20, 38, 24]. To support such approaches, the turntable and in the front of a black screen, which en- T-LESS offers training images of every object in isolation suredauniformbackgroundatallelevations. Tointroduce fromafullviewsphere. Theseimageswereobtainedwith anon-uniformbackgroundinthetestimages,asheetofply- a systematic acquisition procedure which uniformly sam- woodwithmarkersatitsedgeswasplacedonthetopofthe pled elevation from 85◦ to −85◦ with a 10◦ step and the turntable. In some scenes, the objects were placed on the completeazimuthrangewitha5◦ step. Viewsfromtheup- topofotherobjects(e.g.books)togivethemdifferentele- per and lower hemispheres were captured separately, turn- vationsandthusinvalidateagroundplaneassumptionthat ing the object upside down in between. In total, there are might be made by an evaluated method. The depth of ob- 18×72 = 1296trainingimagesperobjectfromeachsen- ject surfaces in the training and test images is in the range sor. Exceptionsareobjects19and20,forwhichonlyviews 0.53 − 0.92m, which is within the sensing ranges of the fromtheupperhemispherewerecaptured,specifically648 usedRGB-Dsensorsthatare0.35−1.4mforCarmineand images from elevation 85◦ to 5◦. These objects are hori- 0.5−4.5mforKinect. zontally symmetric at the pose in which they were placed ontheturntable,thustheviewsfromtheupperhemisphere 3.2.CalibrationofSensors aresufficienttocapturetheirappearance. Testsceneswere capturedfromaviewhemispherewitha10◦ stepineleva- Intrinsic and distortion parameters of the sensors were tion(rangingfrom75◦ to15◦)anda5◦ stepinazimuth. A estimated withthe standard checkerboard-based procedure totalof7×72 = 504testimageswerecapturedperscene using OpenCV [6]. The root mean square re-projection byeachsensor. error calculated at corners of the calibration checkerboard squares is 0.51px for Carmine, 0.35px for Kinect, and Toremoveirrelevantpartsofthesceneintheimagespe- 0.43pxforCanon. FortheRGB-Dsensors,thecalibration riphery, the provided images are cropped versions of the was performed with the RGB images. The depth images captured ones. Resolution of the provided images is as were aligned to the RGB images using the factory depth- follows: 400 × 400px for training RGB-D images from to-colorregistrationavailablethroughmanufacturer’sSDKs CarmineandKinect,1900×1900pxfortrainingRGBim- (OpenNI2.2andKinectforWindowsSDK2.0). Thecolor agesfromCanon,720×540pxfortestRGB-Dimagesfrom andaligneddepthimages,whichareincludedinthedataset, CarmineandKinectand2560×1920pxfortestRGBim- are already processed to remove radial distortion. The in- agesfromCanon. SampleimagesareshowninFig.4. trinsicparameterscanbefoundatthedatasetwebsite. Parts of the marker field were visible in some of the All sensors were synchronized and extrinsically cali- trainingimages,especiallyatlowerelevations. Thesewere brated with respect to the turntable, making it possible to masked to ensure a black background everywhere around register any pair of images. Synchronization was essential the objects. To achieve this, we identified an object mask since the images were taken while the turntable was spin- inanimagebyback-projectingitsCADmodelandgradu- ning. The extrinsic calibration was achieved using fidu- ally darkened the image moving from the mask perimeter cialBCH-codemarkersfromARToolKitPlus[52]. Specifi- towardstheimageborder. cally,thedetectionofparticularmarkersinanimagecom- 3.4.DepthCorrection binedwiththeknowledgeoftheirphysicallocationonthe turntableprovidedasetof2D-3Dcorrespondences. These Similarly to [16, 45], we observed that the depths mea- were used to estimate the camera pose in the turntable co- suredbytheRGB-Dsensorsexhibitasystematicerror. To ordinate system by robustly solving the PnP problem and remove it, we collected depth measurements d at projec- then refining the estimated 6D pose by non-linearly min- tions of the marker corners and computed their expected imizing the cumulative re-projection error with the posest depth values d from the known marker coordinates. The e library from [31]. The root mean square re-projection er- measurements were collected from the depth range 0.53– ror, which was calculated at marker corners in all test im- 0.92m in which the objects appear in the training and test ages,is1.27pxforCarmine,1.37pxforKinect,and1.50px images. We found the following linear correction models forCanon. Thismeasurecombineserrorsinsensorcalibra- byleastsquaresfitting:d =1.0247·d−5.19forCarmine, c tion,markerfielddetectionandsensorposeestimationand and d = 1.0266·d−26.88 for Kinect (depth measured c is therefore larger than the aforementioned error in sensor in mm). In [45], only scaling is usedfor the depth correc- intrinsiccalibration. tion. According to Foix et al. [16], a 3-degree polynomial function suffices to correct depth in the 1–2m range. In 3.3.TrainingandTestImages our case, a narrower range is used and we found a simple Acommonstrategyfordealingwithpoorlytexturedob- linear polynomial to adequately account for the error: the jectsistoadoptatemplate-basedapproachtrainedonobject correction reduced the mean absolute difference from the expected depth d from 12.4mm to 2.8mm for Carmine e andfrom7.0mmto3.6mmforKinect. Theestimatedcor- rectionwasappliedtoalldepthimages,requiringnofurther actionfromthedatasetuser. 3.5.3DObjectModels For each object, a manually created CAD model and a semi-automatically reconstructed model are available (Fig. 5). Both models are provided in the form of 3D meshes with surface normals at model vertices. Surface color is included only for the reconstructed models. The normals were calculated using MeshLab [7] as the angle- weightedsumoffacenormalsincidenttoavertex[48]. Thereconstructedmodelswerecreatedusingfastfusion, Figure5. Examplesof3Dobjectmodels. Top: Manually avolumetric3DmappingsystembySteinbru¨ckeretal.[44]. created CAD models. Bottom: Semi-automatically recon- The input to fastfusion were the RGB-D training images structed models which include also surface color. Surface from Carmine and the associated camera poses estimated normalsatmodelverticesareincludedinbothmodeltypes. using the fiducial markers (see Sec. 3.2). For each object, twopartialmodelswerefirstreconstructed,onefortheup- perandanotherforthelowerviewhemisphere. Thepartial 3.6.GroundTruthPoses models were then aligned using the iterative closest point (ICP)algorithmappliedtotheirvertices.Thiswasfollowed To obtain ground truth 6D object poses for images of a by manual refinement that ensured correct registration of test scene, a dense 3D model of the scene was first recon- surface details that are visible only in color. The result- structed with the system of Steinbru¨cker et al. [44]. This ing alignment was applied to the camera poses to trans- wasaccomplishedusingall504RGB-Dimagesofthescene form them into a common reference frame, and the up- along with the sensor poses estimated using the turntable dated poses were used to reconstruct the full object model markers. The CAD object models were then manually fromallimages. Thesemodelscontainedsomeminorarti- alignedtothescenemodel. Toincreaseaccuracy,theobject facts, e.g. small spikes, which were removed manually. It modelswererenderedintoseveralselectedhigh-resolution isnotedthatsomeoftheobjectscontainsmallshinymetal scene images from Canon, misalignments were identified parts whose depth is not reliably captured by the current andtheposesweremanuallyrefinedaccordingly. Thispro- depthsensors;ingeneral,anyglossyortranslucentsurface cesswasrepeateduntilasatisfactoryalignmentoftheren- isproblematic. Hence,someoftheseparts,suchastheplug deringswiththesceneimageswasachieved.Thefinalposes poles,werenotreconstructed. weredistributedtoalltestimageswiththeaidoftheknown ThereconstructedmodelswerealignedtotheCADmod- camera-to-turntablecoordinatetransformations. Thetrans- els using the ICP algorithm and the alignment was fur- formed poses are provided as the ground truth poses with ther refined manually. Models of both types are therefore eachtestimage. defined in the same coordinate system and the provided ground truth poses are valid for both of them. The origin 4.DesignValidationandExperiments ofthemodelcoordinatesystemcoincideswiththecenterof theboundingboxoftheCADmodel. This section presents an accuracy assessment of the The geometrical similarity of the two model types was ground truth poses and examines the difficulty of T-LESS assessed by calculating the average distance from vertices witharecent6Dlocalizationmethod. of the reconstructed models to the closest surface points 4.1.AccuracyoftheGroundTruthPoses of the corresponding CAD models. The average distance over all object models was found to be 1.01mm, which is Aiming to evaluate the accuracy of the ground truth very low compared to the size of objects that ranges from poses, we compared the captured depth images, after the 58.13mm for object 13 to 217.16mm for object 8. Dis- correctiondescribedinSec.3.4,withdepthimagesobtained tancesintheoppositedirection,i.e. fromtheCADmodels bygraphicallyrenderingthe3Dobjectmodelsattheground tothereconstructedmodels,arenotinformativesincesome truthposes. Ateachpixelwithavaliddepthvalueinboth CADmodelscontaininnerpartsthatarenotrepresentedin images, we calculated the difference δ = d −d , where c r the reconstructed models. The Metro software by Cignoni d is the captured and d is the rendered depth. Table 1 c r etal.[8]wasusedtomeasurethemodeldifferences. presents statistics of these differences, aggregated over all trainingandtestdepthimages. Differencesexceeding5cm Sensor,modeltype µδ σδ µ|δ| med|δ| and amounting to around 2.5% of the measurements were Carmine,CAD -0.60 8.12 4.53 2.57 considered to be outliers and were pruned before calculat- Carmine,reconst. -0.79 7.72 4.28 2.46 ing the statistics. The outlying differences may be caused Kinect,CAD 4.46 11.76 8.76 5.67 byerroneousdepthmeasurements,orbyocclusioninduced Kinect,reconst. 4.08 11.36 8.40 5.45 bydistractorobjectsinthecaseoftestimages. Table1. Statisticsofdifferencesbetweenthedepthofob- Therendereddepthsalignwellwiththedepthscaptured jectmodelsatthegroundtruthposesandthecaptureddepth by Carmine, as indicated by the mean difference µ being δ (inmm). µ andσ isthemeanandthestandarddeviation close to zero. In the case of Kinect, we observed that the δ δ ofthedifferences,µ andmed isthemeanandtheme- RGBanddepthimagesareslightlymisregistered,whichis |δ| |δ| dianoftheabsolutedifferences. the cause of the positive bias in µ . The average absolute δ differenceµ islessthan5mmforCarmineand9mmfor |δ| Kinect,whichisneartheaccuracyofthesensors[28]andis 100 relatively small compared to the size of objects. The error 80 statistics are slightly favorable for the reconstructed mod- %] els(asopposedtotheCADmodels),astheywereobtained all [ 60 c 40 e fromthecaptureddepthimagesandthereforeexhibitsimi- R 20 larcharacteristicsandartifacts. Forexample,theplugpoles 0 1 2 3 4 5 6 7 8 9101112131415161718192021222324252627282930 are invisible to the RGB-D sensors and are missing in the Object ID reconstructedmodels,butarepresentintheCADmodels. 100 4.2.6DLocalization %] 80 The recent template-based method of Hodanˇ et al. [24] call [ 4600 e wasevaluatedonthe6Dlocalizationproblem. Theinputis R 20 comprisedofatestimagetogetherwiththeidentitiesofob- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 jectinstancesthatarepresentintheimage,andthegoalisto Scene ID estimatethe6Dposesoftheseinstances[23]. Themethod 100 wasevaluatedonalltestRGB-DimagesfromtheCarmine 80 steemnspolra.teTshweepraeragmeneeterarstewdefrreomsetthaes dtreasicnriinbgediminag[e2s4]f,rothme call [%] 4600 e R Carmine, andtheCADmodelswereemployedinthepose 20 refinement stage as detailed in [59]. Pose estimates were 0 0 20 40 60 80 100 evaluated as in [20], using the average distance error for Visible part of object surface [%] objects with indistinguishable views. This error measures Figure6. PerformanceofthemethodbyHodanˇ etal.[24] the misalignment between the surface of model M at the on the 6D localization problem. Shown are the recall per ground truth pose (R¯,¯t) and at the estimated pose (Rˆ,ˆt), object(top), recallperscene(middle), andrecallw.r.t. the andisdefinedas: percentageofthevisibleobjectsurface(bottom). (cid:13) (cid:13) e= avg min (cid:13)(cid:0)R¯x +¯t(cid:1)−(cid:0)Rˆx +ˆt(cid:1)(cid:13) . (cid:13) 1 2 (cid:13) x1∈Mx2∈M 2 Pose estimate (Rˆ,ˆt) is considered correct if e ≤ k·d, contains many similar objects and severe occlusions. Bot- wherek =0.1anddisthelargestdistancebetweenanypair tomofFig.6plotstherecall,accumulatedoverallobjects, ofmodelvertices,i.e. theobjectdiameter. Onlytheground as a function of the fraction of their image projection that truth poses at which at least 10% of the object surface is isunoccluded. Therecallincreasesproportionallywiththis visible were considered for the evaluation. The visibility fraction,illustratingthatocclusionisoneofthemainchal- wasestimatedasin[23]. lengesinT-LESS. Theperformanceismeasuredbyrecall,i.e.thepercent- The achieved mean recall over all objects is 67.2%, ageofthegroundtruthposesforwhichacorrectposewas which suggests a significant margin for improvement. We estimated. Fig. 6 presents achieved recall per object (top) notethatthesamemethodachievedameanrecallof95.4% and recall per scene (middle). The objects with the lowest on the dataset of Hinterstoisser et al. [20], which is close recallarethosethataresimilartootherobjects. Forexam- to the state of the art: [20] reports 96.6% and [5] reports ple,object1isoftenconfusedwithobject2,asareobjects 99.0% on this dataset. The latter is not directly compara- 20,21and22. Likewise,testscenescontainingsimilarob- ble since it was calculated only over 13 out of 15 objects jects are harder, with the hardest one being scene 20 that includedinthedataset. 5.Conclusion [9] A. Collet, M. Martinez, and S. S. Srinivasa. The MOPED framework: Objectrecognitionandposeestimationforma- ThispaperhaspresentedT-LESS,anewdatasetforeval- nipulation. IJRR,2011. uating 6D pose estimation of texture-less objects that can [10] K. Collins, A. Palmer, and K. Rathmill. The development facilitatesystematiccomparisonofpertinentmethods. The of a European benchmark for the comparison of assembly dataset features industry-relevant objects and is character- robotprogrammingsystems.InRobottechnologyandappli- ized by a large number of training and test images, accu- cations.1985. rate6Dgroundtruthposes,multiplesensingmodalities,test [11] N.Correll,K.E.Bekris,D.Berenson,O.Brock,A.Causo, scenes with multiple object instances and with increasing K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and difficultyduetoocclusionandclutter. Initialevaluationre- P.R.Wurman.LessonsfromtheAmazonpickingchallenge. sultsusingthedatasetindicatethatthestateoftheartin6D arXivpreprintarXiv:1601.05484,2016. objectposeestimationhasampleroomforimprovement. [12] A. Crivellaro, M. Rad, Y. Verdie, K. M. Yi, P. Fua, and V.Lepetit. Anovelrepresentationofpartsforaccurate3D TheT-LESSdatasetisavailableonlineat: objectdetectionandtrackinginmonocularimages.InICCV, cmp.felk.cvut.cz/t-less 2015. cvlab.epfl.ch/data/3d object tracking. [13] D. Damen, P. Bunnun, A. Calway, and W. Mayol-Cuevas. Acknowledgements Real-timelearninganddetectionof3Dtexture-lessobjects: Ascalableapproach. InBMVC,2012. ThisworkwassupportedbytheTechnologyAgencyof [14] A. Doumanoglou, R. Kouskouridas, S. Malassiotis, theCzechRepublicresearchprogramTE01020415(V3C– and T.-K. Kim. Recovering 6D object pose and pre- VisualComputingCompetenceCenter),CTUstudentgrant dicting next-best-view in the crowd. In CVPR, 2016. www.iis.ee.ic.ac.uk/rkouskou/research/ SGS15/155/OHK3/2T/13, and the European Commission 6D NBV.html. FP7 DARWIN Project, Grant No. 270138. The help of [15] M. Firman. RGBD datasets: Past, present and future. Jan Pola´sˇek and Avgousta Hatzidaki in creating the CAD arXiv:1604.00999,2016. modelsisgratefullyacknowledged. [16] S. Foix, G. Alenya, and C. Torras. Lock-in time-of-flight (ToF)cameras:asurvey. SensorsJournal,2011. References [17] G. Georgakis, M. A. Reza, A. Mousavian, P.-H. Le, and [1] A. Aldoma, T. Fa¨ulhammer, and M. Vincze. Au- J. Kosecka. Multiview RGB-D dataset for object in- tomation of “ground truth” annotation for multi-view stance detection. arXiv preprint arXiv:1609.07826, 2016. RGB-D object instance recognition datasets. In IROS, cs.gmu.edu/˜robot/gmu-kitchens.html. 2014. repo.acin.tuwien.ac.at/tmp/permanent/ [18] R. Guo and D. Hoiem. Support surface prediction in in- dataset index.php. door scenes. In ICCV, 2013. ttic.uchicago.edu/ [2] A. Aldoma, F. Tombari, L. Di Stefano, and M. Vincze. A ˜rurtasun/rmrc/indoor.php. globalhypothesesverificationmethodfor3Dobjectrecog- [19] Y.Guo,M.Bennamoun,F.Sohel,M.Lu,J.Wan,andN.M. nition. In ECCV, 2012. users.acin.tuwien.ac.at/ Kwok.Acomprehensiveperformanceevaluationof3Dlocal aaldoma/datasets/ECCV.zip. featuredescriptors. IJCV,2016. [3] U. Bonde, V. Badrinarayanan, and R. Cipolla. Robust [20] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, instance recognition in presence of occlusion and clut- K. Konolige, and N. Navab. Model based training, detec- ter. In ECCV, 2014. sites.google.com/site/ tionandposeestimationoftexture-less3Dobjectsinheav- ujwalbonde/publications/downloads. ilyclutteredscenes.InACCV,2012.campar.in.tum.de/ [4] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shot- Main/StefanHinterstoisser. ton, and C. Rother. Learning 6D object pose estimation [21] S.Hinterstoisser,V.Lepetit,N.Rajkumar,andK.Konolige. using 3D object coordinates. In ECCV, 2014. cvlab- Goingfurtherwithpointpairfeatures. InECCV,2016. dresden.de/iccv2015-occlusion-challenge. [22] T.Hodanˇ,D.Damen,W.Mayol-Cuevas,andJ.Matas. Ef- [5] E.Brachmann,F.Michel,A.Krull,M.Y.Yang,S.Gumhold, ficient texture-less object detection for augmented reality and C. Rother. Uncertainty-driven 6D pose estimation of guidance. InISMARW,2015. objectsandscenesfromasingleRGBimage. 2016. [23] T.Hodanˇ,J.Matas,andSˇ.Obdrzˇa´lek. Onevaluationof6D [6] G. Bradski and A. Kaehler. Learning OpenCV: Computer objectposeestimation. InECCVWorkshoponRecovering visionwiththeOpenCVlibrary.O’ReillyMedia,Inc.,2008. 6DObjectPose,2016. [7] P.Cignoni,M.Callieri,M.Corsini,M.Dellepiane,F.Ganov- [24] T. Hodanˇ, X. Zabulis, M. Lourakis, Sˇ. Obdrzˇa´lek, and elli,andG.Ranzuglia. MeshLab:anopen-sourcemeshpro- J.Matas. Detectionandfine3Dposeestimationoftexture- cessingtool. InEurographicsItalianChapterConf.,2008. lessobjectsinRGB-Dimages. InIROS,2015. [8] P.Cignoni,C.Rocchini,andR.Scopigno.Metro:measuring [25] E. Hsiao and M. Hebert. Occlusion reasoning for ob- erroronsimplifiedsurfaces. InComputerGraphicsForum, ject detection under arbitrary viewpoint. TPAMI, 2014. volume17,pages167–174.WileyOnlineLibrary,1998. www.cs.cmu.edu/˜./hebert/occarbview.html. [26] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, [42] A.Singh, J.Sha, K.S.Narayan, T.Achim, andP.Abbeel. K. Saenko, and T. Darrell. A category-level 3D object BigBIRD:Alarge-scale3Ddatabaseofobjectinstances. In dataset: Putting the Kinect to work. In Consumer Depth ICRA,2014. rll.berkeley.edu/bigbird. CamerasforComputerVision.2013. kinectdata.com. [43] S. Song, S. P. Lichtenberg, and J. Xiao. Sun RGB-D: A [27] W.Kehl,F.Milletari,F.Tombari,S.Ilic,andN.Navab.Deep RGB-D scene understanding benchmark suite. In CVPR, learningoflocalRGB-Dpatchesfor3Dobjectdetectionand 2015. rgbd.cs.princeton.edu. 6Dposeestimation. InECCV,2016. [44] F. Steinbru¨cker, J. Sturm, and D. Cremers. Volumetric [28] K.KhoshelhamandS.Elberink. Accuracyandresolutionof 3D mapping in real-time on a CPU. In ICRA, 2014. Kinectdepthdataforindoormappingapplications. Sensors, github.com/tum-vision/fastfusion. 2012. [45] J.Sturm,N.Engelhard,F.Endres,W.Burgard,andD.Cre- [29] K.Lai,L.Bo,X.Ren,andD.Fox.Alarge-scalehierarchical mers. A benchmark for the evaluation of RGB-D SLAM multi-viewRGB-Dobjectdataset. InICRA,2011. rgbd- systems. InIROS,2012. dataset.cs.washington.edu. [46] B. Taati, M. Bondy, P. Jasiobedzki, and M. Greenspan. [30] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing Variable dimensional local shape descriptors for ob- IKEA Objects: Fine Pose Estimation. In ICCV, 2013. ject recognition in range data. In ICCV, 2007. ikea.csail.mit.edu. rcvlab.ece.queensu.ca/˜qridb/lsdPage.html. [31] M. Lourakis and X. Zabulis. Model-based pose estima- [47] A. Tejani, D. Tang, R. Kouskouridas, and T.-K. Kim. tion for rigid objects. In Computer Vision Systems. 2013. Latent-classhoughforestsfor3Dobjectdetectionandpose www.ics.forth.gr/˜lourakis/posest. estimation. In ECCV, 2014. www.iis.ee.ic.ac.uk/ [32] D. Meger and J. J. Little. Mobile 3D object detection in rkouskou/research/LCHF.html. clutter.InIROS,2011.www.cs.ubc.ca/labs/lci/vrs. [48] G.Thu¨rrnerandC.A.Wu¨thrich. Computingvertexnormals [33] A.Mian,M.Bennamoun,andR.Owens. Ontherepeatabil- frompolygonalfacets. JournalofGraphicsTools,1998. ityandqualityofkeypointsforlocalfeature-based3Dobject [49] F.Tombari,L.DiStefano,andS.Giardino. Onlinelearning retrievalfromclutteredscenes. IJCV,2010. forautomaticsegmentationof3Ddata. In2011IEEE/RSJ [34] A. S. Mian, M. Bennamoun, and R. Owens. Three- InternationalConferenceonIntelligentRobotsandSystems, dimensional model-based object recognition and pages4857–4864.IEEE,2011. segmentation in cluttered scenes. TPAMI, 2006. [50] F.Tombari,A.Franchi,andL.DiStefano. BOLDfeatures staffhome.ecm.uwa.edu.au/˜00053650/ todetecttexture-lessobjects. InICCV,2013. recognition.html. [51] F.Tombari,S.Salti,andL.DiStefano.Uniquesignaturesof [35] F. Michel, A. Krull, E. Brachmann, M. Y. Yang, histogramsforlocalsurfacedescription. InECCV,2010. S. Gumhold, and C. Rother. Pose estimation of kine- [52] D. Wagner and D. Schmalstieg. ARToolKitPlus for pose matic chain instances via object coordinate regression. trackingonmobiledevices. InCVWW,2007. In BMVC, 2015. cvlab-dresden.de/iccv2015- articulation-challenge. [53] K. Walas and A. Leonardis. UoB highly occluded object [36] E.Mun˜oz,Y.Konishi,V.Murino,andA.D.Bue. Fast6D challenge II, 2016. www.cs.bham.ac.uk/research/ pose estimation for texture-less objects from a single RGB projects/uob-hooc. image. In ICRA, 2016. www.iit.it/datasets/vgm- [54] P.WohlhartandV.Lepetit. Learningdescriptorsforobject 6d-pose-of-texture-less-objects-dataset. recognitionand3Dposeestimation. InCVPR,2015. [37] C. Rennie, R. Shome, K. E. Bekris, and A. F. D. [55] W. Wohlkinger, A. Aldoma, R. B. Rusu, and M. Vincze. Souza. A dataset for improved RGBD-based ob- 3DNet:Large-scaleobjectclassrecognitionfromCADmod- ject detection and pose estimation for warehouse pick- els. In ICRA, 2012. repo.acin.tuwien.ac.at/tmp/ and-place. RA-L, 2016. www.pracsyslab.org/ permanent/3d-net.org. rutgers apc rgbd dataset. [56] Y. Xiang et al. ObjectNet3D: A large scale [38] R.Rios-CabreraandT.Tuytelaars. Discriminativelytrained database for 3D object recognition. In ECCV, 2016. templatesfor3Dobjectdetection: Arealtimescalableap- cvgl.stanford.edu/projects/objectnet3d. proach. InICCV,2013. [57] Y.Xiang,R.Mottaghi,andS.Savarese. BeyondPASCAL: [39] S. Salti, F. Tombari, and L. D. Stefano. SHOT: Unique Abenchmarkfor3Dobjectdetectioninthewild. InWinter signaturesofhistogramsforsurfaceandtexturedescription. ConferenceonApplicationsofComputerVision,2014. CVIU. www.vision.deis.unibo.it/research/80- [58] Z. Xie, A. Singh, J. Uang, K. S. Narayan, and P. Abbeel. shot. Multimodalblendingforhigh-accuracyinstancerecognition. [40] C.Schletteetal. Anewbenchmarkforposeestimationwith InIROS,2013. rll.berkeley.edu/2013 IROS ODP. ground truth from virtual reality. Production Engineering, [59] X.Zabulis,M.Lourakis,andP.Koutlemanis.3Dobjectpose 2014. www.mmi.rwth-aachen.de/exchange/data/ refinementinrangeimages. InICVS,2015. pesi2014/benchmark.htm. [41] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentationandsupportinferencefromRGBDimages. In ECCV, 2012. cs.nyu.edu/˜silberman/datasets/ nyu depth v2.html.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.