
Clustering High-Dimensional Data: First International Workshop, CHDD 2012, Naples, Italy, May 15, 2012, Revised Selected Papers

157 Pages·2015·5.862 MB·English

Francesco Masulli · Alfredo Petrosino · Stefano Rovetta (Eds.)

Clustering High-Dimensional Data

First International Workshop, CHDD 2012
Naples, Italy, May 15, 2012
Revised Selected Papers

Lecture Notes in Computer Science 7627

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zürich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7409

Editors

Francesco Masulli, DIBRIS, University of Genoa, Genoa, Italy
Stefano Rovetta, DIBRIS, University of Genoa, Genoa, Italy
Alfredo Petrosino, University of Naples "Parthenope", Naples, Italy

ISSN 0302-9743, ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-662-48576-7, ISBN 978-3-662-48577-4 (eBook)
DOI 10.1007/978-3-662-48577-4

Library of Congress Control Number: 2015950900
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web, and HCI

Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer-Verlag GmbH Berlin Heidelberg is part of Springer Science+Business Media (www.springer.com)

Preface

One of the most long-standing problems afflicting machine learning techniques is dataset dimensionality.
Owing to the evolution of technologies for acquiring and creating information, however, this issue has recently become ubiquitous. In many applications to real-world problems, we deal with data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine or biology, where DNA microarray technology and next-generation sequencing can produce a large number of measurements at once; the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary; and many others, including data integration and management, and social network analysis. In all these cases, the dimensionality of data makes learning problems hardly tractable.

In particular, dimensionality is a highly critical factor for the clustering task. The following problems need to be addressed for clustering high-dimensional data:

– When the dimensionality is high, the volume of the space increases so fast that the available data become sparse, and we cannot find reliable clusters, as clusters are data aggregations (curse of dimensionality).
– The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges (concentration effects).
– Different clusters might be found in different subspaces, thus a global filtering of attributes is not sufficient (local feature relevance problem).
– Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
– High-dimensional data could likely include irrelevant features, which may obscure the effect of the relevant ones.
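The concentration effect listed above can be observed numerically. The following is a minimal sketch, not taken from the volume: it assumes points drawn independently and uniformly from the unit hypercube and Euclidean distance, and the function name `relative_contrast` is purely illustrative. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    query point to n_points points, all drawn uniformly from [0, 1]^dim.

    Assumption for illustration: i.i.d. uniform data, Euclidean metric.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast between nearest and farthest neighbour shrinks as the
    # number of dimensions grows.
    for dim in (2, 10, 100, 1000):
        contrast = relative_contrast(500, dim)
        print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

Under these assumptions the relative contrast is large in two dimensions and becomes small in a thousand dimensions, which is one way to see why nearest-neighbour and distance-based cluster criteria lose discriminative power. Real data with correlated or irrelevant attributes may behave differently.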
This volume is the outcome of work done during the International Workshop on Clustering High-Dimensional Data, held at Istituto Italiano per gli Studi Filosofici, Palazzo Serra di Cassano, in Naples (Italy) on May 15, 2012, where speakers were subsequently invited to submit a paper related to their presentation. The papers collected here aim to present an updated view of many different approaches toward clustering high-dimensional data, and can be divided by topic into three groups.

The first group introduces the general subject and issues of high-dimensional data clustering. Chapter 1 provides a general introduction, while Chapter 2 explores some properties of high-dimensional data that make it difficult to detect and even to define clusters.

The second group of chapters presents examples of techniques used to find and investigate clusters in high dimensionality. Chapter 3 focuses on an approach to subspace clustering; Chapter 4 presents a selection of dimensionality-independent methods for comparing clusterings; and Chapter 5 deals with clustering high-dimensional time series.

The third group deals with the most common approach to tackling dimensionality problems, namely, dimensionality reduction and its application in clustering. Chapter 6 introduces the topic of intrinsic dimensionality estimation, and Chapter 7 presents a specific technique for intrinsic dimensionality estimation. Chapter 8 compares four dimensionality reduction methods for binary data, while the last contribution, Chapter 9, focuses on dimensionality reduction by feature selection using rough-fuzzy techniques.
July 2015
Francesco Masulli
Alfredo Petrosino
Stefano Rovetta

Organization

The International Workshop on Clustering High-Dimensional Data was organized as part of the Project "Clustering di dati ad alta dimensionalità" funded by the GNCS - Istituto Nazionale di Alta Matematica Francesco Severi (INdAM), in collaboration with the Istituto Italiano per gli Studi Filosofici (Naples, Italy), the Special Interest Group in Bioinformatics and Intelligence of INNS, the Task Force on Neural Networks of IEEE-CIS-TCBB, the Department of Computer and Information Sciences of the University of Genoa (Italy), and the Department of Applied Science, University of Naples Parthenope (Italy).

Workshop Chairs

Francesco Masulli, University of Genoa, Italy and Temple University, Philadelphia, USA
Alfredo Petrosino, University of Naples Parthenope, Italy

Scientific Secretary

Stefano Rovetta, University of Genoa, Italy

Event Management

Hassan Mahmoud, University of Genoa, Italy

This workshop has been made possible thanks to funding from GNCS - Gruppo Nazionale per il Calcolo Scientifico of the INdAM - Istituto Nazionale di Alta Matematica Francesco Severi. We warmly thank the Istituto Italiano di Studi Filosofici for supporting the event.

Contents

Clustering High-Dimensional Data
    Francesco Masulli and Stefano Rovetta
What are Clusters in High Dimensions and are they Difficult to Find?
    Frank Klawonn, Frank Höppner, and Balasubramaniam Jayaram
Efficient Density-Based Subspace Clustering in High Dimensions
    Ira Assent
Comparing Fuzzy Clusterings in High Dimensionality
    Stefano Rovetta and Francesco Masulli
Time Series Clustering from High Dimensional Data
    Carlo Drago and Germana Scepi
Data Dimensionality Estimation: Achievements and Challanges
    Francesco Camastra
A Novel Intrinsic Dimensionality Estimator Based on Rank-Order Statistics
    S. Bassis, A. Rozza, C. Ceruti, G. Lombardi, E. Casiraghi, and P. Campadelli
Dimensionality Reduction in Boolean Data: Comparison of Four BMF Methods
    Eduard Bartl, Radim Belohlavek, Petr Osicka, and Hana Řezanková
A Rough Fuzzy Perspective to Dimensionality Reduction
    Alessio Ferone and Alfredo Petrosino
Author Index

Clustering High-Dimensional Data

Francesco Masulli (1, 2) and Stefano Rovetta (1)

(1) Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi DIBRIS, Università di Genova, Genova, Italy
(2) Center for Biotechnology, Temple University, Philadelphia, USA
[email protected]

Abstract. This chapter introduces the task of clustering, concerning the definition of a structure aggregating the data, and the challenges related to its application to the unsupervised analysis of high-dimensional data. In the recent literature, many approaches have been proposed to address this problem. Developing efficient clustering methods for high-dimensional data is a great challenge for machine learning, and it is vitally important for obtaining safer decision-making processes and better decisions from the Big Data now available, which can mean greater operational efficiency, cost reduction, and risk reduction.

1 Introduction

Clustering aims to find a structure that aggregates the data into some groups with the property that data belonging to a group (or cluster) are more similar to data in that cluster than to data in other clusters.
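The defining property just stated can be checked for a given partition of a small dataset. The following is a minimal sketch under assumptions that are not from the chapter: dissimilarity is taken to be Euclidean distance, "more similar" is read as a smaller mean distance, and the function name `satisfies_cluster_property` is hypothetical.

```python
import math
from statistics import mean


def satisfies_cluster_property(points, labels):
    """Return True if every point is, on average, closer to the other members
    of its own cluster than to the members of each other cluster.

    Assumptions for illustration: Euclidean distance as dissimilarity and
    mean distance as the aggregate; singleton clusters are skipped.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # nothing to compare within a singleton cluster
            continue
        within = mean(math.dist(points[i], points[j]) for j in own)
        for other_label, members in clusters.items():
            if other_label == label:
                continue
            between = mean(math.dist(points[i], points[j]) for j in members)
            if within >= between:
                return False
    return True


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
compact_labels = [0, 0, 0, 1, 1, 1]  # groups follow the two visible blobs
mixed_labels = [0, 1, 0, 1, 0, 1]    # groups cut across the blobs

print(satisfies_cluster_property(points, compact_labels))
print(satisfies_cluster_property(points, mixed_labels))
```

The first labelling respects the two well-separated groups and passes the check, while the second mixes them and fails. Other readings of "similar" (for example, nearest-neighbour or density-based criteria) would lead to different checks.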
Since the beginning of the 21st century, the decreasing cost of storage and society's growing interest in collecting data of all kinds, on scales unimaginable until recently and in most fields, from science to finance to the Internet, mobile devices, and sensors, have resulted in the availability of large, continuously growing masses of data (Big Data). Often, those data contain hyper-informative details about each observed instance. The problem of data clustering in high-dimensional data spaces has therefore become of vital interest for the analysis of those Big Data, to obtain safer decision-making processes and better decisions.

This chapter is organized as follows: Sect. 2 introduces the problem of clustering; Sect. 3 presents the problem of high-dimensional data analysis; Sect. 4 surveys some relevant approaches to high-dimensional data clustering; Sect. 5 presents the conclusions.

2 Defining Clustering

The concept of clustering dates back at least to the Greek philosophers. Plato (∼400 BC), in his Statesman dialogue [15], introduces the approach of grouping objects based on their similar properties (categorization). This approach was

© Springer-Verlag Berlin Heidelberg 2015
F. Masulli et al. (Eds.): CHDD 2012, LNCS 7627, pp. 1–13, 2015.
DOI: 10.1007/978-3-662-48577-4_1
