INVITED PAPER

Content-Based Music Information Retrieval: Current Directions and Future Challenges

Current retrieval systems can handle tens of thousands of music tracks, but new systems need to aim at huge online music collections that contain tens of millions of tracks.

By Michael A. Casey, Member IEEE, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes, and Malcolm Slaney, Senior Member IEEE

Manuscript received September 3, 2007; revised December 5, 2007. This work was supported in part by the U.K. Engineering and Physical Sciences Research Council under Grants EPSRC EP/E02274X/1 and EPSRC GR/S84750/01. M. A. Casey and C. Rhodes are with the Department of Computing, Goldsmiths College, University of London, SE14 6NW London, U.K. (e-mail: [email protected]; [email protected]). R. Veltkamp is with the Department of Information and Computing Sciences, Utrecht University, 3508 TB Utrecht, The Netherlands (e-mail: [email protected]). M. Goto is with the National Institute of Advanced Industrial Science and Technology (AIST), Ibaraki 305-8568, Japan (e-mail: [email protected]). M. Leman is with the Department of Art, Music, and Theater Sciences, Ghent University, 9000 Ghent, Belgium (e-mail: [email protected]). M. Slaney is with Yahoo! Research Inc., Santa Clara, CA 95054 USA (e-mail: [email protected]). Digital Object Identifier: 10.1109/JPROC.2008.916370

ABSTRACT | The steep rise in music downloading over CD sales has created a major shift in the music industry away from physical media formats and towards online products and services. Music is one of the most popular types of online information, and there are now hundreds of music streaming and download services operating on the World-Wide Web. Some of the music collections available are approaching the scale of ten million tracks, and this has posed a major challenge for searching, retrieving, and organizing music content. Research efforts in music information retrieval have involved experts from music perception, cognition, musicology, engineering, and computer science engaged in truly interdisciplinary activity that has resulted in many proposed algorithmic and methodological solutions to music search using content-based methods. This paper outlines the problems of content-based music information retrieval and explores the state-of-the-art methods using audio cues (e.g., query by humming, audio fingerprinting, content-based music retrieval) and other cues (e.g., music notation and symbolic representation), and identifies some of the major challenges for the coming years.

KEYWORDS | Audio signal processing; content-based music information retrieval; symbolic processing; user interfaces

I. INTRODUCTION

Music is now so readily accessible in digital form that personal collections can easily exceed the practical limits on the time we have to listen to them: ten thousand music tracks on a personal music device have a total duration of approximately 30 days of continuous audio. Distribution of new music recordings has become easier, prompting a huge increase in the amount of new music that is available. In 2005, there was a three-fold growth in legal music downloads and mobile phone ring tones, worth $1.1 billion worldwide, offsetting the global decline in CD sales; and in 2007, music downloads in the U.K. reached new highs [1]–[3].

Traditional ways of listening to music, and methods for discovering music, such as radio broadcasts and record stores, are being replaced by personalized ways to hear and learn about music. For example, the advent of social networking Web sites, such as those reported in [4] and [5], has prompted a rapid uptake of new channels of music discovery among online communities, changing the nature
of music dissemination and forcing the major record labels to rethink their strategies.

It is not only music consumers who have expectations of searchable music collections; along with the rise of consumer activity in digital music, there are new opportunities for research into using large music collections for discovering trends and patterns in music. Systems for trend spotting in online music sales are in commercial development [6], as are systems to support musicological research into music evolution over the corpus of Western classical music scores and available classical recordings [7], [8]. Musicology research aims to answer questions such as: which musical works and performances have been historically the most influential?

Strategies for enabling access to music collections, both new and historical, need to be developed in order to keep up with expectations of search and browse functionality. These strategies are collectively called music information retrieval (MIR) and have been the subject of intensive research by an ever-increasing community of academic and industrial research laboratories, archives, and libraries. There are three main audiences identified as the beneficiaries of MIR: industry bodies engaged in recording, aggregating, and disseminating music; end users who want to find music and use it in a personalized way; and professionals: music performers, teachers, musicologists, copyright lawyers, and music producers.

At present, the most common method of accessing music is through textual metadata. Metadata can be rich and expressive, so there are many scenarios where this approach is sufficient. Most music download services currently use metadata-only approaches and have reached a degree of success with them. However, when catalogues become very large (greater than a hundred thousand tracks), it is extremely difficult to maintain consistent, expressive metadata descriptions because many people created the descriptions, and variation in concept encodings impacts search performance. Furthermore, the descriptions represent opinions, so editorial supervision of the metadata is paramount [9].

An example of a commercial metadata-driven music system is pandora.com, where the user is presented with the instruction "type in the name of your favorite artist or song and we'll create a radio station featuring that music and more like it." The system uses metadata to estimate artist similarity and track similarity; then, it retrieves tracks that the user might want to hear in a personalized radio station. Whereas the query is simple to pose, finding the answer is costly. The system works using detailed human-entered track-level metadata enumerating musical-cultural properties for each of several hundred thousand tracks. It is estimated that it takes about 20–30 minutes per track of one expert's time to enter the metadata (http://www.pandora.com/corporate). The cost is therefore enormous in the time taken to prepare a database containing all the information necessary to perform similarity-based search. In this case, it would take approximately 50 person-years to enter the metadata for one million tracks: at roughly 25 minutes per track, one million tracks amount to some 25 million minutes, or about 48 years of continuous description effort.
Social media web services address the limitations of centralized metadata by opening the task of describing content to public communities of users and leveraging the power of groups to exchange information about content. This is the hallmark of Web 2.0. With millions of users of portals such as MySpace, Flickr, and YouTube, group behavior means that users naturally gravitate towards those parts of the portal (categories or groups) with which they share an affinity, so they are likely to find items of interest indexed by users with similar tastes. However, the activity on social networking portals is not uniform across the interests of society and culture at large, being predominantly occupied by technologically sophisticated users; social media is therefore essentially a type of editorial metadata process.

In addition to metadata-based systems, information about the content of music can be used to help users find music. Content-based music description identifies what the user is seeking even when he does not know specifically what he is looking for. For example, the Shazam system (shazam.com), described in [10], can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar and deliver the artist, album, and track title along with nearby locations to purchase the recording or a link for direct online purchasing and downloading. Users with a melody but no other information can turn to the online music service Nayio (nayio.com), which allows one to sing a query and attempts to identify the work.

In the recording industry, companies have used systems based on symbolic information about musical content, such as melody, chords, rhythm, and lyrics, to analyze the potential impact of a work in the marketplace. Services such as Polyphonic HMI's Hit Song Science and Platinum Blue Music Intelligence use such symbolic information with techniques from Artificial Intelligence to make consultative recommendations about new releases.

Because there are so many different aspects to music information retrieval and different uses for it, we cannot address all of them here. This paper addresses some of the recent developments in content-based analysis and retrieval of music, paying particular attention to the methods by which important information about music signals and symbols can be automatically extracted and processed for use with music information retrieval systems. We consider both audio recordings and musical scores, as it is beneficial to look at both, when they are available, to find clues about what a user is looking for.

The structure of this paper is as follows: Section II introduces the types of tasks, methods, and approaches to evaluation of content-based MIR systems; Section III presents methods for high-level audio music analysis; Section IV discusses audio similarity-based retrieval; symbolic music analysis and retrieval are presented in Section V; Section VI presents an overview of advances in music visualization and browsing; Section VII discusses systems for music information retrieval; and we conclude in Section VIII with a discussion of challenges and future directions for the field.

II. USE CASES AND APPROACHES

A. Use Cases: Specificities and Query Types

Content-based MIR is engaged in intelligent, automated processing of music. The goal is to make music, or information about music, easier to find. To support this goal, most MIR research has been focused on automatic music description and evaluation of the proposed methods. The field is organized around use cases which define a type of query, the sense of match, and the form of the output. Queries and output can be textual information (metadata), music fragments, recordings, scores, or music features. The match can be exact, retrieving music with specific content, or approximate, retrieving near neighbors in a musical space where proximity encodes musical similarity, for example.

The main components of an MIR system are detailed in Fig. 1. These are query formation, description extraction, matching, and, finally, music document retrieval. The scope of an MIR system can be situated on a scale of specificity for which the query type and the choice of exact or approximate matching define the characteristic specificity of the system. Those systems that identify exact content of individual recordings, for example, are called high-specificity systems; those employing broad descriptions of music, such as genre, have low specificity: that is, a search given a query track will return tracks having little content directly in common with the query, but with some global characteristics that match. Hence, we divide specificity into three broad categories: high-specificity systems match instances of audio signal content; mid-specificity systems match high-level music features, such as melody, but do not match audio content; and low-specificity systems match global (statistical) properties of the query. Table 1 enumerates some of the MIR use cases and their specificities (high, mid, or low). A more comprehensive list of tasks and their specificities is given in [11].

Fig. 1. Flowchart of canonical content-based query system.

Table 1. Examples of MIR Tasks and Their Specificities.
There are three basic strategies for solving MIR use cases. Each strategy is suited to a given specificity. The first is based on conceptual metadata: information that is encoded and searched like text and is suited to low-specificity queries. The second approach uses high-level descriptions of music content corresponding with intuitive or expert knowledge about how a piece of music is constructed; this approach is suited to mid-specificity queries. The third strategy is based on low-level signal-based properties, which are used for all specificities. We outline each of these three approaches in the following sections.

B. Approaches

1) Metadata: Metadata is the driver of MIR systems. As such, many services exist simply to provide reliable metadata for existing collections of music, either for end users or for large commercial music collections. Most music listeners use metadata in their home listening environments. A common MIR task is to seek metadata from the internet for digital music that was "ripped" from compact disc to a computer. This is the core functionality of automatic Web-based services such as Gracenote (gracenote.com) and MusicBrainz (musicbrainz.org); both of these metadata repositories rely on user-contributed content to scale track-level music description to millions of entries.

These services provide both factual metadata, namely objective truths about a track, and cultural metadata, which contains subjective concepts. For a metadata system to work, its descriptions of music must be accurate and the meaning of the metadata vocabulary widely understood. Web 2.0 provides a partial solution in that communities of users can vote on a track's metadata. This democratic process at least ensures that the metadata for a track is consistent with the usage of one, or more, communities.

Problems associated with factual information (artist, album, year of publication, track title, and duration) can severely limit the utility of metadata. Ensuring the generality of the associated text fields, for example, consistency of spelling, capitalization, international characters, special characters, and order of proper names, is essential to useful functioning [9].

In addition to factual metadata, subjective, culturally determined information at the level of the whole track is often used to retrieve tracks. Common classes of such metadata are mood, emotion, genre, style, and so forth. Most current music services use a combination of factual and cultural metadata. There has also been much interest in automatic methods for assigning cultural, and factual, metadata to music. Some services collect user preference data, such as the number of times particular tracks have been played, and use the information to make new music recommendations to users based on the user community. For example, Whitman and Rifkin [12] used music descriptions generated from community metadata; they achieved Internet-wide description by using data mining and information retrieval techniques. Their extracted data was time aware, reflecting changes both in the artists' style and in the public's perception of the artists. The data was collected weekly, and language analysis was performed to associate noun and verb phrases with musical features extracted from audio of each artist. The textual approach to MIR is a very promising new direction in the field: a comprehensive review of the methods and results of such research is beyond the scope of the current paper.
For all its utility, metadata cannot solve the entirety of MIR due to the complexities outlined above. Commercial systems currently rely heavily on metadata but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for. This gap is one of the opportunities for content-based methods, which hold the promise of being able to complement metadata-based methods and give users access to new music via processes of self-directed discovery and musical search that scale to the totality of available music tracks. For the remainder of this paper, we focus primarily on content-based music description rather than factual or culturally determined parameters. However, content-based methods are considered not replacements for but enhancements of metadata-based methods.

2) High-Level Music Content Description: An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of the music. In the early days of MIR, many query-by-humming systems were proposed that sought to extract melodic content from polyphonic audio signals (those with multiple simultaneous musical lines) so that a user could search for music by singing or humming part of the melody; such systems are now being deployed as commercial services; see, for example, nayio.com. A survey of sung-query methods was conducted by Hu and Dannenberg in [13].

High-level intuitive information about music embodies the types of knowledge that a sophisticated listener would have about a piece of music, whether or not they know they have that knowledge:

"It is melody that enables us to distinguish one work from another. It is melody that human beings are innately able to reproduce by singing, humming, and whistling. It is melody that makes music memorable: we are likely to recall a tune long after we have forgotten its text." [14]

Even though it is an intuitive approach, melody extraction from polyphonic recordings, i.e., multiple instruments playing different lines simultaneously, remains extremely difficult to achieve. Surprisingly, it is difficult to extract melody not only from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony.
Therefore, extraction of high-level music content descriptions is a subgoal of MIR and the subject of intensive research. Common high-level descriptors are identified in Table 2. The goal of such tasks is to encode music into a schema that conforms to traditional Western music concepts that can then be used to make queries and search music.

Table 2. High-Level Music Features (Hard to Extract).

Automatic extraction of factual, cultural, and high-level music descriptions has been a subject of intense study in the MIREX music information retrieval experimental exchange. MIREX provides a framework for formal evaluation of MIR systems using centralized tasks, datasets, platforms, and evaluation methods [15], [16]. As such, MIREX has become a very important indicator of the state of the art for many subtasks within the field of MIR. A summary of the results of the 2007 high-level music tasks is given in Table 3; the query-by-humming task has a particularly high score. It is interesting to note that the best-performing system on this task used low-level audio matching methods rather than extracting a high-level melody feature from audio [18]. The authors suggest that low-level audio methods outperform symbolic methods even when clean symbolic information is available, as in this task.

Table 3. Summary of Results of Best-Performing Classification and Recognition Systems in MIREX 2007 Exchange.

Because there is a great number of music recordings available that can be used as a first-stage input to a high-level music description system, this motivates work on extracting high-level music features from low-level audio content. The MIREX community extends the range of tasks that are evaluated each year, allowing valuable knowledge to be gained on the limits of current algorithms and techniques.

3) Low-Level Audio Features: The third strategy for content-based music description is to use the information in the digital audio. Low-level audio features are measurements of audio signals that contain information about a musical work and music performance. They also contain extraneous information due to the difficulty of precisely measuring just a single aspect of music, so there is a tradeoff between the signal-level description and the high-level music concept that is encoded.

In general, low-level audio features are segmented in three different ways: frame-based segmentations (periodic sampling at 10 ms–1000 ms intervals), beat-synchronous segmentations (features are aligned to musical beat boundaries), and statistical measures that construct probability distributions out of features (bag-of-features models). Many low-level audio features are based on the short-time spectrum of the audio signal. Fig. 2 illustrates how some of the most widely used low-level audio features are extracted from a digital audio music signal using a windowed fast Fourier transform (FFT) as the spectral extraction step. Both frame-based and beat-based windows are evident in the figure.

Fig. 2. Schematic of common audio low-level feature extraction processes. From left to right: log-frequency chromagram, Mel-frequency cepstral coefficients, linear-frequency chromagram, and beat tracking. In some cases, the beat tracking process is used to make the features beat synchronous; otherwise, segmentation uses fixed-length windows.
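To make the frame-based pipeline concrete, the following sketch slices a mono signal into overlapping windowed frames and applies an FFT to each, yielding one magnitude spectrum per frame. This is a minimal illustration under assumed parameters (2048-sample frames, 512-sample hop, Hanning window), not the implementation behind Fig. 2, and the helper names are ours.

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=512):
    """Frame-based segmentation: slice a mono signal (len(x) >= frame_len)
    into overlapping frames of frame_len samples, one every hop samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def magnitude_spectra(x, frame_len=2048, hop=512):
    """Windowed FFT of each frame, keeping only the magnitude."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)

# Example: 5 s of a 440 Hz tone at 44.1 kHz gives one spectrum per ~11.6 ms hop.
sr = 44100
t = np.arange(5 * sr) / sr
X = magnitude_spectra(np.sin(2 * np.pi * 440.0 * t))
print(X.shape)  # (427, 1025)
```

Each row of such a matrix is then the raw material for the folded representations described next.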
a) Short-Time Magnitude Spectrum: Many low-level audio features use the magnitude spectrum as a first step for feature extraction because the phase of the spectrum is not as perceptually salient for music as the magnitude. This is generally true except in the detection of onsets and in phase continuation for sinusoidal components.

b) Constant-Q/Mel Spectrum: The ear's response to an acoustic signal is logarithmic in frequency and uses nonuniform frequency bands, known as critical bands, to resolve close frequencies into a single band of a given center frequency. Many systems represent the constant-bandwidth critical bands using a constant-Q transform, where the Q is the ratio of bandwidth to frequency [17], [19]. It is typical to use some division of the musical octave for the frequency bands, such as a twelfth, corresponding to one semitone in Western music, but it is also common to use more perceptually motivated frequency spacing for band centers. Fig. 3 shows the alignment between a set of linearly spaced frequency band edges and the corresponding logarithmically spaced twelfth-octave bands. The Mel frequency scale has linearly spaced filters in the lower frequency range and logarithmically spaced filters above 1300 Hz. Both logarithmic and Mel frequency scales are used. The Mel or constant-Q spectrum can be obtained from a linear spectrum by summing the powers in adjacent frequency bands. This approach has the advantage of being able to employ the efficient FFT to compute the spectrum.

Fig. 3. Folding of a set of linearly spaced frequency bands (lower graph) onto a set of logarithmically spaced frequency bands (upper graph). The x-axis shows frequency of bands; upper lines are labeled by their Western music notation pitch class and lower lines by their FFT bin number (for 16384 bins with 44.1 kHz sample rate).

c) Pitch-Class Profile (Chromagram): Another common type of frequency folding is used to represent the energy due to each pitch class in twelfth-octave bands, called a pitch-class profile (PCP) [20]–[22]. This feature integrates the energy in all octaves of one pitch class into a single band. There are 12 equally spaced pitch classes in Western tonal music, independent of pitch height, so there are typically 12 bands in a chromagram representation. Sometimes, for finer resolution of pitch information, the octave is divided into an integer multiple of 12, such as 24, 36, or 48 bands. Tuning systems that use equally spaced pitch classes are called equal temperament. Recently, some studies have explored extracting features for tuning systems that do not use equally spaced pitch classes: a necessary extension for application to non-Western music.

d) Onset Detection: Musical events are delineated by onsets; a note has an attack followed by sustain and decay portions. Notes that occur simultaneously in music are often actually scattered in time, and the percept is integrated by the ear-brain system. Onset detection is concerned with marking just the beginnings of notes. There are several approaches to onset detection, employing spectral differences in the magnitude spectrum of adjacent time points, or phase differences in adjacent time points, or some combination of the two (complex-number onset detection) [23]–[25]. Onset detection is one of the tasks studied in the MIREX framework of MIR evaluation shown in Table 3.
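A minimal sketch of the spectral-difference approach just described: magnitude increases between adjacent frames are summed into an onset-strength curve, and local maxima above a threshold are marked as onsets. The half-wave rectification and the mean-plus-one-standard-deviation threshold are illustrative assumptions, not a reconstruction of the systems in [23]–[25].

```python
import numpy as np

def onset_strength(mag_spectra):
    """Spectral difference: sum the positive magnitude change between
    adjacent frames; energy decreases are ignored (half-wave rectification)."""
    diff = np.diff(mag_spectra, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)

def pick_onsets(strength):
    """Return frame indices that are local maxima above a global threshold."""
    threshold = strength.mean() + strength.std()
    is_peak = (strength[1:-1] > strength[:-2]) & \
              (strength[1:-1] >= strength[2:]) & \
              (strength[1:-1] > threshold)
    return np.where(is_peak)[0] + 1

# Onset times in seconds follow from the frame rate: frame_index * hop / sr.
```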
e) Mel/Log-Frequency Cepstral Coefficients: Mel-frequency cepstral coefficients (MFCC) take the logarithm of the Mel magnitude spectrum and decorrelate the resulting values using a discrete cosine transform. This is a real-valued implementation of the complex cepstrum in signal processing [19]. The effect of MFCCs is to organize sinusoidal modulation of spectral magnitudes by increasing modulation frequency in a real-valued array. Values at the start of the array correspond to long-wavelength spectral modulation and therefore represent the projection of the log magnitude spectrum onto a basis of formant peaks. Values at the end of the MFCC array are the projection coefficients of the log magnitude spectrum onto short-wavelength spectral modulation, corresponding to harmonic components in the spectrum. It is usual to use about 20 MFCC coefficients, therefore representing the formant peaks of a spectrum; this extraction corresponds, in part, to musical timbre: the way the audio sounds other than its pitch and rhythm.

f) Spectral Flux: The spectral flux of a musical signal estimates the fine spectral-temporal structure in different frequency bands by measuring the modulation amplitudes in mid-to-high spectral bands [26], [27]. The resulting feature is a two-dimensional matrix, with frequency bands in the rows and modulation frequencies in the columns, representing the rate of change of power in each spectral band.

g) Decibel Scale (Log Power): The decibel scale is employed for representing power in spectral bands because the scale closely represents the ear's response. The decibel scale is calculated as ten times the base-10 logarithm of power.

h) Tempo/Beat/Meter Tracking: As shown in Fig. 2, beat extraction follows from onset detection, and it is often used to align the other low-level features. Alignment of features provides a measurement for every beat interval, rather than at the frame level, so the low-level features are segmented by musically salient content. This has recently proven to be exceptionally useful for mid-specificity MIR tasks such as cover song and version identification.

Low-level audio features in themselves cannot tell us much about music; they encode information at too fine a temporal scale to represent perceptually salient information. It is usual in MIR research to collect audio frames into one of several aggregate representations. Table 4 describes this second-stage processing of low-level audio features, which encodes more information than individual audio frames. An aggregate feature is ready for similarity measurement, whereas individual low-level audio feature frames are not. The chosen time scale for aggregate features depends on the specificity and temporal acuity of the task.

Table 4. Frame Aggregation Methods for Low-Level Audio Features.

These low-level audio features and their aggregate representations are used as the first stage in bottom-up processing strategies. The task is often to obtain a high-level representation of music as a next step in the processing of music content. The following section gives a summary of some of the approaches to bridging the gap between low-level and high-level music tasks such as these.
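Before moving on, here is a minimal sketch of the frame aggregation described above, assuming a simple "bag-of-features" summary: a track's per-frame feature vectors (for example, 20-dimensional MFCC frames) are collapsed into a mean and covariance, and tracks are compared by a distance between the summaries. The mean-and-covariance choice and the Euclidean distance between means are our illustrative assumptions; Table 4 lists other aggregates.

```python
import numpy as np

def aggregate(frames):
    """Bag-of-features aggregation: collapse a (n_frames, n_dims) feature
    matrix into track-level statistics ready for similarity measurement."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def track_distance(agg_a, agg_b):
    """Crude low-specificity similarity: distance between mean vectors."""
    return float(np.linalg.norm(agg_a[0] - agg_b[0]))

# Example with random stand-ins for two tracks' 20-dim MFCC frames.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(400, 20)), rng.normal(size=(500, 20))
print(track_distance(aggregate(a), aggregate(b)))
```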
III. AUDIO ANALYSIS

In contrast to speech recognition and text IR systems, most music has several streams of related information occurring in parallel. As such, music is organized both horizontally (in time) and vertically (in frequency). Furthermore, the information in music is constructed with hierarchical schemas. Systems for analyzing and searching music content must seek to represent many of these viewpoints simultaneously to be effective.

As discussed in Section II-B, one of the intermediate goals of MIR is to extract high-level music content descriptions from low-level audio processes. The following sections describe research into extracting such high-level descriptions with a view to transforming musical audio content into representations that are intuitive for humans to manipulate and search. We begin the discussion with the related high-level music description problems of beat tracking, tempo estimation, and meter tracking.

A. Beat Tracking

Automatic estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter, is not only essential for the computational modeling of music understanding but also useful for MIR. Temporal properties estimated from a musical piece can be used for content-based querying and retrieval, automatic classification, music recommendation, and playlist generation. If the tempo of a musical piece can be estimated, for example, it is easy to find musical pieces having a similar tempo without using any metadata. Once the musical beats are estimated, we can use them as the temporal unit for high-level beat-based computation instead of low-level frame-based computation. This facilitates the estimation of other musical descriptions, such as the music structure and chorus sections [22]. Since the estimated beats can also be used for normalizing the temporal axis of musical pieces, the beat-based time alignment facilitates time-scale invariant music identification or cover song identification [28], [29].

Fig. 4. Beat tracking a music file with the BeatRoot tool. The lower portion of the graph shows audio power and the upper portion shows a spectrogram. Dark vertical lines are beat positions. Numbers running along the top of the graph are inter-beat intervals in milliseconds.

Here, we define beat tracking (including measure or bar-line estimation) as the process of organizing musical audio signals into a hierarchical beat structure [30].
The typical beat structure comprises the quarter-note level (the tactus level, represented as almost regularly spaced beat times) and the measure level (bar lines). The basic nature of tracking the quarter-note level is represented by two parameters.

Period: The period is the temporal difference between the times of two successive beats. The tempo (beats per minute) is inversely proportional to the period.

Phase: The phase corresponds to actual beat positions and equals zero at beat times.

On the other hand, the measure level is defined on beat times because the beat structure is hierarchical: the beginnings of measures (bar-line positions) coincide with beat times. The difficulty of beat tracking depends on how explicitly the beat structure is expressed in the target music: it depends on temporal properties such as tempo changes and deviations, rhythmic complexity, and the presence of drum sounds.

1) Tracking Musical Beats (Quarter-Note Level): The basic approach of estimating the period and phase of the quarter-note (tactus) level is to detect onset times and use them as cues. Many methods assume that a frequently occurring inter-onset interval (IOI), the temporal difference between two onset times, is likely to be the beat period and that onset times tend to coincide with beat times (i.e., sounds are likely to occur on beats).

To estimate the beat period, a simple technique is to calculate the histogram of IOIs between two adjacent onset times, or cluster the IOIs, and pick out the maximum peak or the top-ranked cluster within an appropriate tempo range. This does not necessarily correspond to the beat period, though.
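A minimal sketch of this IOI-histogram technique, with the caveat just noted that the most frequent interval need not be the true beat period. The tempo range and 10 ms bin width are illustrative assumptions.

```python
import numpy as np

def tempo_from_iois(onset_times, min_bpm=60.0, max_bpm=180.0, bin_s=0.010):
    """Histogram the intervals between adjacent onsets and return the tempo
    (BPM) implied by the most frequent interval in the allowed range."""
    iois = np.diff(np.sort(onset_times))          # adjacent IOIs in seconds
    lo, hi = 60.0 / max_bpm, 60.0 / min_bpm       # allowed period range
    iois = iois[(iois >= lo) & (iois <= hi)]
    if iois.size == 0:
        return None
    counts, edges = np.histogram(iois, bins=np.arange(lo, hi + bin_s, bin_s))
    k = int(np.argmax(counts))
    period = 0.5 * (edges[k] + edges[k + 1])      # center of the peak bin
    return 60.0 / period
```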
A more sophisticated technique is to calculate a windowed autocorrelation function of an onset-time sequence, power envelope, or spectral flux of the input signal, or of a continuous onset representation with peaks at onset positions, and pick out peaks in the result. This can be considered an extended version of the IOI histogram because it naturally takes into account various temporal differences such as those between adjacent, alternate, and every third onset times. Another sophisticated technique is to apply a set of comb-filter resonators, each tuned to a possible period, to the time-varying degree of musical accentuation [31].

For audio-based beat tracking, it is essential to split the full frequency band of the input audio signal into several frequency subbands and calculate periodicities in each subband. Goto [30] proposed a method where the beat-period analysis is first performed within seven logarithmically equally spaced subbands and those results are then combined across the subbands by using a weighted sum. Scheirer [31] also used the idea of this subband-based beat tracking and applied a set of comb-filter resonators to the degrees of musical accentuation of six subbands to find the most resonant period. To locate periodicity in subband signals, Sethares and Staley [32] used a periodicity transform instead of comb-filter resonators.

After estimating the beat period, the phase should be estimated. When onset times are used to estimate the period, a windowed cross-correlation function is applied between an onset-time sequence and a tentative beat-time sequence whose interval is the estimated period. The result can be used to predict the next beat in a real-time beat-tracking system. On the other hand, when the degrees of musical accentuation are used, the internal state of the delays of comb-filter resonators that have lattices of delay-and-hold stages can be used to determine the phase [31]. To estimate the period and phase simultaneously, there are other approaches using adaptive oscillators [33].

2) Dealing With Ambiguity: The intrinsic reason that beat tracking is difficult is the problem of inferring an original beat structure that is not expressed explicitly. This causes various ambiguous situations, such as those where different periods seem plausible and where several onset times obtained by frequency analysis may correspond to a beat.

There are variations in how ambiguous situations in determining the beat structure are managed. A traditional approach is to maintain multiple hypotheses, each having a different possible set of period and phase. A beam search technique or multiple-agent architectures [30], [34] have been proposed to maintain hypotheses. A more advanced, computationally intensive approach for examining multiple hypotheses is to use probabilistic generative models and estimate their parameters. Probabilistic approaches with maximum likelihood estimation, MAP estimation, and Bayes estimation could maintain distributions over all parameters, such as the beat period and phase, and find the best hypothesis as if all possible pairs of period and phase were evaluated simultaneously. For example, Hainsworth and Macleod [35] explored the use of particle filters for audio-based beat tracking on the basis of MIDI-based methods by Cemgil and Kappen [36]. They made use of Markov chain Monte Carlo (MCMC) algorithms and sequential Monte Carlo algorithms (particle filters) to estimate model parameters, such as the period and phase.
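As a complement, the autocorrelation technique from the beginning of this subsection can be sketched in a few lines: an onset-strength envelope is correlated with itself, and the strongest lag inside an allowed tempo range is taken as the beat period. This bare-bones version omits the windowing, subband analysis, comb-filter resonators, phase estimation, and probabilistic machinery discussed above; the tempo limits are illustrative assumptions.

```python
import numpy as np

def tempo_by_autocorrelation(onset_env, frame_rate, min_bpm=60.0, max_bpm=180.0):
    """Pick the autocorrelation lag (in frames) with the strongest
    periodicity within the allowed tempo range; return tempo in BPM."""
    env = onset_env - onset_env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
    lo = int(frame_rate * 60.0 / max_bpm)   # shortest allowed period
    hi = int(frame_rate * 60.0 / min_bpm)   # longest allowed period
    lag = lo + int(np.argmax(ac[lo : hi + 1]))
    return 60.0 * frame_rate / lag
```

With a 512-sample hop at 44.1 kHz, frame_rate is about 86 frames per second, so a lag of 43 frames would indicate roughly 120 BPM.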
achieve perfect beat tracking for various kinds of music. Therefore, new approaches are proposed every year, 1) Estimating Melody and Bass Lines by Finding the including holistic beat tracking [39] where information Predominant F0 Trajectory: Since the melody line tends to aboutmusicstructureestimatedbeforebeattrackinghelps havethemostpredominantharmonicstructureinmiddle- totrackbeatsbyaddingaconstraintthatsimilarsegments andhigh-frequencyregionsandthebasslinetendstohave of music should have corresponding beat structure. the most predominant harmonic structure in a low- frequencyregion,thefirstclassicideaofestimatingmelody B. Melody and Bass Estimation andbasslinesistofindthemostpredominantF0insound Automatic estimation of melody and bass lines is mixtures with appropriate frequency-range limitation. In important because the melody forms the core of Western 1999, Goto [43], [44] proposed a real-time method called musicandisastrongindicatorfortheidentityofamusical PreFEst (Predominant-F0 Estimation method) which piece,seeSectionII-B,whilethebassiscloselyrelatedto detects the melody and bass lines in monaural sound the harmony. These lines are fundamental to the percep- mixtures. Unlike most previous F0 estimation methods, tionofmusicandusefulinMIRapplications.Forexample, PreFEstdoesnotassumethenumberofsoundsources. theestimatedmelody(vocal)linefacilitatessongretrieval PreFEst basically estimates the F0 of the most predo- based on similar singing voice timbres [41], music minant harmonic structureVthe most predominant F0 retrieval/classification based on melodic similarities, and corresponding to the melody or bass lineVwithin an in- music indexing for query by humming which enables a tentionally limited frequency range of the input sound usertoretrieveamusicalpiecebyhummingorsingingits mixture. It simultaneously takes into consideration all melody. Moreover, for songs with vocal melody, once the possibilitiesfortheF0andtreatstheinputmixtureasifit singingvoiceisextractedfrompolyphonicsoundmixtures, contains all possible harmonic structures with different the lyrics can be automatically synchronized with the weights (amplitudes). It regards a probability density singing voice by using a speech alignment technique and function (PDF) of the input frequency components as a can be displayed with the phrase currently being sung weighted mixture of harmonic-structure tone models highlighted during song playback, like the Karaoke (represented by PDFs) of all possible F0s and simulta- display [42]. neouslyestimatesboththeirweightscorrespondingtothe The difficulty of estimating melody and bass lines relative dominance of every possible harmonic structure depends on the number of channels: the estimation for andtheshapeofthetonemodelsbymaximumaposteriori stereo audio signals is easier than the estimation for probability (MAP) estimation considering their prior monaural audio signals because the sounds of those lines distribution.Itthenconsidersthemaximum-weightmodel tend to be panned to the center of stereo recordings and as the most predominant harmonic structure and obtains the localization information can help the estimation. In its F0. The method also considers the F0’s temporal general, most methods deal with monaural audio signals continuity by using a multiple-agent architecture. Vol.96,No.4,April2008|Proceedings of the IEEE 677