TITLE: Bayesian Music Alignment (Dissertation, full text)
AUTHOR(S): Maezawa, Akira
CITATION: Maezawa, Akira. Bayesian Music Alignment. Kyoto University, 2015, Doctor of Informatics
ISSUE DATE: 2015-03-23
URL: https://doi.org/10.14989/doctor.k19106
RIGHT: Under the terms of the license, the full text was made public on 2015/10/03

Bayesian Music Alignment

Akira Maezawa

Abstract

This thesis addresses temporal alignment of music audio signals with a symbolic music score and alignment of multiple music audio signals, each of which represents a common piece of music (music alignment). This is an important task in many fields such as music production, musicological analysis, and informed sound source separation. This thesis focuses on alignment among multiple audio signals (audio-to-audio alignment), and that between a symbolic score and one or more audio signals (audio-to-score alignment). Moreover, the thesis focuses on Western music, which is polyphonic, consists mostly of harmonic sounds, and is played according to a music score.

Music alignment is difficult because there are variations among different performances, even though they all play the same music score. Such variation among music audio signals calls for a probabilistic treatment of music audio signals. Furthermore, the fact that different audio signals have some aspects in common calls for a framework for encoding different constraints that are known in advance. These requirements can be satisfied through Bayesian inference. This thesis approaches music alignment from a Bayesian perspective.

Chapter 1 presents an overview of music alignment and its relevance. Chapter 2 reviews background on alignment techniques.

Chapter 3 presents a Bayesian audio-to-score alignment method. Here, the symbolic music score is given, and the main goal is to infer the alignment, taking into account variations of audio signals in timbre, volume, and tempo. To deal with variations in timbre and dynamics, polyphonic pitched sounds in the spectral domain are modeled as a probabilistically weighted sum of spikes.
These spikes are placed at frequencies where prominent energy is expected to be observed, by analyzing which notes are present in the music score at a given position. Robustness to variations in timbre and dynamics is attained by assuming a weakly informative prior on the weights. Furthermore, the model represents a smoothly changing tempo trajectory shared among different parts, while allowing for slight asynchronies between different parts. An experimental evaluation demonstrated that the proposed method achieved a median alignment error of about 30 ms on real-world ensemble pieces, and about 50 ms on synthetic orchestral recordings.

Chapter 4 presents a Bayesian audio-to-audio alignment method. The problem is difficult because, unlike audio-to-score alignment, the underlying music score is not given. To tackle this issue, the underlying musical piece is expressed probabilistically. Then, it is inferred from the input audio signals, assuming that each audio signal plays the same musical piece. Since a music score consists of relatively few note combinations that get reused throughout the piece, the musical piece is represented as a Markov chain with sparse transition probabilities. Furthermore, the audio signals are represented so that they have different but interrelated tempo trajectories. An experimental evaluation on a real-world piano music collection showed a median alignment error of 60 ms, outperforming existing probabilistic methods.

In a real-world use scenario, music alignment deteriorates further for two more reasons. First, some audio signals may play only a subset of the music score. For example, a user might be interested in aligning an orchestral piece with a hummed melodic line of the piece. This new kind of alignment is defined as subset music alignment. Second, the room acoustics may vary significantly. For example, an orchestral recording may be recorded in a highly reverberant hall, while a hummed melodic line of the piece may be recorded in a bedroom.
The next two chapters address these two problems.

Chapter 5 presents a subset music alignment method. It is difficult because the input signals play different sequences of note combinations, with some notes that are played in common. To tackle this issue, the proposed method decomposes the input audio into components common to different recordings and those unique to each recording, and simultaneously aligns the audio signals based on the common components. Namely, a hierarchical point process called the hierarchical Dirichlet process (HDP) is used to represent a subset-like relationship in two layers. On the upper layer, one HDP represents each note combination in a piece of music as a subset of all possible notes that are used in the piece. Then, on the lower layer, the other HDP represents the note combinations inside each audio signal as subsets of the note combinations of the top layer HDP. HMMs are used to encode the order of possible notes to play. Evaluations showed that when aligning audio signals that had only one part in common, the proposed method achieved a median alignment error of about 200 ms, outperforming existing methods.

Chapter 6 presents a dereverberation method that can be used as a front-end to any alignment method. It is designed to attenuate long-term reverberation (called the late reverberation), which causes past musical notes to smear into the current time instance. Dereverberation is formulated as a deconvolution problem of a non-negative autoregressive (AR) process in the power spectrogram domain. Since the order of the AR process is unknown, the Dirichlet process is used to encode a varying number of coefficients that are used. The model is non-conjugate, making inference difficult. To allow inference, a novel inference algorithm based on minorization-maximization is derived. An experimental evaluation showed that even though audio-to-score alignment accuracy degrades under a reverberant environment, the proposed method is capable of recovering audio-to-score alignment accuracy comparable to that under no reverberation.
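
As a rough illustration of the late-reverberation model summarized above, the following Python sketch treats the power envelope of a single frequency bin as a non-negative AR process: each observed frame is the dry power plus a weighted sum of earlier observed frames. It is only a sketch under stated assumptions. The AR coefficients and the delay are taken as known here, whereas the method summarized above treats the AR order as unknown and infers the model with a minorization-maximization algorithm. The function names `add_late_reverb` and `remove_late_reverb` and the parameter `delay` are hypothetical and are not taken from the thesis.

```python
def _predicted_tail(observed, t, coeffs, delay):
    """Weighted sum of past observed frames that smear into frame t."""
    return sum(a * observed[t - delay - k]
               for k, a in enumerate(coeffs)
               if t - delay - k >= 0)


def add_late_reverb(dry, coeffs, delay=1):
    """Simulate wet[t] = dry[t] + sum_k coeffs[k] * wet[t - delay - k].

    `dry` is a non-negative power envelope for one frequency bin;
    `coeffs` are non-negative AR coefficients (assumed known here).
    """
    wet = []
    for t, s in enumerate(dry):
        wet.append(s + _predicted_tail(wet, t, coeffs, delay))
    return wet


def remove_late_reverb(wet, coeffs, delay=1):
    """Subtract the predicted reverberant tail and clip at zero,
    so that the estimated dry power stays non-negative."""
    return [max(y - _predicted_tail(wet, t, coeffs, delay), 0.0)
            for t, y in enumerate(wet)]


# A single note onset followed by silence, in one frequency bin.
dry = [1.0, 0.0, 0.0, 0.0]
wet = add_late_reverb(dry, coeffs=[0.5])
restored = remove_late_reverb(wet, coeffs=[0.5])
print(wet)       # [1.0, 0.5, 0.25, 0.125]
print(restored)  # [1.0, 0.0, 0.0, 0.0]
```

With a single coefficient of 0.5, the impulse is smeared into a decaying tail, and subtracting the predicted tail restores the impulse. Clipping at zero is a simple way to keep the estimate non-negative when the coefficients are only approximate.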

Chapter 7 discusses the thesis and presents directions for future research.

Acknowledgements

I would like to express my special thanks and appreciation to my advisor Prof. Hiroshi G. Okuno (now at Waseda University). He has allowed me to pursue a research topic that matters to me, and vigorously advised my research through his broad insights and keen comments. I cannot stress enough how important it was for me to be researching independently, while receiving support and advice for conducting and presenting my research. This section would not end if I listed the things I have learned from Prof. Okuno, but, in particular, what I have learned through his style of presentation and writing has become an invaluable asset of my life.

I would also like to express my special thanks and appreciation to Prof. Tatsuya Kawahara, who supervised my thesis after Prof. Okuno left Kyoto University, and helped me complete the thesis. His comments were essential for organizing the significance of my research from a broad perspective and presenting it as a unified work. Moreover, he has provided numerous valuable comments that helped me improve the quality of the thesis.

Furthermore, I would like to express my warm thanks and appreciation to Dr. Kazuyoshi Yoshii, who has provided suggestions from the early stage of my research, when my interest in Bayesian inference and music alignment was growing. It has simply been a joy to be working with a researcher whose works I admire. I have learned a lot about Bayesian inference, writing and presentation from him. My interest in Bayesian inference was sparked through conversations with Dr. Yoshii, so I cannot emphasize enough that without him, I would not have written the dissertation.

I also express my special thanks and appreciation to Prof. Toshiyuki Tanaka, a member of my dissertation committee. His numerous astute comments were imperative for improving the quality of the dissertation.

The works presented here were conducted mostly at Yamaha Corporation.
This dissertation would not have been written without the support of many staff members from Yamaha. First and foremost, I cannot thank my boss Mr. Takuya Fujishima enough for supporting my pursuit of the Ph.D. program. The various pointers he has offered and the numerous negotiations that he had with the various stakeholders were mandatory for both starting and completing the dissertation. I also express my warm thanks to the division managers Mr. Yukihiro Kawaguchi and Mr. Motoichi Tamura, and the staff at the human resources division, for allowing me to pursue the Ph.D. program. I would also like to especially thank Mr. Yoshinari Nakamura, Mr. Naoki Yasuraoka, Dr. Kazunobu Kondo, Dr. Yu Takahashi, Mr. Norihiro Uemura and Mr. Kouhei Sumi for inspiring conversations on statistical signal processing and music alignment.

I also express my special thanks and appreciation to Prof. Masataka Goto, the Prime Senior Researcher of the Information Technology Research Institute at the National Institute of Advanced Industrial Science and Technology. My interest in music alignment was sparked during my internship at Prof. Goto's lab. I cannot express enough how valuable the internship was, and how much pleasure and honor it was to work under Prof. Goto and the members of his lab. I learned a lot from his keen insights, often years ahead of the time.

I also express my thanks to former members of Okuno lab, especially Prof. Tetsuya Ogata (now at Waseda University), Prof. Kazunori Komatani (now at Osaka University), Prof. Toru Takahashi (now at Osaka Sangyo University) and Dr. Katsutoshi Itoyama. I have learned a lot from them during the master's program. In particular, I was lucky to co-author papers with Dr. Katsutoshi Itoyama during my Ph.D. program; his sharp comments were invaluable to me. I also warmly thank Ms. Hiromi Okazaki, who has provided secretarial support at our lab. I also thank my colleagues at the lab for the intellectually stimulating conversations.

I also thank my parents for the support that I have received.
I especially thank them for allowing me, during my high school years, to spend a great deal of time performing music, composing music, and tinkering with computers; I would not be doing research in music information retrieval if it were not for them.

Most importantly, I thank my wife Yumiko for (1) her unconditional support, (2) her infinite patience, and (3) understanding how to handle me when I am coding.

Contents

Contents
List of Figures
List of Tables
1 Introduction
    1.1 Motivation
    1.2 Uses of alignment techniques
    1.3 Problem
    1.4 Issues in music alignment
    1.5 Approach
    1.6 Organization of this thesis
2 Review of Music Alignment
    2.1 Preliminaries on music and musical acoustics
    2.2 Preliminaries on Bayesian inference
    2.3 Music alignment
    2.4 Formulation of Bayesian music alignment
3 Bayesian Audio-to-Score Alignment
    3.1 Formulation
    3.2 Inference
    3.3 Evaluation
    3.4 Summary
4 Bayesian Audio-to-Audio Alignment
    4.1 Conceptual overview
    4.2 Formulation
    4.3 Inference
    4.4 Evaluation
    4.5 Summary
5 Bayesian Subset Music Alignment
    5.1 Subset music alignment
    5.2 Formulation
    5.3 Inference
    5.4 Evaluation
    5.5 Summary
6 Bayesian Dereverberation for Music Alignment
    6.1 Formulation
    6.2 Inference
    6.3 Evaluation
    6.4 Summary
7 Conclusion
    7.1 Contribution of this thesis
    7.2 Directions for future research
A Application: Violin Fingering Inference
    A.1 Violin fingering: a primer
    A.2 Formulation
    A.3 Experiments
    A.4 Discussion
    A.5 Application: de-mystifying historical recordings
    A.6 Conclusion
B List of Distributions
Bibliography
List of Publications

List of Figures

1.1 An example of the recording workflow.
1.2 An interpretation retrieval system using alignment technique.
1.3 The relationship between a music piece, a music score and audio signals.
1.4 Aspects conveyed in a music score
1.5 Organization of this thesis.
2.1 Source-filter model.
2.2 Scientific pitch notation and f0
2.3 Aspects conveyed in a music score
2.4 The spectrogram of a Western music
2.5 Hierarchical Dirichlet Process.
2.6 Illustration of PLCA
2.7 Problem statement of music alignment.
2.8 Example of inter-part asynchrony.
2.9 Unified formulation of music alignment.
3.1 Conceptual idea of the proposed method.
3.2 Model of the music score as a state sequence.
3.3 Model of the state duration pdf.
3.4 Graphical model of the proposed audio-to-score alignment method.
3.5 SDR and SIR of the separated parts, using score-informed separation based on the proposed method.
4.1 An overview of generative audio alignment.
4.2 The idea of how a piece is generated.
4.3 The idea of the duration modeling.
4.4 Similarity matrix computed from Chopin Op. 41-2
4.5 Feature sequences (chroma vector) of two performances, overlayed by points where the state of the latent composition changes.
4.6 A short phrase played with 24 different interpretations, and the estimated u and Λ. Dotted arrows indicate the choice of executed phrasing.
4.7 The model of state duration pdf (p(l_nd) in Fig. 4.8).
4.8 The model of the performance and composition sequence