Recommender systems inspired by the structure of quantum theory

Cyril J. Stark∗

January 25, 2016

Abstract

Physicists use quantum models to describe the behavior of physical systems. Quantum models owe their success to their interpretability, to their relation to probabilistic models (quantization of classical models) and to their high predictive power. Beyond physics, these properties are valuable in general data science. This motivates the use of quantum models to analyze general nonphysical datasets. Here we provide both empirical and theoretical insights into the application of quantum models in data science. In the theoretical part of this paper, we firstly show that quantum models can be exponentially more efficient than probabilistic models because there exist datasets that admit low-dimensional quantum models and only exponentially high-dimensional probabilistic models. Secondly, we explain in what sense quantum models realize a useful relaxation of compressed probabilistic models. Thirdly, we show that sparse datasets admit low-dimensional quantum models and, finally, we introduce a method to compute hierarchical orderings of properties of users (e.g., personality traits) and items (e.g., genres of movies). In the empirical part of the paper, we evaluate quantum models in item recommendation and observe that the predictive power of quantum-inspired recommender systems can compete with state-of-the-art recommender systems like SVD++ and PureSVD. Furthermore, we make use of the interpretability of quantum models by computing hierarchical orderings of properties of users and items. This work establishes a connection between data science (item recommendation), information theory (communication complexity), mathematical programming (positive semidefinite factorizations) and physics (quantum models).

1 Introduction

Recommendation is a key discipline in machine learning that aims at predicting which items (e.g., movies, books, but also events) are liked by which people [1, 2, 3, 4]. Algorithms to compute these predictions are called recommender systems. Ideally, recommender systems address the following three objectives: (1) predictive power, (2) computational tractability, and (3) interpretability (e.g., to compute visual representations). Here we address these challenges by adopting the system-state-measurement paradigm [5] in the form of a class of models (quantum models) that physicists use to describe quantum systems [6].

Respecting the system-state-measurement paradigm amounts to distinguishing the 'state of a system' from the 'measurement device' we use to examine that system. For example, in recommendation, the system is that abstract part of our mind that decides whether we like or dislike an item. Preferences vary from person to person. The correspondingly varying manifestation of a person's system is described by the state of the system. Measurements are questions like "Do you like the movie Despicable Me?". Performing measurements on the taste of a person, we can get an increasingly refined understanding of a person's preferences, i.e., we get an increasingly refined understanding of a person's state. Hence, states and measurements are examples of user and item representations. In quantum models, both the states (user representations) and the measurements (item representations) are described in terms of normalized positive semidefinite (psd) matrices. We motivate the use of quantum models in terms of the following five points.
∗ Center for Theoretical Physics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge MA 02139-4307, USA

• We show that quantum models realize convenient relaxations of both compressed (section 7) and uncompressed (section 10) probabilistic models (see section 3).

• Let $d$ be the dimension of the lowest-dimensional probabilistic model fitting a given dataset, and let $d_Q$ be the dimension of the lowest-dimensional quantum model for the same dataset. Then, $d_Q \leq d$ always. On the other hand, using a result on communication complexity, we show that there exist datasets where $d$ is exponential in $d_Q$ (see section 9). Hence, for some datasets, we cannot hope to be able to compute probabilistic models but we can hope to find low-dimensional quantum models.

• The close relationship between probabilistic and quantum models allows us to 'quantize' methods developed for probabilistic models. We demonstrate this by 'quantizing' a technique from [5] to compute hierarchical orderings of properties (e.g., tags) of items and users (see section 11).

• The success of quantum models in physics is largely due to their interpretability, which is weaker than the interpretability of probabilistic models but greater than the interpretability of general matrix factorizations. This observation, combined with $d_Q \leq d$ (always) and $d_Q \ll d$ on some datasets, suggests that quantum models realize a practical compromise between high interpretability and low dimensionality of representations (e.g., user and item representations).

• We demonstrate the predictive power of quantum models empirically (see section 12). We observe that with respect to mean-average-error and recall, quantum models outperform major recommender systems like SVD++ [7] and PureSVD [8] on MovieLens datasets.

The remainder of this paper has five parts. In part 1 we provide preliminaries (sections 2 and 3). In part 2 we define quantum models, provide a simple algorithm for their computation and summarize related work (sections 4, 5 and 6). In part 3 we derive properties of quantum models (sections 7 to 11). In part 4 we provide empirical results evaluating the performance of quantum models on MovieLens datasets (section 12). In part 5 we conclude the paper (section 13).

2 Notation

For any $n \in \mathbb{N}$ we denote by $[n]$ the set $\{1,\ldots,n\}$. Throughout, we consider item recommendation. Here, $u \in [U]$ labels users, $i \in [I]$ labels items and $z \in [Z]$ denotes the possible ratings users provide for items (e.g., $z \in [5]$ in case of 5-star ratings). By $R \in [Z]^{U \times I}$ we denote the complete rating matrix, i.e., $R_{ui} \in [Z]$ is the rating that user $u$ provides for item $i$. In practice, we only know a subset of the entries of $R$. We use $\Gamma \subseteq [U] \times [I]$ to mark the entries of $R$ that are known at the time of data analysis. In the evaluation of recommender systems we use $\Gamma_{\mathrm{train}} \subseteq [U] \times [I]$ to mark entries in the training set, and we use $\Gamma_{\mathrm{test}} \subseteq [U] \times [I]$ to mark entries in the test set. A finite probability space is described in terms of a sample space $\Omega = \{\omega_1,\ldots,\omega_D\}$, probability distributions $\vec{p}$ and random variables $\hat{E}$ with some alphabet $[Z]$. We use $\Delta = \{\vec{p} \in \mathbb{R}^D_+ \mid \|\vec{p}\|_1 = 1\}$ to denote the unit simplex. We denote by $P_u[\hat{E}_i = z]$ the probability for user $u$ to rate item $i$ with $z \in [Z]$. For a matrix $A$, $\|A\|_F$ denotes its Frobenius norm, $\|A\|_1$ denotes its nuclear norm, and $\|A\|$ denotes its operator norm. We denote by $\mathrm{spect}(A)$ the eigenvalues of $A$. We use $S^+(\mathbb{C}^D)$ to refer to the set of Hermitian complex $D \times D$ matrices which are positive semidefinite (psd). $I$ denotes the identity matrix.
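To make the notation concrete, the following minimal Python sketch (an illustration, not code from the paper; the toy numbers and variable names are assumptions) shows one way to hold the partially known rating matrix $R$, the index set $\Gamma$, and a train/test split.

```python
import numpy as np
import random

U, I, Z = 4, 6, 5                      # numbers of users, items and rating values

# Partially known rating matrix: 0 marks an unknown entry, values in [1, Z] are known ratings.
R = np.zeros((U, I), dtype=int)
R[0, 1], R[0, 3], R[2, 5], R[3, 0] = 5, 3, 4, 2

# Gamma: the set of (u, i) pairs whose rating is known at the time of data analysis.
Gamma = {(u, i) for u in range(U) for i in range(I) if R[u, i] > 0}

# A random split of Gamma into training and test entries, mirroring Gamma_train and Gamma_test.
random.seed(0)
pairs = sorted(Gamma)
random.shuffle(pairs)
cut = int(0.75 * len(pairs))
Gamma_train, Gamma_test = set(pairs[:cut]), set(pairs[cut:])
```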
3 Preliminaries

By Kolmogorov, random (finite) experiments are described in terms of the following three parts. Firstly, a sample space $\Omega = \{\omega_1,\ldots,\omega_D\}$. Secondly, a probability distribution $\vec{p} \in \Delta = \{\vec{p} \in \mathbb{R}^D_+ \mid \|\vec{p}\|_1 = 1\}$. Thirdly, a random variable $\hat{E}$, i.e., a function $\hat{E} : \Omega \to [Z]$. In the following, $P[\hat{E} = z]$ denotes the probability for measuring the event $\{\omega \mid \hat{E}(\omega) = z\} = \hat{E}^{-1}(z)$. A random variable $\hat{E}$ is fully specified by indicator vectors $\vec{E}_z \in \{0,1\}^D$ whose supports equal $\{\omega \mid \hat{E}(\omega) = z\} = \hat{E}^{-1}(z)$. Thus, these indicator vectors satisfy $\sum_z \vec{E}_z = (1,\ldots,1)^T$. We observe that $P[\hat{E} = z] = \vec{E}_z^T \vec{p}$.

In this section we show (at the example of item recommendation) how the system-state-measurement paradigm can be adopted in the form of normalized nonnegative models (NNM) [5]. In the application of NNMs, the system is described by some sample space $\Omega = \{\omega_1,\ldots,\omega_D\}$. Each user $u$ is represented in terms of a probability distribution $\vec{p}_u \in \Delta$ on $\Omega$, and answers to questions "Do you like item $i$?" are regarded as samples of a random variable $\hat{E}_i : \Omega \to [Z]$ assigned to item $i$. In case of 5-star ratings, $Z = 5$. In the remainder, $P_u[\hat{E}_i = z]$ denotes the probability for user $u$ to rate $i$ with value $z \in [Z]$. Therefore,

$$P_u[\hat{E}_i = z] = P_u[\hat{E}_i^{-1}(z)] = \vec{E}_{iz}^T \vec{p}_u \qquad (1)$$

where $\vec{E}_{iz} \in \{0,1\}^D$ is defined through

$$(\vec{E}_{iz})_j = \begin{cases} 1, & \text{if } \omega_j \in \hat{E}_i^{-1}(z), \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

By construction, $\sum_z \vec{E}_{iz} = (1,\ldots,1)^T$. Let $M_0$ be the set of allowed values of $(\vec{E}_{i1},\ldots,\vec{E}_{iZ})$, i.e., $M_0 = \big\{(\vec{E}_1,\ldots,\vec{E}_Z) \in \{0,1\}^{D \times Z} \,\big|\, \sum_z \vec{E}_z = (1,\ldots,1)^T\big\}$. We call the collection of vectors $\big((\vec{p}_u)_u, (\vec{E}_{iz})_{iz}\big)$ a Kolmogorov factorization if $\vec{p}_u \in \Delta$ for all users $u \in [U]$ and if $(\vec{E}_{iz})_{z \in [Z]} \in M_0$ for all items $i \in [I]$ (see [5]). The convex relaxation of $M_0$ is

$$M := \Big\{(\vec{E}_1,\ldots,\vec{E}_Z) \in \mathbb{R}_+^{D \times Z} \;\Big|\; \sum_z \vec{E}_z = (1,\ldots,1)^T\Big\}.$$

The tuple of vectors $\big((\vec{p}_u)_u, (\vec{E}_{iz})_{iz}\big)$ is a normalized nonnegative model if $\vec{p}_u \in \Delta$ for all users $u \in [U]$ and if $(\vec{E}_{iz})_{z \in [Z]} \in M$ for all items $i \in [I]$ (see [5] for simple examples).

NNMs allow for interpreting user preferences as mixtures of interpretable user stereotypes [5], they allow for the computation of hierarchical orderings of tags of users and items [5], they allow for fast inference of users' states in terms of interpretable questions [9], and NNMs can be used to compute operational user-user and item-item distance measures [10]. Consequently, the relaxation of Kolmogorov factorizations $\mapsto$ NNMs preserves a lot of the interpretability of Kolmogorov factorizations.
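As a concrete illustration of equations (1) and (2), the following short Python sketch (toy numbers assumed for this illustration, not taken from the paper) builds a tiny NNM and evaluates the rating probabilities $\vec{E}_{iz}^T \vec{p}_u$, checking the constraints defining $\Delta$ and $M$ along the way.

```python
import numpy as np

D, Z = 3, 2                                   # sample-space size and number of rating values

# User representation: a probability distribution p_u on Omega (an element of the simplex Delta).
p_u = np.array([0.5, 0.3, 0.2])
assert np.all(p_u >= 0) and np.isclose(p_u.sum(), 1.0)

# Item representation: nonnegative vectors E_{i,1}, E_{i,2} summing to the all-ones vector (an element of M).
E_i = np.array([[0.9, 0.2, 0.0],              # E_{i,1}: "like"
                [0.1, 0.8, 1.0]])             # E_{i,2}: "dislike"
assert np.all(E_i >= 0) and np.allclose(E_i.sum(axis=0), 1.0)

# Rating probabilities P_u[E_i = z] = E_{iz}^T p_u, cf. equation (1).
probs = E_i @ p_u
print(probs)                                  # [0.51 0.49]; the probabilities sum to one by construction
assert np.isclose(probs.sum(), 1.0)
```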
3.1 Categorical variables

In the definition of NNMs, random variables are categorical. Hence, NNMs can in principle be used to fit a wide range of data. For instance, in recommendation, side information about the occupation of users can be modeled in terms of a categorical random variable. However, we do not make use of all available information when treating ordered data in a categorical manner. For instance, star ratings (provided by users for items) are ordered because a 4-star rating is better than a 3-star rating. Hence, if we want to use data economically, it can be beneficial to interpret the rating $R_{ui} \in [Z]$ of an item $i$ provided by a user $u$ as an approximation of $P[u \text{ likes } i]$. More precisely,

$$R_{ui}/Z \approx P[u \text{ likes } i]. \qquad (3)$$

Then, the random variables $\hat{E}_i$ are binary. Their outcomes are interpreted as 'I like item $i$' and 'I dislike item $i$', respectively.

4 Quantum models

By (1), NNMs are specified in terms of nonnegative vectors in $\mathbb{R}^D_+$. Equivalently, we could describe NNMs through diagonal psd matrices:

$$\vec{p}_u \mapsto \sum_j (\vec{p}_u)_j \, \vec{e}_j\vec{e}_j^T =: \rho_u^{(0)}, \qquad \vec{E}_{iz} \mapsto \sum_j (\vec{E}_{iz})_j \, \vec{e}_j\vec{e}_j^T =: E_{iz}^{(0)} \qquad (4)$$

where $(\vec{e}_j)_n = \delta_{jn}$ denotes the canonical basis. The two descriptions $\vec{p}_u, \vec{E}_{iz}$ and $\rho_u^{(0)}, E_{iz}^{(0)}$ are entirely equivalent because

$$P_u[\hat{E}_i = z] = \sum_{jk} (\rho_u^{(0)})_{jk}\,(E_{iz}^{(0)})_{jk} = \mathrm{tr}(\rho_u^{(0)} E_{iz}^{(0)}). \qquad (5)$$

But the description $\rho_u^{(0)}, E_{iz}^{(0)}$ motivates a natural relaxation of NNMs by relaxing the constraint that the positive semidefinite (psd) matrices $\rho_u^{(0)}, E_{iz}^{(0)}$ are diagonal. This relaxation is sometimes called 'quantization' in the physics literature. After quantization, the state of a user $u$ is represented by $\rho_u \in \Delta'$,

$$\Delta' = \big\{\rho \in S^+(\mathbb{C}^D) \;\big|\; \mathrm{tr}(\rho) = 1\big\}, \qquad (6)$$

and an item $i$ is represented by $(E_{iz})_{z \in [Z]} \in M'$,

$$M' = \Big\{(E_z)_{z \in [Z]} \in S^+(\mathbb{C}^D)^Z \;\Big|\; \sum_z E_z = I\Big\}. \qquad (7)$$

In quantum information [6], $\Delta'$ is the space of so-called quantum states (aka density matrices) and $M'$ is the space of quantum measurements (aka positive operator valued measures). For this reason, we call the tuple of user and item matrices $\big((\rho_u)_u, (E_{iz})_{iz}\big)$ a quantum model if $\rho_u \in \Delta'$ for all users $u \in [U]$ and if $(E_{iz})_{z \in [Z]} \in M'$ for all items $i \in [I]$. Inspired by (5), the rule

$$P_u[\hat{E}_i = z] = \sum_{jk} (\rho_u)_{jk}\,\overline{(E_{iz})_{jk}} = \mathrm{tr}(\rho_u E_{iz}) \qquad (8)$$

relates user and item representations to observable probabilities. Here, the bar denotes complex conjugation. Up to normalization constraints, models in terms of $\rho_u$ and $E_{iz}$ are intimately related to psd factorizations [11, 12] studied in mathematical programming; see [13] for a recent review.

Example. Let $D = 2$ and $Z = 2$. Then, for

$$\rho_u = \frac{1}{5}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, \quad E_{i1} = \frac{1}{10}\begin{pmatrix} 1 & 3 \\ 3 & 9 \end{pmatrix}, \quad E_{i2} = \frac{1}{10}\begin{pmatrix} 9 & -3 \\ -3 & 1 \end{pmatrix},$$

we get

$$P_u[\hat{E}_i = 1] = \mathrm{tr}(\rho_u E_{i1}) = \frac{49}{50}, \qquad P_u[\hat{E}_i = 2] = \mathrm{tr}(\rho_u E_{i2}) = \frac{1}{50}.$$
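The numbers in this example can be verified directly; the following lines are a minimal numpy check (not part of the paper), reusing the matrices from the example and testing membership in $\Delta'$ and $M'$.

```python
import numpy as np

rho_u = np.array([[1, 2], [2, 4]]) / 5          # user state: psd, trace one
E_i1  = np.array([[1, 3], [3, 9]]) / 10         # "like" effect
E_i2  = np.array([[9, -3], [-3, 1]]) / 10       # "dislike" effect

# Membership in Delta' and M': psd (nonnegative eigenvalues), tr(rho) = 1, E_i1 + E_i2 = I.
assert np.all(np.linalg.eigvalsh(rho_u) >= -1e-12) and np.isclose(np.trace(rho_u), 1.0)
assert np.all(np.linalg.eigvalsh(E_i1) >= -1e-12) and np.all(np.linalg.eigvalsh(E_i2) >= -1e-12)
assert np.allclose(E_i1 + E_i2, np.eye(2))

# Rating probabilities via the trace rule (8).
print(np.trace(rho_u @ E_i1))   # 0.98 = 49/50
print(np.trace(rho_u @ E_i2))   # 0.02 =  1/50
```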
5 Heuristic computation of quantum models

We denote by $R_{ui} \in [Z]$ the rating user $u$ provides for item $i$. Let $\Gamma \subseteq [U] \times [I]$ be such that $(u,i) \in \Gamma$ if and only if $R_{ui}$ is known. We adopt (3) and interpret $z = 1$ as 'like' and $z = 2$ as 'dislike'. Then, the computation of NNMs amounts to approximately solving some variant of the optimization problem

$$\begin{aligned} \text{minimize} \quad & \sum_{(u,i)\in\Gamma}\big(\vec{E}_{i1}^T\vec{p}_u - R_{ui}/Z\big)^2 \\ \text{s.t.} \quad & \vec{p}_u \in \Delta,\; (\vec{E}_{i1},\vec{E}_{i2}) \in M. \end{aligned} \qquad (9)$$

Similarly, the computation of quantum models amounts to solving a variant of

$$\begin{aligned} \text{minimize} \quad & \sum_{(u,i)\in\Gamma}\big(\mathrm{tr}(E_{i1}\rho_u) - R_{ui}/Z\big)^2 \\ \text{s.t.} \quad & \rho_u \in \Delta',\; (E_{i1},E_{i2}) \in M'. \end{aligned} \qquad (10)$$

A simple approach to approximately solve (10) is alternating constrained optimization, where we distinguish between the update of the user representations and the update of the item representations. When updating the users we keep the items fixed, and when updating the items we keep the users fixed.

Choosing a convenient initialization of the user matrices/vectors improves the performance of alternating optimization. Next, we describe one particular possibility to find convenient initializations. Our initialization strategy is based on the observation that real-life data is oftentimes subject to a strong selection bias towards high ratings; see for example Figure 1 about the MovieLens 1M dataset (http://files.grouplens.org/datasets/movielens/ml-1m-README.txt). Selection biases of that kind can be detrimental for item recommendation [14]. To illustrate this, we imagine that all the provided ratings are 5-star ratings. Then, the 1(!)-dimensional factorization with $p_u = 1$ and $E_{i1} = \ldots = E_{i4} = 0$, $E_{i5} = 1$ fits the data perfectly but is expected to have extremely poor predictive power because it predicts that all users like all items.

[Figure 1: Frequency of ratings as a function of $z \in [5]$ at the example of the MovieLens 1M dataset (x-axis: star ratings, y-axis: frequency).]

The selection bias is a consequence of the simple fact that consumers usually only consider watching a movie which they expect to like. Hence, conditioned on a movie being watched, it is more likely to be rated highly. On the other hand, if users have not yet watched a movie, then this suggests that they would rate the movie badly if they watched it. Therefore, we should expect a movie $i$ to be rated badly by user $u$ if $u$ has not yet rated $i$. This side information can be used to improve the performance of recommender systems [8].

How we make use of knowing about the selection bias depends on the figure of merit. When aiming for large recall (the appendix of [5] contains a reminder of the definitions of recall, MAE and RMSE), we suggest to assume that all the unknown entries of the user-item matrix $R$ are zero when approximating (10); the precise star value does not matter too much, and those zero-fillings capture a general trend. On the other hand, when aiming for small mean-average-error (MAE), we should not proceed in exactly the same manner because setting unknown entries equal to zero can distort the precise value of the predicted star ratings too much. This is why we suggest to use the zero-fillings only during initial iterations of alternating optimization (to end up with an initialization of the users and items that is better than a random initialization). We observe empirically that said strategies for initialization indeed improve predictive power. Zero-fillings of this kind are common in the design of recommender systems [8].

Adopting (3), we thus arrive at Algorithm 1 for solving (10). Some practically-oriented readers might refrain from implementing Algorithm 1 because it requires the solution of SDPs. However, for the purpose of recommendation, we do not require a full-fledged SDP solver like SDPT3 [15]; a simple barrier function to enforce the SDP constraint approximately is sufficient because we are not primarily interested in high precision.

Algorithm 1 Alternating optimization for quantum models
1: Fix $D$ (e.g., by cross validation).
2: For all $u$, initialize $\rho_u = \vec{v}_u\vec{v}_u^{\dagger}$ where $\vec{v}_u$ is sampled uniformly from the complex unit sphere.
3: For all items $i$, solve the semidefinite program $\min_{(E_{i1},E_{i2})\in M'} \sum_{u:(u,i)\in\Gamma}\big(\mathrm{tr}(E_{i1}\rho_u)-R_{ui}/Z\big)^2$.
4: For all users $u$, solve the semidefinite program $\min_{\rho_u\in\Delta'} \sum_{i:(u,i)\in\Gamma}\big(\mathrm{tr}(E_{i1}\rho_u)-R_{ui}/Z\big)^2$.
5: Repeat steps 3 and 4 until a stopping criterion is satisfied, e.g., until a maximum number of iterations is reached. For the first 2 iterations, when optimizing for MAE, we pretend we knew all of $R$ by setting unknown entries equal to zero. When optimizing for recall, we use those zero-fillings during all iterations.
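The sketch below illustrates one possible realization of Algorithm 1 in Python. It is an assumption-laden illustration rather than the authors' implementation: instead of calling an SDP solver in steps 3 and 4, it takes projected gradient steps whose projections clip eigenvalues onto the psd constraints (in the spirit of the remark above that a full-fledged SDP solver is not required). The function names, learning rate and iteration counts are arbitrary choices of this illustration, and the zero-filling trick of step 5 is omitted for brevity.

```python
import numpy as np

def project_state(M):
    """Approximate projection onto Delta' (Hermitian, psd, trace one): clip eigenvalues, renormalize."""
    M = (M + M.conj().T) / 2
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)
    if w.sum() == 0.0:
        w = np.ones_like(w)
    w /= w.sum()
    return (V * w) @ V.conj().T

def project_effect(M):
    """Projection onto {E : 0 <= E <= I}; setting E_i2 := I - E_i1 then keeps (E_i1, E_i2) in M'."""
    M = (M + M.conj().T) / 2
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, 1.0)) @ V.conj().T

def fit_quantum_model(R, Gamma, D=2, Z=5, n_sweeps=20, n_steps=25, lr=0.1, seed=0):
    """Alternating projected-gradient heuristic for problem (10): tr(E_i1 rho_u) should match R[u,i]/Z."""
    rng = np.random.default_rng(seed)
    U, I = R.shape
    # Step 2: random pure states rho_u = v_u v_u^dagger with normalized complex v_u.
    V0 = rng.normal(size=(U, D)) + 1j * rng.normal(size=(U, D))
    rho = np.array([np.outer(x, x.conj()) / np.vdot(x, x).real for x in V0])
    E1 = np.array([np.eye(D, dtype=complex) / 2 for _ in range(I)])
    users_of = {i: [u for (u, j) in Gamma if j == i] for i in range(I)}
    items_of = {u: [i for (uu, i) in Gamma if uu == u] for u in range(U)}
    for _ in range(n_sweeps):
        for i in range(I):                     # step 3: update the "like" effect E_i1, users held fixed
            for _ in range(n_steps):
                g = sum((np.trace(E1[i] @ rho[u]).real - R[u, i] / Z) * rho[u] for u in users_of[i])
                E1[i] = project_effect(E1[i] - lr * g)
        for u in range(U):                     # step 4: update the user state rho_u, items held fixed
            for _ in range(n_steps):
                g = sum((np.trace(E1[i] @ rho[u]).real - R[u, i] / Z) * E1[i] for i in items_of[u])
                rho[u] = project_state(rho[u] - lr * g)
    return rho, E1
```

A call like `fit_quantum_model(R, Gamma, D=3)`, with `R` and `Gamma` as in the notation sketch above, returns fitted user states and 'like' effects from which predicted star ratings can be read off as $Z\,\mathrm{tr}(E_{i1}\rho_u)$.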
6 Related work

Data mining. In the study of recommendation we are interested in events like 'u buys item i', 'u attends event i', etc. We abbreviate these events by $(u,i)$. In aspect models, pLSA [16, 17] (and similarly in latent Dirichlet allocation [18]), we regard the value $(u,i)$ as a random variable with distribution

$$P[u,i] = \sum_k P[k]\,P[u|k]\,P[i|k]. \qquad (11)$$

Thus, when adopting (11), we model $P[u,i]$ as a (possibly scaled) inner product between the vectors $(P[u|k])_k$ and $(P[i|k])_k$. This is reminiscent of the inner product (8), where one half of the inner product is the non-commutative analog of a probability distribution associated to $u$. However, (11) and (8) disagree on the second half of said inner product. In (11), the second factor is another probability distribution. In (8), on the other hand, the second factor $E_{iz}$ is constrained through the existence of $E_{i1},\ldots,E_{i,z-1},E_{i,z+1},\ldots,E_{iZ}$ such that $(E_{i1},\ldots,E_{iZ}) \in M'$.

In a similar manner, quantum models are related to more general NMF-based models [19, 20] because in both approaches we model data in terms of cone-constrained vectors: nonnegative vectors in case of NMF and psd matrices in case of quantum models.

The description of data in terms of $\rho_u$, $E_{iz}$ allows for the description of categorical random variables (see section 3.1), and we can straightforwardly extract hierarchical structures from quantum models; see section 11. Aspect models can also be used for the description of categorical variables [16] and for the computation of hierarchical structures [21, 22]. Quantum models realize an alternative approach for meeting these objectives. Furthermore, categorical random variables can also be modeled by a family of graphical models called multinomial mixture models [23]. This model class puts forward a single distribution $\theta$ (only one distribution for all users) over so-called user attitudes. Thus, in contrast to NNMs and quantum models, the prediction for user $u$'s rating of item $i$ is not a function of an individual user vector and an item vector. We regard the separate description of users and items to be important because it is one pillar of the interpretability of models like NNMs and quantum models.

Probabilistic matrix factorization (PMF, [24]) is a very practical class of models related to quantum models. In PMF we regard the entries $R_{ui}$ of the rating matrix as independent Gaussian random variables with mean $\vec{U}_u^T\vec{V}_i$ and variance $\sigma$. Therefore, as in the case of aspect models and NMF, the description of ratings is the result of inner products between some vectors. In contrast to PMF, quantum models do not need to assume a Gaussian distribution for ratings. Once the dimension $D$ has been fixed by cross validation, we do not need to assume a particular family of distributions for $R_{ui}$.

Bayesian nonnegative matrix factorization (BNMF) [25] is closely related to PMF because ratings are sampled from a particular parametric family of distributions and because we impose a low-rank (and, in BNMF, nonnegative) model onto the parameters. BNMF assumes Poisson-distributed ratings. Hence, using BNMF we can model discrete but not categorical random variables.

In Algorithm 1 we make use of the selection bias in recommendation datasets. Selection biases have been studied in more detail in [14]. That paper introduces the recommender system AllRank, which successfully accounts for the selection bias.
We could have also included the AllRank parameters $r_m$ and $w_m$ in the training of quantum models, but we felt that this would distract from the simplicity of inference of quantum models through alternating optimization. AllRank allows for great predictions but not for strong interpretability.

Quantum theory. In physics we distinguish between two different ways to describe quantum experiments. The first possibility is simply to describe what we do and what we observe. For instance, we could describe an experiment in terms of two manuals and a dataset. The first manual provides an experimentalist with all the necessary instructions to prepare some quantum states and to apply some processes. The second manual specifies how to build and apply some measurement devices. The dataset is a record of measurement results. The second approach to describe quantum experiments is in terms of density matrices, measurement matrices and processes [6]. This mathematical description allows for predictions of future measurement outcomes, and it makes the experiment highly interpretable.

A significant part of physics is about inference, namely, the translation of the first, empirical description (i.e., the two manuals plus dataset) into the second, mathematical description (i.e., states, measurements and processes). Here again we distinguish between two different approaches. The more traditional approach uses physical heuristics to translate the manuals and the dataset into a theoretical model of the experiment (e.g., through quantization of classical models). However, in the past two decades we observed limits of physical heuristics, as the complexity of modern quantum experiments (e.g., superconducting qubits) appears to be too high to be accurately described in terms of heuristics [26, 27]. The resulting absence of accurate theoretical models for existing quantum devices presents a severe bottleneck for the design of new quantum devices and quantum computers. This deficiency led to the development of methods for self-consistent tomography [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 26, 38, 27, 39, 40, 41, 42, 43, 44, 45, 46, 37, 47], where we try to infer quantum models in a way that avoids physical heuristics. In self-consistent tomography, an experiment is regarded as a black box that accepts settings (specifying the state to prepare, the process to apply, and the measurement to perform) and outputs measurement results. This black-box interpretation opens up the possibility to use methods from self-consistent tomography to fit quantum models to arbitrary non-physical datasets. The quantum models we obtain can then be used to interpret data through a generalization of probability theory: quantum theory. Here, in this paper, we introduce this quantum perspective on general datasets at the example of item recommendation.

The paper [48] shows that exact inference of lowest-dimensional quantum models is NP-hard. Here, in this paper, we make use of structural properties underlying quantum models. Alternatively, we can try to build quantum devices to run quantum algorithms for machine learning [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]. We observe a quickly growing interest in this new and promising area at the intersection of physics and machine learning.

Quantum foundations. We argue that quantum models realize a useful tradeoff between the low complexity of general matrix factorizations and the high interpretability of NNMs; see figure 6. This discussion fits naturally into the discipline called quantum foundations, where researchers try to determine the operational consequences of different choices of state and measurement spaces [60, 61, 62, 63, 64, 65, 66, 67].
The significance of the comparison of different regularization schemes in data science confirms the practical value of results produced by researchers working on quantum foundations. Here we give an argument for quantum models in terms of the operational meaning of interpretability. Unfortunately, interpretability appears to be hard to capture axiomatically. Consequently, it is not clear how we can build axioms for quantum theory on the basis of interpretability.

Cognition and economics. Item recommendation is related to the study of cognition and economics because item recommendation aims at capturing people's preferences. In these disciplines, quantum models have become increasingly popular in the past years; see for instance [68, 69, 70, 71, 72] and references therein.

7 Quantum models as compression of NNMs

Considering $Z$ and the sparsity of the item vectors $\vec{E}_{iz}$ as functions of $D$, we call an NNM pseudo-sparse if

• $Z$ is small, i.e., $Z \sim O(1)$ in $D$;

• for all items $i$ there exists $z' \in [Z]$ such that $\vec{E}_{i,1},\ldots,\vec{E}_{i,z'-1}$ and $\vec{E}_{i,z'+1},\ldots,\vec{E}_{i,Z}$ are sparse (i.e., all item vectors except possibly $\vec{E}_{i,z'}$ are sparse), where 'sparse' means 'sparsity $\sim O(1)$ in $D$'.

Similarly, we call a quantum model $\big((\rho_u)_u,(E_{iz})_{iz}\big)$ pseudo-low-rank [32] if

• $Z$ is small, i.e., $Z \sim O(1)$ in $D$;

• for all items $i$ there exists $z' \in [Z]$ such that $E_{i,1},\ldots,E_{i,z'-1}$ and $E_{i,z'+1},\ldots,E_{i,Z}$ are low-rank, where 'low-rank' means 'rank $\sim O(1)$ in $D$'.

The following Theorem 1 shows that pseudo-sparse NNMs can be compressed into low-dimensional quantum models.

Theorem 1. Let $J = U + IZ$, let $\varepsilon$ satisfy $\frac{4}{\sqrt{JD}} \leq \varepsilon \leq \frac{1}{2}$, and let $\big((\vec{p}_u)_u,(\vec{E}_{iz})_{iz}\big)$ be a $D$-dimensional NNM. Assume that $\big((\vec{p}_u)_u,(\vec{E}_{iz})_{iz}\big)$ is pseudo-sparse and that

$$d = O\Big(\frac{1}{\varepsilon^2}\ln\big(4JD\big)\Big).$$

Then, there exists a $d$-dimensional quantum model $\big((\rho'_u)_u,(E'_{iz})_{iz}\big)$ satisfying $\big|\vec{p}_u^T\vec{E}_{iz} - \mathrm{tr}(\rho'_u E'_{iz})\big| \leq O(\varepsilon)$ for all $u, i, z$.

We prove Theorem 1 in the appendix. By Theorem 1, $\Delta'$ and $M'$ (see section 4) realize relaxations of the compression of $\Delta$ and $M$ (see section 3). Semidefinite programming allows for efficient optimization over $\Delta'$ and $M'$. Of course, we could also characterize compressed NNMs in terms of the polyhedral cone $C$ which is the image of $\mathbb{R}^D_+$ under a Johnson-Lindenstrauss-type compression mapping [73]. However, that cone has exponentially many faces when $D$ is exponential in $d$. The time complexity of linear programming increases polynomially in the number of half-space constraints. It follows that optimizing over said polyhedral cone $C$ is very expensive. Hence, we cannot reconstruct a compressed NNM by running methods like alternating optimization on $C$. However, we can optimize efficiently over the psd cone from Corollary 1, which realizes a relaxation of $C$. Hence, running methods like alternating optimization on the psd cone, we can compute relaxed compressed NNMs.

8 Fitting sparse data

Quantum models would not be of any use if they could only be fitted to a small class of empirical data. Here we show that this is not the case by proving that quantum models can be fitted to sparse datasets. This is significant because most real-world datasets are sparse (see, e.g., http://www.librec.net/datasets.html). Recall that $\Gamma \subseteq [U] \times [I]$ marks the positions of the known entries of the rating matrix $R$.
By the following Theorem 2, if $\Gamma$ is sparse, we can find low-dimensional quantum models fitting $R_\Gamma$ with the property that, for all $(u,i) \in \Gamma$,

$$\mathrm{tr}(\rho_u E_{iz}) \begin{cases} \geq 1-\varepsilon, & \text{if } R_{ui} = z, \\ \leq \varepsilon, & \text{otherwise.} \end{cases} \qquad (12)$$

Theorem 2. Denote by $U$ and $I$ the number of users and items, respectively. Let $J = U + IZ$ and let $\varepsilon$ satisfy $\frac{4}{\sqrt{JD}} \leq \varepsilon \leq \frac{1}{2}$. Assume that the number of times an item has been rated is constant in $U$. Then, for

$$d = O\Big(\frac{1}{\varepsilon^2}\ln\big(4JU\big)\Big),$$

there exists a $d$-dimensional quantum model $\big((\rho_u)_u,(E_{iz})_{iz}\big)$ satisfying $\big|\delta_{z,R_{ui}} - \mathrm{tr}(\rho_u E_{iz})\big| \leq O(\varepsilon)$ for all $(u,i) \in \Gamma$ and $z \in [Z]$.

By Theorem 2, there exist low-dimensional quantum models fitting sparse datasets. On the other hand, Theorem 2 states that it is easy to overfit sparse datasets with quantum models. These low-dimensional models have no predictive power despite being low-dimensional. This is because we end up with said low-dimensional quantum model by compressing an NNM that had no predictive power in the first place (it predicts that, with probability 1, all the missing entries of $R$ are equal to 1, which is an arbitrary choice). Theorem 2 is proven in the appendix. (To our knowledge, it is an open problem to prove whether or not a result analogous to Theorem 2 holds for NNMs.)

9 Exponential dimension gap between quantum models and NNMs

For some $\varepsilon \in [0, 1/2)$, set

$$d_{\mathrm{NNM}} := \min\, d \quad \text{s.t.} \;\; \exists\, d\text{-dimensional NNM } (\vec{p}_u, \vec{E}_{iz}) \text{ s.t. } \big|\vec{p}_u^T\vec{E}_{iz} - \delta_{z,R_{ui}}\big| \leq 1-\varepsilon \;\; \forall\, (u,i) \in \Gamma \qquad (13)$$

and

$$d_Q := \min\, d \quad \text{s.t.} \;\; \exists\, d\text{-dimensional quantum model } (\rho_u, E_{iz}) \text{ s.t. } \big|\mathrm{tr}(\rho_u E_{iz}) - \delta_{z,R_{ui}}\big| \leq 1-\varepsilon \;\; \forall\, (u,i) \in \Gamma. \qquad (14)$$

The following Theorem 3 is proven in the appendix using seminal results about the one-way communication complexity of Boolean functions [74, 75].

Theorem 3. For all data $(R_{ui})_{(u,i)\in\Gamma}$, $d_Q \leq d_{\mathrm{NNM}}$. Moreover, there exists $(R_{ui})_{(u,i)\in\Gamma}$ such that

$$\log_2(d_Q) = O(\log(n)), \qquad \log_2(d_{\mathrm{NNM}}) = \Theta(\sqrt{n}).$$

By Theorem 3 there exist partially known rating matrices $R_\Gamma$ for which $d_{\mathrm{NNM}}$ is exponential in $d_Q$. Hence, there exist data tables that cannot be fitted by NNMs (for computational reasons) but can in principle be fitted by quantum models. In the appendix we prove Theorem 3 through a relation between the dimension of NNMs (quantum models) and the bounded-error (quantum) one-way communication complexity of Boolean functions.

10 Computation of NNMs through quantum models

Imagine we want to find an NNM satisfying $P_u[\hat{E}_i = z] \approx \vec{p}_u^T\vec{E}_{iz}$. For this purpose we could run the natural analog of Algorithm 1 where the SDP constraint is replaced by an $\mathbb{R}^D_+$ constraint. However, here we suggest to consider running Algorithm 1 for the computation of NNMs, because when passing from NNMs to quantum models we enlarge the parameter space that is available to heuristic algorithms like alternating optimization. In general, this is likely to be beneficial whenever minimizers of (9) can be extracted from minimizers of (10). Here, in terms of the analysis of the following toy example (Lemma 4), we show that this can be the case. Enlarging feasible sets in that manner is bread-and-butter in non-convex optimization [12], where we try to 'lift' difficult optimization problems to end up with easy optimization problems (ideally convex).

Lemma 4. For each dimension $D$ there exist NNMs $\big((\vec{p}_u)_u,(\vec{E}_{iz})_{iz}\big)$ with the following property. For every $D$-dimensional quantum model $\big((\rho_u)_u,(E_{iz})_{iz}\big)$ satisfying $\vec{p}_u^T\vec{E}_{iz} = \mathrm{tr}(\rho_u E_{iz})$ (for all $u,i,z$) there exists a unitary matrix $U$ such that

$$U\vec{p}_u\vec{p}_u^TU^* = \rho_u, \qquad U\vec{E}_{iz}\vec{E}_{iz}^TU^* = E_{iz}$$

for all $u,i,z$. The unitary $U$ can be computed efficiently.
Lemma 4 shows that there exist situations where an underlying NNM can be computed by solving the relaxation (10) of (9). We think that the class of toy examples provided in the proof of Lemma 4 successfully captures the nature of situations where the available data about the users' ratings is sufficient to push the user and item representations towards the boundary of $\mathbb{R}^D_+$ and $S^+(\mathbb{C}^D)$ (so that uniqueness of any NNM and quantum model is enforced). We provide the proof of Lemma 4 in the appendix.

11 Extraction of hierarchical orderings

Here, we describe how quantum models can be used to order properties of items (or users) in a hierarchical manner; see for example figure 5.

11.1 Description of tags

In practice, we often have access not only to entries of the user-item matrix $R$ but also to side information specifying properties of users and items. For instance, in movie recommendation, the MovieLens 1M dataset specifies the genre of each of the movies in the dataset. For example, Jurassic Park $\in$ {Action, Adventure, Sci-Fi} and Anchorman $\in$ {Comedy}. Let $\{i^t_1,\ldots,i^t_{m_t}\}$ be the set of items that have been tagged with a particular tag $\tau_t$ from the set $\{\tau_t\}_{t\in[T]}$ of all tags. Is there a way to represent tags (e.g., genre tags) mathematically? To answer this question, we suggest to consider the following game, which was introduced in [5]. It involves a referee and a player named Alice. The game proceeds as follows.

1. The referee chooses a tag $\tau_t$ and a user $u$.
2. Alice is given access to $\rho_u$, to $\tau_t$, to $\{i^t_1,\ldots,i^t_{m_t}\}$ and to all item matrices $E_{iz}$ ($Z = 2$; $z = 1$ means 'like', $z = 2$ means 'dislike').
3. The referee chooses uniformly at random an item $i^*$ from the set $\{i^t_1,\ldots,i^t_{m_t}\}$ of all items that were tagged with $\tau_t$.
4. Querying $R$, the referee checks whether user $u$ likes or dislikes $i^*$. We denote $u$'s opinion by $z^* \in$ {like, dislike}.
5. Alice guesses $z^*$. She wins the game if she guesses correctly and loses otherwise.

What is Alice's winning probability? We denote by $P[z^* = 1 \mid i]$ the probability for the event $\{z^* = 1\}$ conditioned on the event "Referee draws $i$". It follows that $P[z^* = 1 \mid i] = \mathrm{tr}(\rho_u E_{i1})$. The probability for the event "Referee draws $i$" is $1/m_t$ because in total there are $m_t$ items tagged with $\tau_t$. Thus,

$$P[z^* = 1] = \sum_{i\in\{i^t_1,\ldots,i^t_{m_t}\}} P[z^* = 1 \mid i]\,P[i] = \mathrm{tr}(\rho_u E_t)$$

with

$$E_t := \frac{1}{m_t}\sum_{i\in\{i^t_1,\ldots,i^t_{m_t}\}} E_{i1}. \qquad (15)$$

$E_t$ serves as a useful characterization of the item property $\tau_t$ because $E_t$ determines for each user $u$ the probability that $u$ likes a random item with property $\tau_t$. Similarly, in NNMs, we can describe tags through tag vectors $\vec{E}_t := \frac{1}{m_t}\sum_{i\in\{i^t_1,\ldots,i^t_{m_t}\}} \vec{E}_{i1}$.
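The tag matrix of equation (15) is easy to compute once a quantum model has been fitted. The following sketch (hypothetical Python code; the genre assignments and the fitted matrices are placeholders, not data from the paper) builds $E_t$ for one tag and evaluates $\mathrm{tr}(\rho_u E_t)$, the probability that user $u$ likes a random item carrying that tag.

```python
import numpy as np

# Placeholder fitted item matrices E_{i,1} ("like" effects) for three items, indexed by item id.
E_like = {0: np.array([[0.9, 0.1], [0.1, 0.2]]),
          1: np.array([[0.6, 0.0], [0.0, 0.3]]),
          2: np.array([[0.2, 0.0], [0.0, 0.8]])}

# Placeholder tag information: items 0 and 2 carry the tag "Comedy".
items_with_tag = {"Comedy": [0, 2]}

def tag_matrix(tag):
    """E_t = (1/m_t) * sum of E_{i,1} over the items i tagged with tag, cf. equation (15)."""
    members = items_with_tag[tag]
    return sum(E_like[i] for i in members) / len(members)

# Placeholder fitted user state rho_u (psd, trace one).
rho_u = np.array([[0.7, 0.2], [0.2, 0.3]])

E_t = tag_matrix("Comedy")
print(np.trace(rho_u @ E_t))   # probability that u likes a uniformly random item tagged "Comedy"
```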
