On Multistage Learning a Hidden Hypergraph A. G. D’yachkov, I.V. Vorobyev, N.A. Polyanskii and V.Yu. Shchukin Lomonosov Moscow State University, Moscow, Russia Email: [email protected], [email protected], [email protected], [email protected] Abstract—Learning a hidden hypergraph is a natural gener- Oneofthemostimportantaspectsofanysearchingstrategy alization of the classical group testing problem that consists in isits adaptiveness.Analgorithmisnon-adaptiveifallqueries detectingunknownhypergraph Hun=H(V,E) bycarrying out arecarriedoutinparallel.Analgorithmisadaptiveifthelater edge-detecting tests. In the given paper we focus our attention queries may depend on the answers to earlier queries. only on a specific family F(t,s,ℓ) of localized hypergraphs for which the total number of vertices |V| = t, the number of By Nna(t,s,ℓ) (Na(t,s,ℓ)) denotethe minimalnumberof edges |E| 6 s, s ≪ t, and the cardinality of any edge |e| 6 ℓ, queries in an F(t,s,ℓ)-searching non-adaptive (adaptive) al- ℓ ≪ t. Our goal is to identify all edges of Hun ∈ F(t,s,ℓ) gorithms.IntroducetheasymptoticrateofF(t,s,ℓ)-searching 6 by using the minimal number of tests. We develop an adaptive non-adaptive algorithms: 1 algorithm that matches the information theory bound, i.e., the 0 totalnumberoftestsofthealgorithmintheworstcaseisatmost Rna(s,ℓ), lim log2t , 2 sℓlog2t(1+o(1)). We also discussa probabilisticgeneralization t→∞Nna(t,s,ℓ) y of the problem. In similar way we define the asymptotic rate Ra(s,ℓ) of a Keywords: Group testing problem, learning hidden hy- F(t,s,ℓ)-searching adaptive algorithms. M pergraph, capacity, asymptotic rate, cover-free code Therestofthepaperisorganizedasfollows.InSect.II,we 7 I. INTRODUCTION discuss previously known results and remind the concept of 1 A. Notations and Definitions cover-free codes which is close to the subject. In Sect. III, we present the main result of the paper and provide the Let |A| denote the size of a set A, , denotes the equality ] deterministic adaptive algorithm that matches the information T by definition and [N] , {1,2,...,N} is the set of integers theory bound. Finally, in Sect. IV we discuss a probabilistic I from 1 to N. A hypergraph is the pair H , H(V,E) such . generalizationoftheproblemoflearningahiddenhypergraph. s that E ⊂2V \∅, where V is the set of vertices and c [ E ,{(e ,...e ) : e ⊂V, i∈[s]} II. PREVIOUS RESULTS 1 s i For the particular case ℓ = 1, the above definitions were 3 is an s-set of edges. A set S ⊂ V is called an independent already introduced to describe the model called designing v 5 set of H if it does not contain entire edges of H. We denote screening experiments. It is a classical grouptesting problem. 0 bydim(H)thecardinalityofthelargestedge,i.e.,dim(H)= We refer the reader to the monograph [10] for a survey 7 max|ei|. on group testing and its applications. It is quite clear (e.g., 6 i∈[s] see [10]) that an F(t,s,1)-searching adaptive algorithm can 0 B. Statement of the problem achieve the information theory bound, i.e., N(t,s,1) = . 1 The problem of learning a hidden hypergraph is described slog t(1+o(1)) as t→∞. Therefore, Ra(s,1)=1/s. 0 2 [8] as follows. Suppose there is an unknown (hidden) hyper- If ℓ = 2, then we deal with learning a hidden graph. 6 graph H = H(V,E) whose edges are not known to us, One important application area for such problem is bioinfor- 1 un : but we know that the unknown hypergraph Hun belongs to matics [9], more specifically, chemical reactions and genome v the known family F of hypergraphs having certain specific sequencing.Alonetal.[7],andAlonandAsodi[6]givelower i X structure (e.g, F consists of all Hamiltonian cycles on V). and upper bounds on the minimal number of tests for non- r Our goal is to identify all edges of E by carrying out the adaptive searching algorithms for certain families of graphs, a minimal number N of edge-detecting queries Q(S), where such as stars, cliques, matchings. In [9], Boevel et al. study S ⊆V: Q(S) = 0 if S is independentof H , and Q(S) = 1 the problem of reconstructing a Hamiltonian cycle. In [5], un otherwise. Angluin et al. give a suboptimal F(t,s,2)-searching adaptive Inthegivenpaperwefocusourattentiononlyonthefamily algorithm. More precisely, they prove Ra(s,2)>1/(12s). F(t,s,ℓ) of hypergraphs introduced in the abstract, namely: For the general case of parameters s and ℓ, Abasi et al. givenintegerss,ℓandt,suchthats+ℓ<t,thesetofvertices have recently provided [11] a suboptimal F(t,s,ℓ)-searching V ,[t] and the family F(t,s,ℓ), consists of all hypergraphs adaptive algorithm. In particular, from their proofs it follows H(V,E) such that dim(H) 6 ℓ and |E| 6 s. Suppose we Ra(s,ℓ) > 1/(2sℓ). This bound differs up to the constant knowthatthehypergraphH belongstothefamilyF(t,s,ℓ). factor from the information theory upper bound Ra(s,ℓ) 6 un An algorithm is said to be an F(t,s,ℓ)-searching algorithm 1/(sℓ). In Sect. III we improve the lower bound and provide if it finds H , i.e. there exists only one hypergraph from an F(t,s,ℓ)-searchingadaptivealgorithmwhich is optimalin un F(t,s,ℓ) that fits all answers to the queries. terms of the asymptotic rate. A. Cover-Free Codes leads to log t 1 A binary N ×t-matrix 2 6 +o(1), t→∞, N sℓ X =kx (j)k, x (j)=0,1, i∈[N], j ∈[t] (1) Therefore, we complete the proof. i i The key result of this paper is given as follows. is called a code of length N and size t. By x and x(j) i Theorem 2. There exists an adaptive F(t,s,ℓ)-searching we denote the i-th row and the j-th column of the code X, algorithmwhichhasatmostsℓlog t(1+o(1))edge-detecting respectively. 2 queries. In other words, the rate Ra(s,ℓ)=1/(sℓ). Before we give the well-known definition of cover-free In order to prove Theorem 2 we provide a deterministic codes, note that any F(t,s,ℓ)-searching non-adaptive algo- F(t,s,ℓ)-searching adaptive algorithm. rithm consisting of N queries can be represented by a binary Proof of Theorem 2. N×tmatrixX suchthateachtestcorrespondstotherow,and Let H = H(V,E) be a hidden hypergraph from the un each vertex stands for the column. We put x (j) = 1 if the i family F(t,s,ℓ). We will call a vertex v ∈V active, if there j-th vertex is included to the i-th test; otherwise, xi(j)=0. existsatleastoneedgee∈E suchthatv ∈e.By F,F ⊂V, Definition 1. [1]. A code X is called a cover-free (s,ℓ)- denotethesetofalreadyfoundactivevertices.ByE′,E′ ⊂E, code (briefly, CF (s,ℓ)-code) if for any two non-intersecting denote the set of already found edges of H , i.e., if e ∈ E un sets S, L ⊂ [t], |S| = s, |L| = ℓ, S ∩L = ∅, there exists a and e ⊂ F, then e ∈ E′. Note that a pair (V,E′) can be row x , i∈[N], for which i viewed as a partial hypergraph of H . Let S, S ⊂ V, be a un x (j)=0for anyj ∈S, query we deal with. Before we start the proposed algorithm, i (2) we set F =∅, E′ =∅ and S =V. x (k)=1for anyk ∈L. i Firstly we describe an algorithm depicted as Alg. 2 which Taking into account the evident symmetry over s and ℓ, we allows us to to find a new active vertex v, i.e., v 6∈ F. The introduce Ncf(t,s,ℓ) = Ncf(t,ℓ,s) - the minimal length of inputof the algorithmare set F and a query S which contain CF (s,ℓ)-codesof size t anddefine the asymptotic rate of CF at least one edge e∈E and e6∈E′. We set S′ =S\F and (s,ℓ)-codes: S′′ =S\S′.AteachfurtherstepweguaranteethatS′contains a new active vertex. While |S′| > 1, we run the following log t Rcf(s,ℓ)=Rcf(ℓ,s),tl→im∞Ncf(t,2s,ℓ). (3) ip.reo.,cSed′u=re.SSp⊔litSup, |SS′|in=to⌈t|wSo′|/e2q⌉uaalnsdiz|eSd|su=bs⌊e|tSs′S|/12⌋a.ndThSe2n, 1 2 1 2 In [2], Dyachkov et al. show that any CF (s,ℓ)-code rep- wecarryoutaqueryS ⊔S′′.IfQ(S ⊔S′′)=1thenitmeans 1 1 resents a F(t,s,ℓ)-searching non-adaptive algorithm, while thatS containsatleast onenew active vertex,since fromthe 1 any F(t,s,ℓ)-searching non-adaptive algorithm corresponds previousstepsoftheprocedurewehaveQ(S′′)=0.Therefore to both a CF (s,ℓ−1)-code and CF (s−1,ℓ)-code.The best we set S′ =S and repeat the procedure. If Q(S ⊔S′′)=0 1 1 presently known upper and lower bounds on R(s,ℓ) of CF then at least one new active vertex must lie in S , since from 2 (s,ℓ)-codes were presented in [2], [3]. If ℓ > 1 is fixed and the previous steps we also have Q(S ⊔S ⊔S′′) = 1. Thus 1 2 s → ∞, then these bounds lead to the following asymptotic we set S′ = S , S′′ = S ⊔S′′ and repeat the procedure. At 2 1 equality: final (|S′|=1) we know that the unique vertex v of S′ is an active vertex of H and v 6∈ F. Notice that Alg. 2 can be (ℓ+1)ℓ+1log s un 2 (1+o(1))>Rna(s,ℓ) seen as a variation of the binary vertex search. 2eℓ−1 sℓ+1 SecondlyweprovideanalgorithmdepictedasAlg.3which ℓℓ log e allows us to to find all edges E′ composed on already found ≃R (s,ℓ)> 2 (1+o(1)). (4) cf eℓ sℓ+1 active vertices F. The only input of the algorithm is the set F. After we find a new active vertex v we can update the set III. ADAPTIVE LEARNING A HIDDEN HYPERGRAPH E′ by searching edges containing v. But since |F| 6 sℓ ≪t By a counting argument, the lower bound is true. we can set E′ = ∅ and run the following procedure over all Theorem 1. Any F(t,s,ℓ)-searching algorithm has at S such that S ⊂ F and |S| 6 ℓ. If there is no edge e ∈ E′ leastsℓlog2t(1+o(1))edge-detectingqueries.Inotherwords, such that e⊂S, then carry out a query S. If Q(S)=1 then the rate Ra(s,ℓ)61/(sℓ). we delete all edges e ∈ E′ such that S ⊂ e and add edge Proof of Theorem 1. e=S to E′. Note thatAlg. 3 representsan exhaustivesearch Let N be the minimal number of tests in the worst case of edges. among all adaptive F(t,s,ℓ)-searching algorithms. Then we Thirdly we present an algorithm depicted as Alg. 4 which have allows us to to find a query S such that S contains at least 2N >|F(t,s,ℓ)|. one edge e∈E and e6∈E′ (as a consequence S contains at leastoneactivevertexv 6∈F).Theonlyinputofthealgorithm This inequity along with the asymptotic equality is the set E′. We initialize A as vertices included to at least tsℓ one edge e ∈ E′, set B = V \A and S = ∅. One can see |F(t,s,ℓ)|= (1+o(1)), t→∞, (ℓ!)ss! that |A| 6 sℓ ≪ t. Then we run the following procedure for all sets C ⊂ V such that B ⊂ C, and ∄e ∈ E′, e ⊂ C. Data: set F ⊂V If Q(C) = 1 then we set S = C and exit the procedure. If Result: subset of edges E′ ⊂E Q(C) = 0 we proceed to the next query C. If we finish the initialization E′ :=∅; procedure and have S = ∅, then it means we have found all for ∀S ⊂F: 16|S|6ℓ do edges of H , i.e., E′ =E. Note that Alg. 4 is an exhaustive if ∄e∈E′ : e⊂S then un query search. if Q(S) = 1 then We give a full description of the proposed F(t,s,ℓ)- for ∀e∈E′ : S ⊂e do searching algorithm by Alg. 1, and this algorithm is based on delete e from E′; Alg. 2, 3 and 4. We set F = ∅, E′ = ∅ and S = V. While end Alg. 3 gives a query S 6=∅, we run the following procedure. add edge e=S to E′; With the help of Alg. 2 we find a new active vertex v and end add itto F. Thenwe use Alg. 3 to updatethe set ofedgesE′ end composed on already found active vertices F. After that we end Algorithm 3: Searching edges run Alg. 4 to find an appropriatequery S, which then will be used in Alg. 2. Now we upper bound the number of tests of Alg. 1 in the Data: subset of edges E′ ⊂E worst case. Let|V|=t. It is easy to check thatAlg. 2 usesat Result: query S most ⌈log |S|⌉6⌈log t⌉ tests. One can see that the number initialization A:={v : v ∈e∈E′}; B :=V \A; 2 2 of active vertices in the hidden hypergraph H ∈ F(t,s,ℓ) S :=∅; un is at most sℓ. Alg. 3 uses at most F (s,ℓ) tests, while Alg. 4 for ∀C ⊂V: B ⊂C and ∄e∈E′, e⊂C do 1 uses at most F (s,ℓ) tests, where the functionsF and F do if Q(C) = 1 then 2 1 2 not depend on t. We can upper bound the number of cycles S :=C and break “for loop”; in Alg. 1 by the number of active vertices. Therefore, the end total number of tests for the given algorithm does not exceed end sℓ(log t+F (s,ℓ)+F (s,ℓ)+1). Algorithm 4: Searching a query 2 1 2 Data: set of vertices V of H ∈F(t,s,ℓ) un round) algorithm. In the given section we will limit ourselves Result: set of edges E of H un initialization E′ :=∅; F :=∅; S :=V; to the consideration of only so called two-stage searching procedures (see, e.g., [12]). It means we can adapt the tests while S 6=∅ do of the second round of testing only one time after we receive perform Alg. 2, find v 6∈F and F :=F ⊔v; perform Alg. 3, and find subset of edges E′; the answers to the queries of the first round. Now we consider a relaxation of the problem of learning perform Alg. 4, and find query S; a hidden hypergraph. Suppose we let an algorithm iden- end Algorithm 1: Learning a hidden hypergraph tify almost all hypergraphs from the family F(t,s,ℓ). More formally, if there exist a subfamily F′(t,s,ℓ) ⊂ F(t,s,ℓ) of cardinality at least (1 − ε)|F(t,s,ℓ)| such that for any H ∈ F′(t,s,ℓ) the algorithm finds H , then we say that Data: query S ⊆V, Q(S)=1, and set F ⊂V un un the algorithm is F(t,s,ℓ)-searching with probability (1−ε). Result: vertex v ∈V, v 6∈F, and ∃e∈E, v ∈e ByNna(t,s,ℓ,ε)(Na(t,s,ℓ,ε),N2st(t,s,ℓ,ε))denotethe initialization S′ :=S\F; S′′ :=S\S′; minimal number of queries in a F(t,s,ℓ)-searching non- while |S′|>1 do adaptive(adaptive,two-stage) algorithmwith probability(1− split in half S′: S′ =S ⊔S ; 1 2 ε).Introducethecapacityofnon-adaptiveF(t,s,ℓ)-searching if Q(S ⊔S′′)=1 then 1 algorithms S′ :=S ; 1 log t else Cna(s,ℓ), lim 2 . S′ :=S , S′′ :=S′′⊔S ; ε→0 Nna(t,s,ℓ,ε) 2 1 t→∞ end InsimilarwaywedefinethecapacitiesCa(s,ℓ)andC2st(s,ℓ) end of adaptive F(t,s,ℓ)-searching algorithms and two-stage Algorithm 2: Searching a new active vertex F(t,s,ℓ)-searching algorithms, respectively. One can easily check that Theorem 1 is true for “almost” IV. CONCEPT OF “ALMOST” LEARNING A HIDDEN concept as well. From definitions it follows HYPERGRAPH 1 Cna(s,ℓ)6C2st(s,ℓ)6Ca(s,ℓ)= , Remind that there are two natural types of algorithms, sℓ namely non-adaptive and adaptive. A compromise between where the right-hand side equality holds in virtue of Theo- these two types is usually called a multistage (or multiple rem 2. Forthecaseℓ=1thedefinitionofthecapacitywasconsid- FormatrixX we willcallallsuch hypergraphsgood.For any ered in many papers. We refer the reader to the classic result binary (N ×t) matrix X denote the set of good hypergraphs [4] in model of designing screening experiments. Malyutov by G(X), G(X) ⊂ F′(t,s,ℓ). Notice that we have proved proved that Cna(s,1)=1/s. |G(X )|>(1−ε)|F′(t,s,ℓ)|. 1 We conjecture that Cna(s,ℓ) = 1/(sℓ). The following Now we reformulate the statement derived in [4]. theorem reinforces the hypothesis. Lemma 1. The capacity of non-adaptive F(t,ℓ,1)- Theorem 3. The capacity of two-stage F(t,s,ℓ)- searching algorithms Cna(ℓ,1)=1/ℓ. searching algorithms C2st(s,ℓ)=1/(sℓ). Notice that if we are given with a code X representing WeproveTheorem3usingtheprobabilisticmethodandthe a non-adaptiveF(t,ℓ,1)-searchingalgorithm with probability result established in [4] for the case ℓ=1. (1−ε) then we can replace each entry of X by the following Proof of Theorem 3. rule: 0 → 1, 1 → 0, and get the code Y, which represents Firstly we note that the subfamily F′(t,s,ℓ) ⊂ F(t,s,ℓ) a non-adaptiveF(t,1,ℓ)-searchingalgorithm with probability which consists of s pairwise non-intersecting edges of size ℓ, (1−ε). i.e., Roughlyspeaking,foranygoodhypergraphfromG(X ) it 1 willbesufficienttoapplysnon-adaptiveF(t ,1,ℓ)-searching H ∈F(t,s,ℓ): H =(e ,...,e ), i F′(t,s,ℓ)= 1 s , algorithms with probabilities (1−ε) in parallel at the second (|ei|=ℓ, ei∩ej =∅ for any i6=j) stage of our strategy in order to find s edges in each Vi, ti = |V |. has cardinality |F(t,s,ℓ)|(1 + o(1)) as t → ∞. Thus, for i More formally, let E(N,t,X) be the ensemble of binary any ε > 0 and for sufficiently large t it is enough to prove theexistenceoftwo-stageF′(t,s,ℓ)-searchingalgorithmwith (N × t) codes that consists of all possible permutations of columns of a code X representing a non-adaptive F(t,1,ℓ)- probability (1−ε). searching algorithm with probability (1−ε), and each copy Now we show that there exists a binary matrix (each test of X is chosen equiprobably with probability 1/t!. Let Y correspondstotherow,andeachvertexstandsforthecolumn) be a random matrix of E(N,t,X). For a good hypergraph correspondingtothefirststageofgrouptestingsuchthatafter H ∈ G(X ) let we be given with an appropriate partition carrying out tests of the first stage for almost all hypergraphs 1 from F′(t,s,ℓ) we can find a good partition of vertices into into disjoint sets V1 ⊔ V2··· ⊔ Vs = V = [t] such that s disjoint sets: V ⊔V ···⊔V = V = [t], such that e ∈ e1 ∈ V1, e2 ∈ V2,...es ∈ Vs. At the second stage of our 1 2 s 1 V , e ∈ V ,...,e ∈ V . Define the ensemble E(N,t,s) strategy for vertices V1 we will apply N tests of Y without 1 2 2 s s usingverticesV \V , forverticesV we willapplyN tests of of s-ary (N ×t)-matrices X = ||x (j)||, where each entry 1 2 i Y without using vertices V \V and so on. Define the event x (j) is chosen independently and equiprobably from the set 2 i B(s,ℓ): “all edges of H can be found applying sN tests”. {1,2,...,s}. Each s-ary symbol x in X is then replaced by Estimate the probability of the opposite event to B(s,ℓ): the binary column of length s, which has only one 1 at x- th position. In other words, the s-ary (N × t)-matrix X is P(B(s,ℓ))6sε. replaced by the binary (sN ×t)-matrix X , and each s-ary 1 row of X is replaced by the binary (s×n)-layer of X1 such It means that there exists such a code X2 which is a copy that each column of the layer contains only one 1. Define the (obtained by a column permutation) of X such that for (1− eventA(s,ℓ):“givenahypergraphH ∈F′(t,s,ℓ),thereexists sε)·|G(X1)| good hypergraphs we find all edges after the a layer in X such that the answers to all s edge-detecting second stage of our strategy. 1 queriesinthelayerare1’s”.Ifthisconditionholds,thenweare Finally, for any ε > 0 there exists an ascending sequence presentedwithagoodpartitionintodisjointsets:V ⊔V ···⊔ of t such that for (1−ε)|F(t,s,ℓ)| hypergraphs there exists 1 2 Vs = V = [t], such that e1 ∈ V1, e2 ∈ V2,...,es ∈ Vs. a code X1, which can be used for the first stage of F(t,s,ℓ)- Denote the cardinality of Vi by ti = |Vi|. Now estimate the searching algorithm with probability (1−ε), and a code X2, probability of the opposite event to A(s,ℓ): the modificationof which can be appliedfor the second stage ofthestrategy,suchthatthetotalnumberoftestsissufficiently N s! determined by only queries of the code X and is equal to P A(s,ℓ) = 1− . 2 ssℓ sℓlog t(1+o(1)). (cid:16) (cid:17) (cid:18) (cid:19) 2 Let N → ∞ and N = o(log2t). Then for any ε > 0 there REFERENCES exists N(ε) such that for any N > N(ε) the probability [1] Mitchell C.J., Piper F.C., “Key storage in Secure Networks”, Discrete P A(s,ℓ) 6 ε. It means that for any ε > 0, and for AppliedMathematics, v.21,pp.215-228, 1988. suf(cid:16)ficiently(cid:17)larget and forN =o(log2t) (the numberof tests [2] DfinyiatechskeotvsiAn.wGh.i,chVilneonkininterPs.e,cMtioancuolafsAe.t,saisndcoTvoerrnedeybDy.t,h“eFuanmioilniesofosf of the first stage is negligible)there exists a binary (sN×t)- others”, J.Combin.Theory.Ser.A,99(2002),pp.195-218. matrixX1whichcanbeusedinthefirststageofthealgorithm, [3] Lebedev V.S. Asymptotic Upper Bound for the Rate of (w,r) Cover- andforatleast(1−ε)|F′(t,s,ℓ)|hypergraphsfromF′(t,s,ℓ) FreeCodes//ProblemsofInformationTransmission.2003.V.39.n4. P.317-323. thereexistsedge-detectingqueries,answersofwhichare1’s, [4] Malyutov M.B.,“TheSeparating PropertyofRandomMatrices”, Math- and supports of these queries are pairwise non-intersecting. ematical Notes,vol.23,no.1,pp.84-91,1978. [5] Angluin D.,Chen J., “Learning ahidden graph usingO(log n)queries peredge”, JComputSystSci,v.74,pp.546-556,2008. [6] Alon,N.,andAsodi,V,“Learningahiddensubgraph”.SIAMJ.Discrete Math.18,4(2005),pp.697-712. [7] Alon,N.,Beigel,R.,Kasif,S.,Rudich,S.,andSudakov, B.,“Learninga hiddenmatching”. SIAMJ.Comput.33,2(2004),pp.487-501. [8] Angluin, D., and Chen, J. “Learning a hidden hypergraph. Journal of Machine LearningResearch 7(2006),pp.2215-2236. [9] Bouvel, M., Grebinski, V., and Kucherov, G. “Combinatorial search on graphsmotivatedbybioinformaticsapplications:Abriefsurvey”.InWG (2005),pp.16?27. [10] Du, D.-Z. and Hwang, F.K., “Combinatorial Group Testing and Its Applications”, Singapore: WorldSci.,2000,2nded. [11] Abasi, H., Bshouty, N.H., and Mazzawi, H., “On Exact Learning MonotoneDBFfromMembershipQueries”,LectureNotesinArtificial Intelligence, 2014,pp.111-124. [12] DeBonisA.,GasieniecL.,VaccaroU.,Optimaltwo-stagealgorithmsfor group testing problems, SIAMJ. Comp.,vol. 34,no. 5pp. 1253-1270, 2005.