ebook img

Optimal Threshold Control by the Robots of Web Search Engines with Obsolescence of Documents PDF

0.24 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Optimal Threshold Control by the Robots of Web Search Engines with Obsolescence of Documents

Optimal Threshold Control by the Robots of Web Search Engines with Obsolescence of Documents 2 1 0 Konstantin Avrachenkov∗, Alexander Dudin†, Valentina Klimenok‡, 2 Philippe Nain§, Olga Semenova¶ n a J 9 1 Abstract ] A typical Web Search Engine consists of three principal parts: I N crawling engine, indexing engine, and searching engine. The present work aims to optimize the performance of the crawling engine. The . s crawling engine finds new Web pages and updates Web pages existing c [ in the database of the Web Search Engine. The crawling engine has severalrobotscollectinginformationfromtheInternet. Wefirstcalcu- 1 late various performance measures of the system (e.g., probability of v 0 arbitrarypagelossduetothebufferoverflow,probabilityofstarvation 5 of the system, the average time waiting in the buffer). Intuitively, we 1 wouldliketoavoidsystemstarvationandatthesametimetominimize 4 theinformationloss. Weformulatetheproblemasamulti-criteriaopti- . 1 mizationproblemandattributingaweighttoeachcriterionwesolveit 0 intheclassofthresholdpolicies. WeconsideraverygeneralWebpage 2 arrivalprocessmodeledby BatchMarkedMarkovArrivalProcessand 1 a very general service time modeled by Phase-type distribution. The : v model has been applied to the performance evaluation and optimiza- i X tionofthecrawlerdesignedbyINRIAMaestroteaminthe framework of the RIAM INRIA-Canon research project. r a 1 Introduction The problem of control by the robots (crawlers) that traverse the Web and bring Web pages to the indexing engine that updates the data base of a ∗INRIASophia Antipolis, France, [email protected] †Belarusian StateUniversity,Belarus, [email protected] ‡Belarusian StateUniversity,Belarus, [email protected] §INRIASophia Antipolis, France, [email protected] ¶Belarusian StateUniversity,Belarus, [email protected] 1 Web Search Engine is formulated and analyzed in [13]. This problem is formulated in [13] as the controlled queueing system. The system has a single server with the exponential service time distribution, finite buffer of capacity K − 1, K ≥ 2. There are N available robots and each of these robots, when activated, brings pages to the server in a Poisson stream at fixed rate. These N stationary Poisson processes are mutually independent and independent of service times. Thenumberofactive robotsmay bemodifiedatany arrivalordeparture epoch. When an arrival occurs, the incoming robot is de-activated at once; the controller may then decide to keep it idle or to activate it. When a departure occurs the controller may either decide to activate one additional robot, if one is available, or to do nothing (i.e. the number of active robots is left unchanged). In [13], the problem of finding a policy that minimizes a weighted sum of the loss rate and starvation probability (probability of the empty system) is considered. It is solved by means of the tools of the Markov Decision Problems theory. As the possible generalizations of the model, which are certainly worth- while analyzing, the following ones are mentioned in [13]: • More general input processes, e.g., a MMPP (Markov Modulated Poisson Process) should be considered so as to reflect more accurately “traveling times” of robots in the network; • Because of the obsolescence of stored documents issue, the waiting time should be bounded, even if the buffer size is effectively infinite; • Other cost functions could beinvestigated, for instance, cost functions including response times. In this paper, we made all the mentioned and some further generaliza- tions. We assume that, under the fixed number of currently active robots, the arrival processis of theBMAP type. TheBMAP is amoregeneralprocess comparing to the MMPP and allows delivering of a batch of Web pages to be indexed while the MMPP assumes that the pages are delivered one-by- one. It is very typical for a computer system to operate in batch mode. We assume that the service time distribution is of the PH (Phase) type which is much more general comparing to the exponential distribution as- sumed in [13]. The class of phase type distributions is dense in the field of all positive-valued distributions and practically we can deal with any real distribution [2]. 2 Sincewebpagescanbecomeobsolete,weboundthewaitingtimestochas- tically. Waiting time of each web page in a buffer is restricted by a random variable having PH distribution identical and mutually independent for all Web pages. The phase type distribution has been used to model obsoles- cence times for instance in [15]. We supposethat the cost function can have a more general form than in [13] and include the obsolescence probability and response time. In the next section we formulate the model and optimization problem. Section3containsthesteady-stateanalysisofthemulti-dimensionalMarkov chain which defines dynamics of the system under the fixed values of the parameters defining the strategy of control. In Section 4, main performance measures of the system are computed. In Section 5, the conditional sojourn time distributions are calculated. In Section 6, the case of ordinary arrivals is touched in brief. In Section 7, the theoretical results are illustrated by numericalexamples. Inparticular, themathematical modelisappliedtothe performance evaluation and optimization of the robot designed by INRIA MaestroteamintheframeworkoftheRIAMINRIA-Canonresearchproject. Section 8 concludes the paper. 2 Mathematical Model We consider a single server system with the finite buffer of capacity K − 1, K ≥ 2. So, the total number of Web pages which can stay in the system is restricted by the number K. Web pages are served by a server in order of their arrivals. Service times of Web pages are independent identically distributed ran- domvariableshavingPH distributionwithirreduciblerepresentation(β,S). It means the following. Service of a Web page is defined as a time until the continuous-timeMarkov chainm , t ≥ 0,havingthestates (1,...,M)asthe t transientandstate0asabsorbingonereachestheabsorbingstate. Aninitial state of the chain is selected in a random way, according to the probabil- ity distribution defined by the row-vector (β,0), where β is the stochastic row vector of dimension M. Transitions of the Markov chain m , t ≥ 0, t S S are described by the generator 0 where the matrix S is a sub- 0 0 (cid:18) (cid:19) generator and the column vector S is defined by S = −Se and has all 0 0 M non-negative and at least one positive components, e is the column vector M of dimension M consisting of all 1’s. The average service time b is given 1 by b = β(−S)−1e . For more details about the PH type distribution, its 1 M 3 properties, special cases and applications see [11, 12]. Web pages can be delivered into the system by N available robots. The number of active robots varies in the set {1,...,N}. We assume that the process of Web pages delivering under l, l = 1,N, active robots is described as follows. Let ν , t ≥ 0, be an irreducible continuous time Markov chain t having finite state space {0,1,...,W}. Sojourn time of the chain ν , t ≥ 0, t (l) in the state ν has exponential distribution with a parameter λ . After this ν time expires, with probability p(l)(ν,ν′) the chain jumps into the state ν′ 0 without generation of Web pages and with probability p(l)(ν,ν′) the chain k jumps into the state ν′ and a batch consisting of k Web pages is generated, k ≥ 1. The introduced probabilities satisfy conditions: ∞ W W p(l)(ν,ν) = 0, p(l)(ν,ν′)+ p(l)(ν,ν′) = 1, ν = 0,W, l = 1,N. 0 k 0 k=1ν′=0 ν′=0 XX X (l) TheparametersdefiningthisflowarekeptinthesquarematricesD , k ≥ k 0, l = 1,N, of size W¯ = W +1 defined by their entries: (D(l)) = −λ(l), (D(l)) = λ(l)p (ν,ν′), (1) 0 ν,ν ν 0 ν,ν′ ν 0 (D(l)) = λ(l)p(l)(ν,ν′), ν,ν′ = 0,W, k ≥ 1, l = 1,N. k ν,ν′ ν k Denote ∞ D(l)(z) = D(l)zk, |z| ≤ 1. k k=0 X The matrix D(l)(1) is the infinitesimal generator of the process ν ,t ≥ 0, t underthefixednumberlofactiverobots. Thestationarydistributionvector θ(l) of this process satisfies the equations θ(l)D(l)(1) = 0,θ(l)e = 1. Here and in the sequel, 0 is the zero row vector. The average intensity λ(l) (fundamental rate) of the BMAP under the fixed number l of active robots is defined by dD(l)(z) λ(l) = θ(l) | e, z=1 dz (l) and the intensity λ of group arrivals is defined by g λ(l) = θ(l)(−D(l))e. g 0 The variance v(l) of intervals between group arrivals is calculated as v(l) = 2(λ(l))−1θ(l)(−D(l))−1e−(λ(l))−2, g 0 g 4 (l) while the correlation coefficient c of intervals between successive group cor arrivals is given by 1 c(l) = λ(l)θ(l)(−D(l))−1(D(1)(l) −D(l))(−D(l))−1e−1 . cor v(λ(l))2 g 0 0 0 g (cid:18) (cid:19) The introduced representation of the arrival process via the matrices (l) D , k ≥ 0, l = 1,N, unifies several possible interpretations of the process k of Web pages delivered by the fixed number of active robots: 1. The processes of Web pages delivering by all robots are independent BMAP processes. Let the process of Web pages delivering by the lth robotbetheBMAP whichisgovernedbythecontinuoustimeMarkov (l) chain ν , t ≥ 0, having finite state space {0,1,...,W } and defined t l by the matrices D(l), k ≥ 0, of size W¯ = W +1. See [10] for more k l l details about the BMAP, its properties and special cases. We denote ∞ D(l)(z) = D(l)zk, |z| ≤ 1. k k=0 Let us assuPme that the robots are arranged in such a way that the first robot is always active, then, when a queue decreases, the second robot can be activated, etc, the Nth robot is the most rare activated. N The matrices D(l), k ≥ 0, l = 1,N, of size W¯ = W¯ defined k k k=1 by formulae (1) are expressed via the matrices D(l), kQ≥ 0, of size k W¯ , l = 1,N, describing the BMAPs in the following way: l D(l) = D(1)⊕D(2)⊕···⊕D(l)⊗D(l+1)(1)⊗···⊗D(N)(1), 0 0 0 0 (l) (1) (2) (l) D = D ⊕D ⊕···⊕D ⊗I ⊗···⊗I , k ≥ 1. k k k k W¯l+1 W¯N Here ⊗ and ⊕ denote Kronecker product and sum of matrices cor- respondingly (see, e.g., [7]), I denotes identity matrix of size L. If L the size of the matrix is clear from context the suffix can be omitted. def I = 1. 0 2. The common process of Web pages delivered by all robots together is the BMAP process directed by the continuous time Markov chain ν , t ≥ 0, having finite state space {0,1,...,W} and defined by the t matrices D , k ≥ 0, of size W¯ . Some set of thinning probabilities k q ,...,q , 0 < q < ··· < q = 1, is fixed. When l robots are 1 N 1 N active, procedure of thinning the BMAP process with the thinning 5 probability q is applied. It means that an arbitrary arriving batch is l accepted with probability q and is rejected with the complimentary l ∞ probability 1−q . We denote D(z) = D zk, |z| ≤ 1. l k k=0 P (l) The matrices D , k ≥ 0, l = 1,N, defined by formulae (1) are ex- k pressed via the matrices D , k ≥ 0, describing the common BMAP k and via the thinning probability q in the following way: l (l) (l) D = D q +D(1)(1−q ), D = D q , k ≥ 1. 0 0 l l k k l 3. Let the process of Web pages delivering by robots is described by a BMMAP. The BMMAP is directed by continuous time Markov chain ν , t ≥ 0, having finite state space {0,1,...,W}. Sojourn time t of the chain ν , t ≥ 0, in the state ν has exponential distribution with t a parameter λ . After this time expires, with probability p (ν,ν′) the ν 0 chain jumps into the state ν′ without generation of Web pages and with probability p(l)(ν,ν′) the chain jumps into the state ν′ and a k batch consisting of k, k ≥ 1 Web pages are delivered by the lth robot. The introduced probabilities satisfy conditions: N ∞ W W p (ν,ν)= 0, p(l)(ν,ν′)+ p (ν,ν′)= 1, ν = 0,W. 0 k 0 l=1 k=1ν′=0 ν′=0 XXX X (l) The matrices D , k ≥ 0, l = 1,N, defined by formulae (1) are ex- k (l) pressed via the matrices D , D , k ≥ 1, l = 1,N, formed by the 0 k probabilities p (ν,ν) and p(l)(ν,ν′) in the following way: 0 k ∞ N l (l) (m) (l) (m) D = D + D , D = D , k ≥ 1. 0 0 j k k j=1m=l+1 m=1 X X X 4. The process of Web pages delivery by all robots is the BMAP pro- cess directed by the continuous time Markov chain ν , t ≥ 0, having t finite state space {0,1,...,W} and defined by the transition intensity (l) matrices D , k ≥ 0, l = 1,N, depending on the number of active k robots. Interpretation 3 seems be the most attractive because it assumes that the work of the robots can be dependent, which is quite realistic. Because the 6 totalamountofWebserversfromwhichnewpagesshouldbebroughtismore orless constant, reductionofthenumberofactive robotscauses theincrease of field in Internet, which is scrawled by each robot, and corresponding change of travel time. So, the BMMAP looks to be the most realistic model of Web pages delivery. If a batch of delivered Web pages meets free server one Web page starts the service immediately while the rest moves to the buffer. If the server is busy at an arrival epoch, all Web pages of the batch are placed into the buffer if there is enough free space in the buffer. If the number of free places in the buffer is less than the number of Web pages in the batch, the corresponding number of Web pages is lost. It means that we consider so called partial admission strategy. The alternative strategies of complete rejection or complete admission can be investigated in analogous way. For each Web page placed into the buffer, the waiting time is restricted by the random variable (so called obsolescence time) having PH distribu- tion withirreduciblerepresentation (γ,Γ).Itmeansthefollowing. Available waiting time of the ith Web page in the buffer is defined as a time until the (i) continuous-time Markov chain r , t ≥ 0,havingthestates (1,...,R)as the t transient and state 0 as absorbing one, reaches the absorbing state. Transi- tion of this process into the absorbing state means that this Web page gets out of date (obsolescence or dashout occurs). An initial state of the chain is selected in a random way, according to the probability distribution defined by the row-vector (γ,0), where γ is the stochastic row vector of dimension (i) R. Transitions of the Markov chain r , t ≥ 0, are described by the genera- t Γ Γ tor 0 where the matrix Γ is sub-generator and the column vector 0 0 (cid:18) (cid:19) Γ is defined by Γ = −Γe. The average time until obsolescence g is given 0 0 1 by g = γ(−Γ)−1e. 1 If the obsolescence time expires before a Web page is picked-up from the buffer to the server, it is assumed that this Web page immediately leaves the buffer and is lost. The obsolescence times of different Web pages are in- dependent of each other and identically distributed. It is worth to note that the analysis presented below could be drastically simplified if we suggest that the obsolescence time is exponentially distributed. However, this sug- gestion rarely holds true in the real world systems because this suggestion means that, with high probability, information obsoletes very quickly. Reasonable class of strategies of control by robots is the class of the threshold strategies defined as follows. Integers j ,...,j are fixed such 1 N−1 as −1 = j < j < ··· < j < j = K. If the number i of Web pages in 0 1 N−1 N 7 the system satisfies inequality j +1 ≤ i ≤ j , then N −r+1 robots are r−1 r active and r−1 robots are de-activated, r = 1,N. Note that the described threshold strategies are popular in literature in controlled queues, see, e.g., [1, 3, 5, 6, 14]. For some systems, it is proven that the optimal strategy in the class of all Markovian strategies belongs to the class of threshold strategies. For some other systems such a result is not proven,buttheoptimalstrategyissoughtintheclassofthresholdstrategies. Advantage of such strategies is their intuitive justification and relative sim- plicity ofimplementation inreal-lifesystems. Numericalexamplespresented in the paper [13] for a partial case of our model confirm that the thresh- old strategies are optimal in the class of all Markovian strategies, although authors cannot prove this fact. Our system is much more complicated and we also cannot prove optimality of the optimal threshold strategy in wider classes of strategies. We just try to find an optimal threshold strategy and believe that it is optimal or sub-optimal in wider classes as well. We also mention that the description of the given above threshold strat- egy suits only for the case when N ≤ K. While the numerical examples presented in [13] address, e.g., the case N = 16 and K = 5. However, if we look at the optimal strategy given by Figure 2 in [13], we see that the opti- mal number of the active robots varies between 4 and 14 with 4 switching points where two or three robots are activated or de-activated. Because, in contrast to [13], we do not assume that each active robot generates a sta- tionary Poisson process of arrivals at a fixed rate but we assume that each robot generates a batch Markovian arrival process, we see that our strategy suits for thecase N > K as well. We could achieve this formally by allowing non-strict inequalities: −1 = j ≤ j ≤ ··· ≤ j ≤ j = K in fixing the 0 1 N−1 N thresholds. So, speaking below about a robot we may think about a virtual robot as a group of several available real robots which are activated and de-activated simultaneously. We will solve the problem of choosing the optimal threshold strategy. The cost function is assumed to be of the following form: J = λ(c P +c P )+aV¯(1)+c N +c P (2) loss loss obs obs rob act star star where λ is the mean number of pages delivered into the system by robots duringa unitof time (fundamentalrate of the arrivalprocess), P is prob- loss ability of arbitrary page loss duetothe bufferoverflow, P is probability of obs arbitrary page obsolescence during waiting in a queue, P is probability star of starvation of the system, V¯(1) is the response time (average sojourn time of Web pages which are not lost or deleted due to obsolescence), N is the act 8 average number of active robots, c ,c ,a,c ,c are the correspond- loss obs rob star ing non-negative cost coefficients. The values of the cost coefficients can be set up by experts in the domain. Alternatively, the cost coefficients can be viewed asLagrange multipliers in theconstrained problemandcan befound from the dual problem formulation. It is clear that if the average number of active robots is increasing then the three first summands in (2) are increasing while the last summand, charge for starvation of the system, is decreasing. The last charge is im- portant because starvation means that the indexing machine is idle and so freshness of the data base suffers. The problem of minimization of the cost criterion (2) is not trivial. To solvethisproblem,wewillusesocalled directapproach. Tothisend,wewill calculate the stationary distribution of the system state under an arbitrary fixed set of thresholds (j ,...,j ). It will allow us to calculate the main 1 N−1 performance measures of the system and the value J of the cost criterion as afunctionJ(j ,...,j ).Problemoffindingtheoptimalset(j∗,...,j∗ ) 1 N−1 1 N−1 is then easy solved on computer, e.g., by enumeration. 3 Stationary distribution of the number of Web pages in the system Let some set of thresholds (j ,...,j ) be fixed. We are interested in the 1 N−1 stationarydistributionoftheprocessi , t ≥ 0,wherei isthenumberofWeb t t pagesinthesystemattheepocht, i = 0,K.Thisprocessisnon-Markovian. t To investigate this process, we will consider the following multi-dimensional continuous time Markov chain ξ = {i ,ν ,m ,r(1),...,r(it−1)}, t ≥ 0, t t t t t t where ν is the state of the directing process of arrival process at epoch t, t ν = 0,W; m is the state of the process which directs a service at epoch t, t t (i) m = 1,M; r is the state of the process, which directs a obsolescence of t t (i) the ith Web page in a queue at epoch t, r = 1,R, i = 0,K −1. t We assume that the Web pages in the buffer are numerated in order of their arrival into the system. If a batch of Web pages arrives to the system, theaccepted Web pages arenumerated in auniformrandommanner. When a Web page is picked up to the service or is deleted from the queue because its admissible waiting time expires, the rest of Web pages is immediately enumerated correspondingly. 9 Denote p(0,ν) = lim P{i = 0,ν = ν}, t t t→∞ p(1,ν,m) = lim P{i = 1,ν = ν,m =m}, (3) t t t t→∞ p(i,ν,m,r ,...,r )= 1 i−1 (1) (i−1) lim P{i = i,ν = ν,m = m,r = r ,...,r = r }, i ≥ 2. t t t t 1 t i−1 t→∞ Because the state space of the the Markov chain ξ ,t ≥ 0, is finite and due t to assumption about irreducibility of the processes defining arrival, service and obsolescence processes, limits (3) exist. Enumerate the states of the Markov chain ξ ,t ≥ 0, in the lexicographic t order and form the probability row vectors p ,i = 0,K, of probabilities i corresponding to the state i of the first component of the process ξ ,t ≥ 0. t Denote also p = (p ,...,p ). 0 K Let Q be the infinitesimal generator of the Markov chain ξ ,t ≥ 0, and t Q be the block of the generator Q consisting of enumerated in lexico- i,i′ graphic order intensities of transition of the Markov chain ξ ,t ≥ 0, from t the states with the value i of the component i to the states with the value t i′ of this component, i,i′ ≥ 0. Dimension of the block Q is defined by i,i′ (W¯ MaiRbi)×(W¯ Mai′Rbi′) where al = min{l,1}, bl = max{(l−1),0}, l = 0,K. Lemma. Non-zero blocks Q of the infinitesimal generator Q of the i,i′ Markov chain ξ ,t ≥ 0, are defined by t (N) D , j = 0, 0 Q =  Dj(N)⊗β⊗γ⊗(j−1), j = 1,K −1, 0,j  (D(N)(1)−K−1Dr(N))⊗β⊗γ⊗(K−1), j = K, r=0  P   I ⊗S , i= 1, Q = W¯ 0 i,i−1 I ⊗B , i > 1, (cid:26) W¯ i−1 (χ(i)) D ⊕A , i= 1,K −1, Q = 0 i−1 i,i ( D(χ(i))(1)⊕Ai−1, i= K, D(χ(i)) ⊗I ⊗γ⊗l, l = 1,K −i−1, i< K −1, l MRi−1 Q = K−i−1 i,i+l  (D(χ(i))(1)− D(χ(i)))⊗I ⊗γ⊗l, l = K −i, i < K, r MRi−1  r=0 P i > 0.  10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.