Information and Entropy Econometrics – Editor's View

Amos Golan*

Department of Economics, American University, Roper 200, 4400 Massachusetts Ave., NW, Washington, DC 20016, USA.

* Corresponding author. Tel.: 001-202-885-3783; Fax: 001-202-885-3790. E-mail address: [email protected] (A. Golan).

1. Introduction

Information and Entropy Econometrics (IEE) is research that directly or indirectly builds on the foundations of Information Theory (IT) and the principle of Maximum Entropy (ME). IEE includes research dealing with statistical inference of economic problems given incomplete knowledge or data, as well as research dealing with the analysis, diagnostics and statistical properties of information measures.

By understanding the evolution of ME, we can shed some light on the roots of IEE. The development of ME occurred via two lines of research:

i) The 18th century work (the principle of insufficient reason) of Jakob Bernoulli (published eight years after his death, in 1713),1 Bayes (1763) and Laplace (1774). They all investigated the basic problem of calculating the state of a system based on a limited number of expectation values (moments) represented by the data. This work was later generalized by Jeffreys (1939) and Cox (1946). This line of research is known as Statistical Inference.

ii) The 19th century work of Maxwell (1859, 1876) and Boltzmann (1871), continued by Gibbs (1902) and Shannon (1948). This work is geared toward developing the mathematical tools for statistical modeling of problems in mechanics, physics and information.

1 Note that Jakob Bernoulli is also known as Jacques and James Bernoulli.

The two independent lines of research are similar. The objective of the first line of research is to formulate a theory/methodology that allows understanding of the general characteristics (distribution) of a system from partial and incomplete information. In the second line of research, this same objective is expressed as determining how to assign (initial) numerical values of probabilities when only some (theoretical) limited global quantities of the investigated system are known. Recognizing the common basic objectives of these two lines of research aided Jaynes (1957) in the development of his classical work, the Maximum Entropy (ME) formalism. The ME formalism is based on the philosophy of the first line of research (Bernoulli, Bayes, Laplace, Jeffreys, Cox) and the mathematics of the second line of research (Maxwell, Boltzmann, Gibbs, Shannon).

The interrelationship between Information Theory (IT), statistics and inference, and the ME principle started to become clear in the early work of Kullback, Leibler and Lindley. Building on the basic concepts and properties of IT, Kullback and Leibler developed some of the fundamental statistics, such as sufficiency and efficiency, as well as a generalization of the Cramér-Rao inequality, and thus were able to unify heterogeneous statistical procedures via the concepts of IT (Kullback and Leibler 1951; Kullback 1954, 1959). Lindley (1956), on the other hand, provided the interpretation that a statistical sample could be viewed as a noisy channel (Shannon's terminology) that conveys a message about a parameter (or a set of parameters) with a certain prior distribution.
In that way, he was able to apply Shannon's ideas to statistical theory by referring to the information in an experiment rather than in a message.2

2 For a nice detailed discussion see Soofi (1994).

The interrelationship between Information Theory (IT), statistics and inference, and the ME principle may seem at first coincidental and of interest only in a small number of specialized applications. But by now it is clear that, when these methods are used in conjunction, they are useful for analyzing a wide variety of problems in most disciplines of science. Examples include (i) work on image reconstruction and spectral analysis in medicine, physics, chemistry, biology, topography, engineering, communication and information, operations research, political science and economics (e.g., brain scans, tomography, satellite images, search engines, political surveys, input-output reconstruction and general matrix balancing), (ii) research in statistical inference and estimation, and (iii) ongoing innovations in information processing and IT.

The basic research objective of how to formulate a theory/methodology that allows understanding of the general characteristics (distribution) of a system from partial and incomplete information has generated a wide variety of theoretical and empirical research. That objective may be couched in the terminology of statistical decision theory and inference, in which we have to decide on the "best" way of reconstructing an image (or a "message" in Shannon's work), making use of partial information about that image. Similarly, that objective may be couched within the more traditional terminology, where the basic question is how to recover the most conservative estimates of some unknown function from limited data. The classical ME is designed to handle such questions and is commonly used as a method of estimating a probability distribution from an insufficient number of moments representing the only available information.

IEE is a natural continuation of IT and ME. All of the studies in IEE (developed mostly during the 1990s) build on IT and/or ME to better understand the data while abstracting away from distributional assumptions or assumptions on the likelihood function. The outcome of these independent lines of study was a class of information-based estimation rules that differ from, but are related to, each other. All of these methods perform well and are quite applicable to large classes of problems in the natural sciences and social sciences in general, and in economics in particular.

The objectives of this volume are to gather a collection of articles from the wide spectrum of topics in IEE and to connect these papers and research together via their natural unified foundation: ME and IT. To achieve these objectives, the papers in this volume include summaries, reviews, state-of-the-art methods, as well as discussions of possible future research directions.

2. Brief Summary of Recent History

2.1. Information and Entropy - Background

Let $A = \{a_1, a_2, \ldots, a_M\}$ be a finite set and $p$ be a proper probability mass function on $A$. The amount of information needed to fully characterize all of the elements of this set consisting of $M$ discrete elements is defined by $I(A_M) = \log_2 M$ and is known as Hartley's formula.
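As a quick numerical illustration (the value $M = 8$ is our own choice, used only for concreteness), a set of eight equally likely elements carries

    I(A_8) = \log_2 8 = 3

units of information, so three binary (yes/no) questions are enough to single out any one element of the set.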
Shannon (1948) built on Hartley's formula, within the context of the communication process, to develop his information criterion. His criterion, called entropy,3 is

    H(p) \equiv -\sum_{i=1}^{M} p_i \log p_i                                   (2.1a)

with $x \log(x)$ tending to zero as $x$ tends to zero. This information criterion measures the uncertainty or informational content that is implied by $p$. The entropy-uncertainty measure $H(p)$ reaches a maximum when $p_1 = p_2 = \ldots = p_M = 1/M$ (and is then equal to Hartley's formula) and a minimum with a point mass function. It is emphasized here that $H(p)$ is a function of the probability distribution. For example, if $\eta$ is a random variable with possible distinct realizations $x_1, x_2, \ldots, x_M$ occurring with probabilities $p_1, p_2, \ldots, p_M$, the entropy $H(p)$ does not depend on the values $x_1, x_2, \ldots, x_M$ of $\eta$. If, on the other hand, $\eta$ is a continuous random variable, then the (differential) entropy of a continuous density is

    H(X) \equiv -\int p(x) \log p(x) \, dx                                     (2.1b)

where this differential entropy does not have all of the properties of the discrete entropy (2.1a). For a further detailed and clear discussion of the entropy concept and of Information Theory see Cover and Thomas (1991) and Soofi (1994).

3 In completing his work, Shannon noted that "information" was already an overused term. He approached his colleague John von Neumann, who responded: "You should call it entropy for two reasons: first, the function is already in use in thermodynamics under the same name; second, and more importantly, most people don't know what entropy really is, and if you use the word entropy in an argument you will win every time."

After Shannon introduced this measure, a fundamental question arose: whose information does this measure capture? Is it the information of the "sender", the "receiver" or the communication channel?4 To try to answer this question, let us first suppose that $H$ measures the state of ignorance of the receiver that is reduced by the receipt of the message. But this seemingly natural interpretation contradicts Shannon's idea. He used $H$ to measure the overall capacity required in a channel to transmit a certain message at a given rate. Therefore, $H$ is free of the receiver's level of ignorance. So what does it measure?

4 Within the context of IT, "channel" means any process capable of transmitting information.

One answer to this question is that $H$ is a measure of the amount of information in a message. To measure information, one must abstract away from any form or content of the message itself. For example, in the old-time telegraph office, where only the number of words was counted in calculating the price of a telegram, one's objective was to minimize the number of words in a message while conveying all necessary information. Likewise, the information in a message can be expressed as the number of signs (or distinct symbols) necessary to express that message in the most concise and efficient way. Any system of signs can be used, but the most reasonable one is to express the amount of information by the number of signs necessary to express it by zeros and ones. In that way, messages and data can be compared by their informational content. Each digit takes on the value 0 or 1, and the information specifying which of these two possibilities occurred is called a unit of information. The answer to a question that can only be answered by "yes" or "no" contains exactly one unit of information, regardless of the meaning of that question. This unit of information is called a "bit," or binary digit.5

5 Shannon's realization that binary digits could be used to represent words, sounds, images and ideas is based on the work of George Boole, the 19th-century British mathematician, who invented the two-symbol logic in his work "The Laws of Thought."
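To make (2.1a) concrete, the following minimal sketch (in Python; the function name shannon_entropy is ours, not taken from any particular package) computes the discrete entropy in bits and confirms the two boundary cases just described: a uniform p attains the maximum log2 M, i.e. Hartley's formula, while a point mass attains the minimum of zero.

```python
import numpy as np

def shannon_entropy(p, base=2.0):
    """Discrete Shannon entropy H(p) = -sum_i p_i log p_i, with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # drop zero cells so 0*log(0) contributes nothing
    return -np.sum(p * np.log(p)) / np.log(base)

M = 8
uniform = np.full(M, 1.0 / M)          # p_i = 1/M for all i
point_mass = np.zeros(M); point_mass[0] = 1.0

print(shannon_entropy(uniform))        # 3.0 = log2(8), Hartley's formula
print(shannon_entropy(point_mass))     # 0.0, no uncertainty
```

Measuring in base 2 is what ties H(p) to the bit count discussed above; using natural logarithms instead merely rescales the measure to nats.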
Further, Rényi (1961, 1970) showed that, for a (sufficiently often) repeated experiment, one needs on average $H(p) + \varepsilon$ zero-one symbols (for any positive $\varepsilon$) in order to characterize an outcome of that experiment. Thus, it seems logical to claim that the outcome of an experiment contains the amount of information $H(p)$.

The information discussed here is not the "subjective" information of a particular researcher. The information observed in a single observation, or in a data set, is a certain quantity that is independent of whether the observer (e.g., an economist or a computer) recognizes it or not. Thus, $H(p)$ is a measure of the average amount of information provided by an outcome of a random drawing governed by $p$. Similarly, $H(p)$ is a measure of uncertainty about a specific possible outcome before observing it, which is equivalent to the amount of randomness represented by $p$.

According to both Shannon and Jaynes (1957), $H$ measures the degree of ignorance of a communication engineer who designs the technical equipment of a communication channel, because it takes into account the set of all possible messages to be transmitted over this channel during its lifetime. In more common econometric terminology, we can think of $H$ in the following way. The researcher never knows the true underlying values characterizing an economic system. Therefore, one may incorporate her/his understanding and knowledge of the system in constructing the image, where this knowledge appears in terms of some global quantities, such as moments. Out of all possible images for which these moment conditions are retained, one should choose the image having the maximum level of entropy. The entropy of the analyzed economic system is a measure of the ignorance of the researcher who knows only some moments' values representing the underlying population. For a more detailed discussion of the statistical meaning of information see Rényi (1970) and Soofi and Retzer (this volume).

If, in addition, some prior information $q$, defined on $A$, exists, the cross-entropy (or Kullback-Leibler, K-L, 1951) measure is

    I(p;q) = \sum_{i=1}^{M} p_i \log (p_i / q_i)

where a uniform $q$ reduces $I(p;q)$ to $\log M - H(p)$, so that minimizing the cross-entropy under a uniform prior is equivalent to maximizing $H(p)$. This measure reflects the gain in information with respect to $A$ resulting from the additional knowledge in $p$ relative to $q$. Like $H(p)$, $I(p;q)$ is an information-theoretic measure; it can be read as a (directed) distance of $p$ from $q$. For example, if Ben believes the random drawing is governed by $q$ (for example, $q_i = 1/M$ for all $i = 1, 2, \ldots, M$) while Maureen knows the true probability $p$ (which is different from uniform), then $I(p;q)$ measures how much less informed Ben is relative to Maureen about the possible outcome. Similarly, $I(p;q)$ measures the gain in information when Ben learns that Maureen is correct: the true distribution is $p$, rather than $q$. Phrased differently, $I(p;q)$ may also represent a loss of information, such as Ben's loss when he uses $q$. For further discussion of this measure see Cover and Thomas (1991), Maasoumi (1993) and Soofi and Retzer (this volume).
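The cross-entropy measure can be illustrated numerically in the same way (again in Python; the helper name kl_divergence is ours). The sketch below reproduces the Ben and Maureen story: Ben holds the uniform q, Maureen knows the true p, and I(p;q) equals log2 M - H(p), the information Ben gains once he learns p.

```python
import numpy as np

def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler cross-entropy I(p;q) = sum_i p_i log(p_i/q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                        # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

M = 4
q = np.full(M, 1.0 / M)                 # Ben's uniform belief
p = np.array([0.7, 0.1, 0.1, 0.1])      # the true distribution known to Maureen

H_p = -np.sum(p * np.log2(p))
print(kl_divergence(p, q))              # about 0.64 bits
print(np.log2(M) - H_p)                 # the same number: log2(M) - H(p)
```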
2.2. Maximum Entropy - Background

Facing the fundamental question of drawing inferences from limited and insufficient data, Jaynes proposed the ME principle, which he viewed as a generalization of Bernoulli and Laplace's Principle of Insufficient Reason. Using the tools of the calculus of variations, the classical ME formulation can be summarized briefly.6 Given $T$ structural constraints in the form of moments of the data (distribution), Jaynes proposed the ME method, which is to maximize $H(p)$ subject to these $T$ structural constraints. Thus, if we have partial information in the form of some moment conditions $X_t$ ($t = 1, 2, \ldots, T$), where $T < M$, the ME principle prescribes choosing the $p(a_i)$ that maximizes $H(p)$ subject to the given constraints (moments) of the problem. These constraints can be viewed as certain "conservation laws" or "moment conditions" that represent the available information. His solution to this underdetermined problem is

    \hat{p}(a_i) \propto \exp\left\{ -\sum_t \hat{\lambda}_t X_t(a_i) \right\}                      (2.2)

where $\lambda$ are the $T$ Lagrange multipliers and $\hat{\lambda}$ are the values of the optimal solution (estimated values) of $\lambda$. Naturally, if no constraints (data) are imposed, $H(p)$ reaches its maximum value and the $p$'s are distributed uniformly.

6 ME is a standard variational problem. See, for example, Goldstine (1980) and Sagan (1993).

Specifically, if the available information is in the form of

    \sum_i p_i = 1  and  \sum_i p_i g_t(X_i) = E[g_t],  t = 1, 2, \ldots, T,                        (2.3)

where $E$ is the expectation operator and $g_0(X_i) \equiv 1$ for all $i$, then the least "informed" (prejudiced) proper distribution that obeys these $T + 1$ restrictions is

    \hat{p}_i = \exp\left\{ -\hat{\lambda}_0 - \hat{\lambda}_1 g_1(X_i) - \hat{\lambda}_2 g_2(X_i) - \cdots - \hat{\lambda}_T g_T(X_i) \right\}
              = \exp\left\{ -\sum_{t=0}^{T} \hat{\lambda}_t g_t(X_i) \right\}.                       (2.4)

The entropy level is

    H = \hat{\lambda}_0 + \sum_{t=1}^{T} \hat{\lambda}_t E[g_t(X_i)].                                (2.5)

The partition function (known also as the normalization factor or the potential function), $\lambda_0$, is defined as

    \lambda_0 = \log\left[ \sum_i \exp\left( -\sum_{t=1}^{T} \hat{\lambda}_t g_t(X_i) \right) \right]   (2.6)

and the relationship between the Lagrange multipliers and the data is given by

    -\frac{\partial \lambda_0}{\partial \lambda_t} = E[g_t]                                          (2.7)

while the higher moments are captured by

    \frac{\partial^2 \lambda_0}{\partial \lambda_t^2} = \mathrm{Var}(g_t)  and  \frac{\partial^2 \lambda_0}{\partial \lambda_t \partial \lambda_s} = \mathrm{Cov}(g_t, g_s).   (2.8)

With that basic formulation, Jaynes was able to "resolve" the debate on probabilities vs. frequencies by defining the notion of probabilities via Shannon's entropy measure. His principle states that in any inference problem, the probabilities should be assigned by the ME principle, which maximizes the entropy subject to the requirement of proper probabilities and any other available information.7

7 In the fields of economics and econometrics, it was probably Davis (1941) who conducted the first work within the spirit of ME. He conducted this work before the work of Shannon and Jaynes, and therefore he did not use the terminology of IT/ME. In his work, he estimated the income distribution by (implicitly) maximizing Stirling's approximation of the multiplicity factor subject to some basic requirements/rules. A nice discussion of his work, and of the earlier applications and empirical work in IEE in general and ME in particular in economics/econometrics, appears in Zellner (1991) and Maasoumi (1993). For recent theoretical and applied work, see the papers and citations provided in the current volume.

Prior knowledge can be incorporated into the ME framework by minimizing the cross-entropy, rather than maximizing the entropy, subject to the observed moments. If $\tilde{p}$ is the solution to such an optimization problem, then it can be shown that $I(p;q) = I(p;\tilde{p}) + I(\tilde{p};q)$ for any $p$ satisfying the set of constraints (2.3). This is the analogue of the Pythagorean Theorem in Euclidean geometry, with $I(p;q)$ playing the role of the squared Euclidean distance.
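To see the machinery of (2.3)-(2.7) at work, the sketch below solves Jaynes' classic die example numerically (a minimal Python illustration of our own, not code from any of the papers discussed): maximize H(p) over the six faces subject to a single mean constraint E[x] = 4.5, by searching for the one Lagrange multiplier at which the first-derivative condition (2.7) holds.

```python
import numpy as np
from scipy.optimize import brentq

# Jaynes' die example: support x = 1,...,6 and one moment constraint E[x] = 4.5.
x = np.arange(1, 7, dtype=float)
target_mean = 4.5

def implied_mean(lam):
    """Mean of the ME distribution with p_i proportional to exp(-lam * x_i), as in (2.4)."""
    w = np.exp(-lam * x)
    p = w / w.sum()                  # normalizing by the partition function, cf. (2.6)
    return p @ x

# Solve the first-order condition -d(lambda_0)/d(lambda) = E[x], cf. (2.7),
# for the single multiplier by one-dimensional root finding.
lam_hat = brentq(lambda lam: implied_mean(lam) - target_mean, -5.0, 5.0)

w = np.exp(-lam_hat * x)
p_hat = w / w.sum()

print(round(lam_hat, 3))             # about -0.371; the negative sign tilts mass toward high faces
print(np.round(p_hat, 3))            # approx [0.054 0.079 0.114 0.165 0.240 0.347]
print(p_hat @ x)                     # 4.5, the imposed moment is reproduced
```

With several moment constraints the same dual logic applies: one searches over T multipliers, and the second derivatives of the partition function deliver the variances and covariances in (2.8).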
There exists an important interpretation of (2.4)-(2.6) within the context of Bayes' theorem. The exact connection between ME, information and Bayes' theorem is developed in Zellner (1988), is discussed in the various papers of Jaynes, and is generalized in this volume (Zellner, 2001).

Finally, one cannot ignore two basic questions that keep coming up: is the ME principle "too simple," and does the ME principle "produce something from nothing"? The answer to both is contained in the simple explanation that this principle uses only the relevant information and eliminates all irrelevant details from the calculations by averaging over them.