
DETECTING CLASS-INDEPENDENT LINEAR RELATIONSHIPS WITHIN AN ARBITRARY SET OF FEATURES

Ashwin Sarma
Naval Undersea Warfare Center, Newport, RI 02841

(Presented at the MTS/IEEE Oceans 2007 Conference, Vancouver, Canada, Sep. 29 - Oct. 4, 2007. This work was supported by the Office of Naval Research, Award Number N0001404WX20326, and a NUWC in-house research grant.)

ABSTRACT

Classifiers for surveillance sonar systems are often designed to operate on large sets of predefined clues, or features. Sometimes the mathematical definitions for these features are poorly known. Other times the designer is not aware that a fixed and class-independent linear (or affine) relationship exists between subsets of features. We discuss a method based on Gram-Schmidt orthogonalization which allows the classifier designer to determine whether subsets of features have such relationships. Certain features can then be shown to be unnecessary by application of Wozencraft and Jacobs' "Theorem of Irrelevance". An approach is also described to rank features to aid in the selection of an effective subset.

I. INTRODUCTION

The design of classifiers for a surveillance sonar system is often based on a pre-existing set of measurements or "features" considered useful for discrimination. This set can be large and usually includes features that are simple functions of basic physical measurements of the objects to be classified. Features are often defined via linear functions of fundamental measurements or of other, more basic features. However, these functional relationships are sometimes unknown to the designer or difficult to sort out.

Denote as Z the raw measurements collected from an object to be classified. The features contained in the vector f(Z) are said to be affinely dependent on another set of features contained in the vector g(Z) if f(Z) = b + A g(Z), ∀Z. In other words, there exist a fixed vector b and a fixed matrix A such that f(Z) is given by the above relationship regardless of the object type (class). Examples include:

(1) Suppose g(Z) consists of a single feature such as an estimate of the angular width of an object, and f(Z) is the cross-range extent measured by a sonar system for that object at a fixed (non-random) range R; then g(Z) = Δθ and f(Z) = RΔθ. Certain classes of objects may possess greater angular widths than others. However, feature f(Z) is completely defined by g(Z).

(2) Suppose g(Z) and f(Z) are the estimated probabilities of an object undergoing acceleration or maintaining constant velocity, respectively, at time t. Such estimates are provided by tracking algorithms commonly used in sonar. Note that g(Z) + f(Z) = 1. It may be that certain classes of objects perform more maneuvers than others and will exhibit this through higher values of g(Z) and lower values of f(Z). However, f(Z) is completely defined by g(Z).

In such cases the dependence may have been known to the classifier designer, but in real cases with large feature sets the dependence may be unknown or difficult to determine. Furthermore, in real problems, more complicated examples of affine dependence can and do arise in obscure and inadvertent ways.

We are interested in detecting whether any such affine relationships are present among the features. In this short paper we describe a nonparametric method for detecting affinely dependent features that is based on the linear algebraic structure of the feature space and is independent of the underlying distributions as well as of the separability among classes. Removal of such dependent features is justified through application of Wozencraft and Jacobs' "Theorem of Irrelevance" [1]. Removal of such features is critical, as they add no additional information and unnecessarily increase the dimensionality of the feature space.

"Approximate" affine dependence can also be identified, and a ranking of the features is possible. The pre-processing advocated herein is a simple but often overlooked first step in the design of sonar classifiers. The method has proved useful for detecting and removing affinely dependent features present in Navy feature databases.
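As a concrete (synthetic) illustration of how such relationships can hide inside a feature table, the sketch below builds a toy feature matrix in the spirit of examples (1) and (2). The numbers and the use of Python/NumPy are illustrative assumptions, not taken from the paper; the QR-based procedure of Section III flags columns 1 and 3 (cross-range extent and the constant-velocity probability) as removable.

    import numpy as np

    rng = np.random.default_rng(0)
    n_per_class = 200
    R = 1000.0                                 # fixed (non-random) range, meters

    rows = []
    for cls, width_mean in enumerate([0.010, 0.025]):      # two object classes
        delta_theta = rng.normal(width_mean, 0.002, n_per_class)  # g(Z): angular width
        cross_range = R * delta_theta                              # f(Z) = R * delta_theta
        p_accel = rng.uniform(0.1, 0.4 if cls == 0 else 0.9, n_per_class)  # P(accelerating)
        p_const = 1.0 - p_accel                                    # P(constant velocity)
        rows.append(np.column_stack([delta_theta, cross_range, p_accel, p_const]))

    X = np.vstack(rows)   # N x D feature matrix; columns 1 and 3 are affine
                          # functions of columns 0 and 2 for every class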
II. "THEOREM OF IRRELEVANCE"

Denote the class-conditional pdf for class i as p_r(ρ|i), where r is the random vector of features and ρ is a particular realization of r. Decompose r as r = [r1 r2]^T, and denote ρ1 and ρ2 as the corresponding realizations of r1 and r2. We observe that p_r(ρ|i) = p_{r1}(ρ1|i) p_{r2}(ρ2|i, r1 = ρ1) by Bayes' rule. Wozencraft and Jacobs' "Theorem of Irrelevance" states that the optimum classifier/dichotomizer may disregard the vector r2 if and only if

    p_{r2}(ρ2 | i, r1 = ρ1) = p_{r2}(ρ2 | r1 = ρ1)    (1)

In other words, r2 conditioned on r1 must be statistically independent of i for r2 to be declared "irrelevant". Another useful interpretation is that satisfaction of this requirement implies that r1 is a sufficient statistic for estimating the pdfs p_r(ρ|i), ∀i, although not necessarily a minimal one [2], [3], [4].
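As a concrete reading of eqn. (1), not worked explicitly in the text: suppose r2 = A r1 + b exactly, with the same A and b for every class, as in the examples of Section I. Given r1 = ρ1, the value of r2 is fixed at Aρ1 + b, so p_{r2}(ρ2 | i, r1 = ρ1) is a unit point mass at Aρ1 + b and does not depend on i. Both sides of eqn. (1) then coincide, the condition is satisfied trivially, and r2 may be discarded without loss. This is exactly the situation established for the QR-flagged features in Section III.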
III. DETECTING AFFINELY DEPENDENT FEATURES

The approach involves construction of a data matrix X in which all available data are included. If C classes are to be distinguished and N_i (1 × D) real-valued vector measurements (i.e. feature vectors) are available for i = 1, ..., C, then X is an N × D data matrix in which each row is a feature vector and N = Σ_{i=1}^{C} N_i. It is assumed that N > D, a reasonable assumption as there are commonly more measurements than features. An unsupervised situation in which N unlabeled feature vectors are presented can also be considered.

Denote the D features by f_1, f_2, ..., f_D. We are interested in affine dependence of the form:

    f_k = Σ_{j=1, j≠k}^{D} α_j^k f_j + b_k    (2)

where b_k need not equal zero. Note that the linear algebraic concept of linear dependence among a set of vectors does not accommodate affine dependence. Thus a transformation such as standardization [5] must be performed before standard linear algebraic techniques can be applied. This amounts to estimating the sample mean µ̂ and sample standard deviation σ̂ of each column of X and transforming every element x in that column according to x = (x − µ̂)/σ̂. Denote X after standardization as X̃. The transformed features are denoted f̃_1, f̃_2, ..., f̃_D. Any affine dependence (b_k ≠ 0 in eqn. (2)) is eliminated, i.e.

    f̃_k = Σ_{j=1, j≠k}^{D} α̃_j^k f̃_j    (3)

(see appendix). It is clear that the elements of X̃ will be approximately zero mean with unit variance. Thus standardization has the added effect of controlling the dynamic range of the elements of X and numerically stabilizing Gram-Schmidt orthogonalization.

Gram-Schmidt orthogonalization can be performed on the columns of X̃ via QR matrix decomposition, i.e. X̃ = QR. The linearly dependent column vectors in X̃ are marked by zeros along the diagonal of R.

If R_kk is the leftmost zero element along the diagonal of R, then x̃_k is linearly dependent on the vectors x̃_1, x̃_2, ..., x̃_{k−1} and there exists a set of coefficients α̃_1^k, α̃_2^k, ..., α̃_{k−1}^k such that x̃_k can be expressed as:

    x̃_k = α̃_1^k x̃_1 + α̃_2^k x̃_2 + ... + α̃_{k−1}^k x̃_{k−1}    (4)

Note that at least one (but not necessarily all) of the α̃^k's must be nonzero. Since this dependence is enforced over all C classes and for all N measurements, the probability that such a dependence manifested due to chance is effectively zero. It is worth mentioning that the dependence is not linked to the manner in which X was populated. Specifically, a permutation of the rows of X (equivalently X̃) will not alter the linear dependence. This can be seen by noting that pre-multiplying x̃_k in (4) by a permutation matrix results in

    ũ_k = α̃_1^k ũ_1 + α̃_2^k ũ_2 + ... + α̃_{k−1}^k ũ_{k−1}    (5)

where ũ_j, j = 1, ..., D, is the permuted x̃_j.

Thus feature f_k must be expressible as

    f_k = α_1^k f_1 + α_2^k f_2 + ... + α_{k−1}^k f_{k−1} + b_k    (6)

or

    f_k = Σ_{j=1}^{D} α_j^k f_j + b_k    (7)

where α_j^k must equal 0 for j ≥ k. If R_mm (k < m ≤ D) is the next zero element along the diagonal of R, then x̃_m is linearly dependent on the vectors x̃_1, x̃_2, ..., x̃_{m−1} and there exists another set of coefficients, α̃_1^m, α̃_2^m, ..., α̃_{m−1}^m, such that x̃_m can be expressed as:

    x̃_m = α̃_1^m x̃_1 + α̃_2^m x̃_2 + ... + α̃_k^m x̃_k + ... + α̃_{m−1}^m x̃_{m−1}    (8)

Once again, not all of the α̃^m's need be nonzero. Furthermore, since x̃_k is linearly dependent on x̃_1, x̃_2, ..., x̃_{k−1}, α̃_k^m can be set equal to zero. Thus f_m can be expressed as

    f_m = Σ_{j=1}^{D} α_j^m f_j + b_m    (9)

where α_j^m must equal zero for j = k as well as for j ≥ m.

Therefore the set of features r2 for which R_jj = 0 are affinely dependent on the remainder of the features, denoted r1. From eqns. (7) and (9) we can see that ρ2 is completely determined by ρ1 independent of class i. Thus eqn. (1) is satisfied and r2 is irrelevant. It is worth noting that if N = D and X is full rank, then X̃ has a rank of D − 1. Thus this process can be successfully applied only if N is strictly greater than D.

Even if an exact affine dependence existed, errors in machine calculation of R can lead to non-zero R_jj. Higham [6, pp. 24, 122] has shown that performing the QR decomposition via a sequence of Givens rotations leads to an ultimate relative error that is acceptable and on the level of machine precision u. Here the relative error, defined as ||R − R̂||_2 / ||X̃||_2 (= ||R − R̂||_F / ||X̃||_F), is the normalized difference between the exact R and R̂, the matrix estimated by the actual decomposition. Note that

    ||R − R̂||_F / ||X̃||_F ≈ D ε / √(ND) = √(D/N) ε = u    (10)

where ε is the average value of |R_ij − R̂_ij| (i, j = 1, ..., D). Thus if N ≈ 100D, R_jj can be compared to a threshold of ten times machine precision to detect linear dependence.

If certain features are preferred by the designer (e.g., considered more intuitive or powerful), reordering can be performed such that these populate the leftmost columns of X. This step increases the likelihood that such a feature will be retained in the event that it is affinely dependent on other features.
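The exact-dependence test above reduces to a few lines of linear algebra. The following is a minimal sketch in Python/NumPy; the paper does not prescribe an implementation, and NumPy's qr uses Householder reflections rather than the Givens sequence analyzed by Higham, so the threshold here additionally scales the paper's ten-times-machine-precision heuristic by the largest diagonal entry of R (a standard rank-tolerance convention).

    import numpy as np

    def detect_affine_dependencies(X, tol_factor=10.0):
        """Flag columns of X that are affine functions of the columns to their
        left, via standardization followed by QR decomposition (Section III).

        X : (N, D) array with one feature vector per row, N > D assumed.
        Returns the indices k for which |R_kk| falls below the threshold.
        """
        X = np.asarray(X, dtype=float)
        # Standardize each column: subtract the sample mean and divide by the
        # sample standard deviation (the 1/N form used in the appendix).
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0.0] = 1.0          # constant columns become all zeros
        X_tilde = (X - mu) / sigma
        # Gram-Schmidt orthogonalization of the columns via QR decomposition.
        _, R = np.linalg.qr(X_tilde, mode="reduced")
        diag = np.abs(np.diag(R))
        # Threshold: ten times machine precision (the paper's heuristic for
        # N ~ 100*D), scaled here by the largest diagonal entry for robustness.
        threshold = tol_factor * np.finfo(float).eps * max(diag.max(), 1.0)
        return np.where(diag < threshold)[0]

    # Example: a hidden affine relation in a made-up 3-feature table.
    # rng = np.random.default_rng(0)
    # F = rng.normal(size=(500, 3))
    # F[:, 2] = 2.0 * F[:, 0] - 0.5 * F[:, 1] + 7.0
    # detect_affine_dependencies(F)          # -> array([2])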
IV. "APPROXIMATELY" AFFINELY DEPENDENT FEATURES

At this point let us assume that features that are exactly affinely dependent have been detected and removed. Operating on the resultant set, the process of Gram-Schmidt orthogonalization of x̃_k amounts to:

(1) Determining the least-squares fit of x̃_k in the subspace spanned by {x̃_1, ..., x̃_{k−1}}. This is equivalent to determining the projection of x̃_k onto this subspace. We denote this projection as P x̃_k, where P is the projection matrix.

(2) Subtracting the result from x̃_k to form the orthogonal error vector e. Thus x̃_k = P x̃_k + e. The norm of the error vector e, ||e||, is minimized in this process [3, p. 365] but is never zero.

Again we observe that permuting the rows of X (equivalently X̃) only permutes the elements of e but leaves ||e|| unchanged.

The relationship of Gram-Schmidt orthogonalization to least-squares estimation is important in that, for N > D, we can interpret feature f_k as the sum of A) a fixed (and class-independent) linear function of the previous features and B) a residual. It is possible that some (or all) of the previous features may be useful for classification. It is also possible that the least-squares estimate of f_k in terms of the previous features is useful as well. However, this information is implicit in the previous features. Thus it is sufficient to determine the added information captured in the residuals. This amounts to a statistical test¹ comparing the scalar values in e corresponding to class i with those corresponding to class j. Nonparametric tests such as a Chi-squared test (or Kolmogorov-Smirnov test) can be used to compare two samples. Specifically, the p-value of the test statistic can be returned. The p-value is the probability of the event that a value of the test statistic greater than or equal to that observed occurs when both samples have a common probability distribution. A standard critical level of significance is 0.05. Thus if the p-value is less than the critical level, we can safely conclude that the distributions governing the two samples differ. A difference between the distributions indicates that the feature may prove useful for classification. It must be stressed that if there are only a few measurements per class, a large p-value need not indicate distributional similarity. However, if there are enough measurements that p-values lower than the critical level are at least possible, then if the p-value is greater than the critical level we may conclude that f_k is "approximately" affinely dependent on the previous features. Specifically, eqn. (1) with r2 = f_k and r1 = [f_1, f_2, ..., f_{k−1}]^T is satisfied at the reported p-value. Graphical methods such as quantile-quantile plots can be used to corroborate the conclusions.

¹ Or a set of pairwise tests when more than two classes are considered.

The process of feature subset selection involves considering various subsets and, via a pre-chosen separability measure (a.k.a. Filter Method) or a classifier architecture (a.k.a. Wrapper Method), ranking the quality of each subset [7], [8]. As testing every possible subset is usually prohibitive, alternatives that consider only specific subsets are almost always applied. One such approach is as follows. Features can be ranked according to p-value. If it is required that a feature must be discarded, the feature with the largest p-value can be selected. The entire process of Gram-Schmidt orthogonalization followed by statistical testing is then repeated if further reduction is required.
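The ranking step can be sketched in the same way; here SciPy's two-sample Kolmogorov-Smirnov test stands in for the (unspecified) nonparametric test, the Gram-Schmidt residual is obtained by least squares against the preceding columns, and, unlike the full procedure described above, the p-values are computed in a single pass rather than by re-running the orthogonalization after each removal.

    import numpy as np
    from scipy.stats import ks_2samp

    def residual_p_value(X_tilde, labels, k):
        """p-value of a two-sample KS test on the Gram-Schmidt residual of
        standardized column k against columns 0..k-1 (Section IV).

        X_tilde : (N, D) standardized data matrix
        labels  : length-N array of class labels (two classes assumed here;
                  more classes would need pairwise tests, per the footnote)
        k       : index of the column being examined (k >= 1)
        """
        labels = np.asarray(labels)
        # Least-squares fit of column k on the preceding columns; the residual
        # e is the component of x~_k orthogonal to span{x~_0, ..., x~_{k-1}}.
        A = X_tilde[:, :k]
        coeffs, *_ = np.linalg.lstsq(A, X_tilde[:, k], rcond=None)
        e = X_tilde[:, k] - A @ coeffs
        classes = np.unique(labels)
        # A large p-value suggests the residual carries no class information,
        # i.e. f_k is "approximately" affinely dependent on earlier features.
        return ks_2samp(e[labels == classes[0]], e[labels == classes[1]]).pvalue

    def rank_features_by_p_value(X_tilde, labels):
        """Rank columns 1..D-1 by descending p-value; the first entry is the
        leading candidate for removal."""
        D = X_tilde.shape[1]
        pvals = {k: residual_p_value(X_tilde, labels, k) for k in range(1, D)}
        return sorted(pvals.items(), key=lambda kv: kv[1], reverse=True)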
V. SUMMARY

Methodologies are provided to detect exact and "approximate" affine relationships within a set of features. They provide insight into feature structure and aid in feature subset selection.

VI. APPENDIX

Assume the affine dependence of eqn. (11) for feature f_k:

    f_k = Σ_{j=1, j≠k}^{D} α_j^k f_j + b_k    (11)

This dependence must be enforced regardless of class. Thus

    x_k = Σ_{j=1, j≠k}^{D} α_j^k x_j + b_k 1_N    (12)

where x_j is the jth column of X (j = 1, ..., D) and 1_N is the length-N vector [1 1 ... 1]^T.

Denote x_j^T 1_N, the sum of the elements in column j, as S_j, and denote the columns of X after standardization as x̃_j. Thus x̃_j equals

    x̃_j = (x_j − µ̂_j 1_N) / σ̂_j    (13)

where µ̂_j = S_j / N and σ̂_j² = (x_j − µ̂_j 1_N)^T (x_j − µ̂_j 1_N) / N. Rewriting eqn. (12) in terms of x̃_j results in:

    σ̂_k x̃_k + µ̂_k 1_N = Σ_{j=1, j≠k}^{D} α_j^k (σ̂_j x̃_j + µ̂_j 1_N) + b_k 1_N    (14)

Subtracting µ̂_k 1_N from both sides of (14) yields:

    σ̂_k x̃_k = Σ_{j=1, j≠k}^{D} α_j^k σ̂_j x̃_j + Γ,  where  Γ = Σ_{j=1, j≠k}^{D} α_j^k µ̂_j 1_N + b_k 1_N − µ̂_k 1_N

It is clear that the vector Σ_{j=1, j≠k}^{D} α_j^k µ̂_j 1_N + b_k 1_N and the vector µ̂_k 1_N can be represented as c 1_N and d 1_N respectively, so that Γ = (c − d) 1_N. Pre-multiplying both Γ and eqn. (12) by 1_N^T reveals that c = d, i.e. Γ = 0. Thus

    x̃_k = Σ_{j=1, j≠k}^{D} α̃_j^k x̃_j    (15)

where α̃_j^k equals α_j^k σ̂_j / σ̂_k.
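A quick numerical check of eqn. (15), assuming NumPy (not part of the paper): build a column with a nonzero offset b_k, standardize, and confirm that the standardized column lies in the span of the other standardized columns.

    import numpy as np

    # Column 2 is affinely dependent on columns 0 and 1 with a nonzero offset
    # b_k = 5 (made-up coefficients).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    X[:, 2] = 0.7 * X[:, 0] - 1.3 * X[:, 1] + 5.0

    # Standardize each column (subtract the mean, divide by the 1/N standard
    # deviation, as in eqn. (13)).
    X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)

    # Least-squares residual of standardized column 2 against columns 0 and 1.
    coeffs, *_ = np.linalg.lstsq(X_tilde[:, :2], X_tilde[:, 2], rcond=None)
    residual = X_tilde[:, 2] - X_tilde[:, :2] @ coeffs

    # The offset has been absorbed: the residual is on the order of machine
    # precision, i.e. x~_2 is a purely linear combination of x~_0 and x~_1,
    # as eqn. (15) asserts.
    print(np.max(np.abs(residual)))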
VII. REFERENCES

[1] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering, John Wiley and Sons, New York, 1965.
[2] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.
[3] L. L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Addison-Wesley, Reading, MA, 1991.
[4] E. C. Real, "Feature extraction and sufficient statistics in detection and classification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley and Sons, New York, 2001.
[6] N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, 1996.
[7] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[8] K. Z. Mao, "Orthogonal forward selection and backward elimination algorithms for feature subset selection," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 34, no. 1, pp. 629-634, February 2004.
