Satisfying K-Anonymity: New Algorithm and Empirical Evaluation Romeo Issa Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the degree of Master of Computer Science Under the auspices of the Ottawa-Carleton Institute for Computer Science Ottawa, Ontario, Canada January 2009 © Romeo Issa, Ottawa, Canada, 2009 Abstract Nowadays, clinical institutions are increasingly asked to make their raw data electroni- cally available for research purposes. However, the same laws that prevent casual disclo- sure of such data have also made it difficult for researchers to access the information they need to conduct critical research. Therefore, several algorithms were developed with the purpose of making that information anonymous, hence readily available for researchers. In this thesis, we present the results of an empirical evaluation of algorithms that aim to achieve k-anonymity under global recoding and hierarchical generalization, namely, Datafly and Samarati’s algorithms. We conclude that on average the latter pro- duces better results, but neither produces an optimal solution. Next, we propose a new method to efficiently find the optimal solution, and we illustrate some programming op- timizations. Finally, we compare our approach from an efficiency perspective to Incog- nito, an efficient algorithm that finds the set of all possible solutions. iii Acknowledgment I would like to express my deep and sincere gratitude to my co-supervisors, Prof. Khaled El Emam, Prof. Daniel Amyot, and Prof. Jean-Pierre Corriveau. This thesis would not have been possible without their continuous help, guidance and support. This research was conducted at the Electronic Health Information Laboratory as part of the Collaborative Health Research Project on Performance Management at the Point of Care: Secure Data Delivery to Drive Clinical Decision Making Processes for Hospital Quality Control, funded by the Canadian Institutes of Health Research and the Natural Sciences and Engineering Research Council of Canada. The work was done mainly in collaboration with Dr. Khaled El Emam, with the much appreciated support of Dr. Fida Kamal Dankar. Furthermore, many thanks go to my co-workers and the people who provided the private data sets used in the experiments, namely: Elizabeth Jonker, Elise Cogo, Sadrul Chowdhury, Regis Vaillancourt, Tyson Roffey, Jim Bottomley and Mark Walker. My sincere thanks also go to the official referees, Dr. Anil Somayaji and Dr. Car- lisle Adams, for their detailed review and constructive criticism. Above all, I want to thank Mr. Peter and Mrs. Siham Irani, and their family, who welcomed me in their home during my stay in Canada so far, and who also provided the kind of unconditional love and support a person would only expect from his own parents and family. iv Table of Contents Abstract ............................................................................................................................. iii Acknowledgment .............................................................................................................. iv Table of Contents .............................................................................................................. v List of Tables ................................................................................................................... vii List of Figures ................................................................................................................. viii List of Acronyms ............................................................................................................... x Chapter 1. Introduction ................................................................................................... 1 1.1. Motivation ........................................................................................................... 1 1.2. Research Objective ............................................................................................. 3 1.3. Thesis Contribution ............................................................................................. 4 1.4. Thesis Structure .................................................................................................. 4 Chapter 2. Background .................................................................................................... 6 2.1. Preliminary Concepts and Definitions ................................................................ 6 2.2. Previous Work ................................................................................................... 12 2.2.1 Samarati’s algorithm ................................................................................................ 13 2.2.2 Datafly algorithm ..................................................................................................... 14 2.2.3 Visualization ............................................................................................................ 15 2.3. Chapter Summary ............................................................................................. 16 Chapter 3. Empirical Evaluation .................................................................................. 17 3.1. The Optimal Solution ........................................................................................ 17 3.2. Measuring Information Loss ............................................................................. 18 3.3. Data Sets ........................................................................................................... 21 3.4. Methodology ..................................................................................................... 22 3.5. Results ............................................................................................................... 22 3.6. Conclusion ........................................................................................................ 26 v Chapter 4. New Algorithm ............................................................................................. 27 4.1. Motivation ......................................................................................................... 27 4.2. Observations ..................................................................................................... 28 4.2.1 Prediction ................................................................................................................. 28 4.2.2 Candidates ................................................................................................................ 31 4.3. Limitation of Datafly and Samarati’s Algorithms ............................................ 33 4.4. New Approach ................................................................................................... 34 4.5. Efficiency ........................................................................................................... 42 4.6. Summary ........................................................................................................... 44 Chapter 5. Optimizations ............................................................................................... 45 5.1. Four Main Optimizations .................................................................................. 45 5.1.1 A: Numbers versus Strings ....................................................................................... 45 5.1.2 B: Flat hierarchies .................................................................................................... 47 5.1.3 C: Sorting always helps ............................................................................................ 48 5.1.4 D: Rollup .................................................................................................................. 49 5.2. Overall Time Complexity .................................................................................. 50 5.3. Summary ........................................................................................................... 53 Chapter 6. Efficiency ...................................................................................................... 54 6.1. Incognito ........................................................................................................... 54 6.2. Comparison ....................................................................................................... 55 6.3. Summary ........................................................................................................... 59 Chapter 7. Conclusions .................................................................................................. 60 7.1. Contributions .................................................................................................... 60 7.2. Future work ....................................................................................................... 61 References ........................................................................................................................ 62 Appendix A: Data Sets Details and Hierarchies .......................................................... 66 Appendix B: Additional Results .................................................................................... 77 MaxSup = 1% ............................................................................................................... 77 Empirical evaluation of Datafly and Samarati ...................................................................... 77 Efficiency related graphs of our approach ............................................................................ 81 Comparison with Incognito ................................................................................................... 83 MaxSup = 10% ............................................................................................................. 84 Empirical evaluation of Datafly and Samarati ...................................................................... 84 Efficiency related graphs of our approach ............................................................................ 88 Comparison with Incognito ................................................................................................... 90 vi List of Tables Table 1 De-identified private table (medical data) ...................................................... 2 Table 2 Non de-identified publicly available table ...................................................... 2 Table 3 De-identified table .......................................................................................... 8 Table 4 2-anonymized via local recoding .................................................................... 8 Table 5 2-anonymized via global recoding .................................................................. 8 Table 6 Global recoding with suppression ................................................................... 8 Table 7 Hierarchical generalization with regard to the vector [0,1,1] ....................... 10 Table 8 (a) is a data set, and (b) is its generalization with respect to [0,0,1] ............. 19 Table 9 Summary information of the data sets .......................................................... 21 Table 10 Auxiliary functions ....................................................................................... 36 Table 11 Pseudo code for getting the optimal solution candidates (GetOCS) ............. 37 Table 12 Lattice size of the data sets ........................................................................... 42 Table 13 Race - unique items ....................................................................................... 46 Table 14 Marital status - unique items ......................................................................... 46 Table 15 Original data.................................................................................................. 46 Table 16 Transformed data .......................................................................................... 46 Table 17 GH Array – Race – (GHR) ........................................................................... 47 Table 18 GH Array – Marital Status – (GHM) ........................................................... 47 Table 19 Original data.................................................................................................. 49 Table 20 The same data hashed and sorted .................................................................. 49 vii List of Figures Figure 1 GH for Marital Status...................................................................................... 7 Figure 2 GH for Race .................................................................................................... 7 Figure 3 GH for Age ..................................................................................................... 7 Figure 4 A lattice ......................................................................................................... 11 Figure 5 Visual comparison of Datafly and Samarati’s algorithms ............................ 15 Figure 6 Information loss comparison for Adult and CUP data sets ........................... 23 Figure 7 Information loss comparison for FARS and ED data sets ............................ 24 Figure 8 Information loss comparison for Pharm and Niday data sets ....................... 25 Figure 9 Lattice illustrating “Predictions” ................................................................... 29 Figure 10 Optimal solution candidates ...................................................................... 32 Figure 11 Getting OSC - step A ................................................................................ 38 Figure 12 Getting OSC - step B ................................................................................ 38 Figure 13 Getting OSC - step C ................................................................................ 39 Figure 14 Getting OSC - step D ................................................................................ 39 Figure 15 Getting OSC - step E................................................................................. 39 Figure 16 Getting OSC - step F ................................................................................. 39 Figure 17 Getting OSC - step G ................................................................................ 39 Figure 18 Getting OSC - step H ................................................................................ 39 Figure 19 Getting OSC - step I .................................................................................. 40 Figure 20 Getting OSC - step J ................................................................................. 40 Figure 21 Getting OSC - step K ................................................................................ 40 Figure 22 Getting OSC - step L................................................................................. 40 Figure 23 Getting OSC - step M ............................................................................... 40 Figure 24 Getting OSC - step N ................................................................................ 40 Figure 25 Getting OSC - step O ................................................................................ 41 Figure 26 Getting OSC - step P ................................................................................. 41 Figure 27 “Number of evaluations” to “lattice size” ratio ........................................ 43 Figure 28 “OSC” to “lattice size” ratio ..................................................................... 44 Figure 29 Hashed GH for Marital Status................................................................... 47 Figure 30 Hashed GH for Race. ................................................................................ 47 Figure 31 Execution time in seconds. Suppression limit of 5%. ............................... 51 Figure 32 Original search space size. ........................................................................ 56 Figure 33 Nodes evaluated ratio. ............................................................................... 56 Figure 34 Performance score of Incognito with respect to our approach. ................ 58 Figure 35 Size of solutions output by Incognito. ...................................................... 58 Figure 36 GH of Native-Country .............................................................................. 66 Figure 37 GH of Age ................................................................................................. 66 Figure 38 GH of Occupation ..................................................................................... 66 Figure 39 GH of Education ....................................................................................... 67 viii Figure 40 GH of Marital Status ................................................................................. 67 Figure 41 GH of Race ............................................................................................... 67 Figure 42 GH of Sex ................................................................................................. 67 Figure 43 GH of Work Class ..................................................................................... 67 Figure 44 GH of Age ................................................................................................. 68 Figure 45 GH of Income ........................................................................................... 68 Figure 46 GH of Postal Code .................................................................................... 68 Figure 47 GH of Gender ............................................................................................ 69 Figure 48 GH of Month of death ............................................................................... 70 Figure 49 GH of Day of death ................................................................................... 70 Figure 50 GH of Age ................................................................................................. 71 Figure 51 GH of Postal Code .................................................................................... 71 Figure 52 GH of Admission date .............................................................................. 72 Figure 53 GH of Admission date .............................................................................. 73 Figure 54 GH of Admission time .............................................................................. 73 Figure 55 GH of Postal Code .................................................................................... 74 Figure 56 GH of DOB ............................................................................................... 74 Figure 57 GH of Baby Sex ........................................................................................ 75 Figure 58 GH of Baby’s Date of Birth ...................................................................... 75 Figure 59 GH of Mother’s Date of Birth ................................................................... 76 Figure 60 Information loss comparison for Adult and CUP data sets. 1% ............... 78 Figure 61 Information loss comparison for FARS and ED data sets. 1% ................. 79 Figure 62 Information loss comparison for Pharm and Niday data sets. 1% ............ 80 Figure 63 “Number of evaluations” to “lattice size” ratio. MaxSup = 1% ............... 81 Figure 64 “OSC” to “lattice size” ratio. MaxSup = 1% ............................................ 81 Figure 65 Execution time in seconds. MaxSup = 1% ............................................... 82 Figure 66 Nodes evaluated ratio. MaxSup = 1% ...................................................... 83 Figure 67 Performance score of Incognito with respect to our approach ................. 83 Figure 68 Size of solutions output by Incognito. MaxSup = 1% .............................. 84 Figure 69 Information loss comparison for Adult and CUP data sets. 10% ............. 85 Figure 70 Information loss comparison for FARS and ED data sets. 10% ............... 86 Figure 71 Information loss comparison for Pharm and Niday data sets. 10% .......... 87 Figure 72 “Number of evaluations” to “lattice size” ratio. MaxSup = 10% ............. 88 Figure 73 “OSC” to “lattice size” ratio. MaxSup = 10% .......................................... 88 Figure 74 Execution time in seconds. MaxSup 10%................................................. 89 Figure 75 Nodes evaluated ratio. MaxSup = 10% .................................................... 90 Figure 76 Performance score of Incognito with respect to our approach ................. 90 Figure 77 Size of solutions output by Incognito. MaxSup = 10% ............................ 91 ix List of Acronyms Acronym Definition DM Discernability Metric DOB Date of Birth EC Equivalence classes FSA Forward Sortation Area GH Generalization Hierarchy HIPAA Health Insurance Portability and Accountability Act IL Information Loss k The anonymization level MaxSup Maximum Suppression allowed (or Suppression limit) NE Non-Uniform Entropy OSC Optimal Solution Candidates PHIPA Personal Health Information Protection Act PT Private Table QI Quasi-Identifier SDC Statistical Disclosure Control SSN Social Security Number UI Unique Items x
Description: