Habits vs Environment: What Really Causes Asthma?

Mengfan Tang, Department of Computer Science, University of California, Irvine, [email protected]
Pranav Agrawal, Department of Computer Science, University of California, Irvine, [email protected]
Ramesh Jain, Department of Computer Science, University of California, Irvine, [email protected]

ABSTRACT
Despite a considerable number of studies on risk factors for asthma onset, very little is known about their relative importance. To obtain a full picture of these factors, both categories of data, personal and environmental, have to be taken into account simultaneously, which previous studies have not done. We propose a framework to rank the risk factors from heterogeneous data sources of the two categories. Established on top of EventShop and Personal EventShop, this framework extracts about 400 features and analyzes them by employing a gradient boosting tree. The features come from sources including personal profile and life-event data, and environmental data on air pollution, weather and PM2.5 emission sources. The top ranked risk factors derived from our framework agree well with the general medical consensus. Thus, our framework is a reliable approach, and the discovered rankings of relative importance of risk factors can provide insights for the prevention of asthma.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures; J.3 [Life and Medical Sciences]: Health

General Terms
Experimentation, Human Factors

Keywords
Asthma, Feature extraction, Asthma risk analysis, Gradient Boosting Tree

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WebSci '15, June 28-July 01, 2015, Oxford, United Kingdom
© 2015 ACM 978-1-4503-3672-7/15/06 $15.00.
DOI: http://dx.doi.org/10.1145/2786451.2786481
arXiv:1601.05141v1 [cs.CY] 20 Jan 2016

1. INTRODUCTION
Although asthma is a potentially life threatening lung disease and has been studied for a long time, its causes are still unclear. With the increasing amount of data, such as Electronic Health Records, social media, environmental sensory data, and smart-phone and wearable sensor data, a data driven solution holds promise for helping solve this problem [12]. Accordingly, many studies have been performed using various forms of data. [11] predicted asthma attacks by considering bio-signals of patients and environmental data. [10] studied the association between the spatial distribution of allergy prevalence and air pollutants such as PM2.5, as well as living distance from points of interest such as parks and roads. [3] detected asthma risk from personal profiles. [5] designed tools for discovering dynamic changes in body sensor network data streams of asthma patients. [1] studied occupation as a factor in asthma. [6] examined risk factors associated with air pollution and weather, and [7] revealed the impact of lifestyle and behavior on asthma vulnerability.

However, these studies focus on either environmental factors or personal factors, without a comprehensive study covering both. Asthma patients vary in sensitivity to different environmental and personal factors, and to their interactions. A data driven solution can convert personal data and environmental data into information and insights for discovering comprehensive risk factors, as well as aid in the understanding of asthma.

To identify these risk factors within integrated diverse data sources, new approaches are required. We propose a framework on top of Personal EventShop [8] and EventShop [4] to solve this problem, because of their capability of integrating and analyzing heterogeneous data sources. EventShop is a generic infrastructure for application developers to analyze varied spatio-temporal data streams. Personal EventShop is a unified framework for aggregating personal data streams to analyze life events and personal situations. Details of this framework are described in the next section.

2. FRAMEWORK
The proposed framework is based on Personal EventShop and EventShop. The three main parts of the framework are briefly discussed.

Data ingestion
Personal EventShop receives data from wearable sensors, mobile apps like calendar, and smart-phone sensors like accelerometers and GPS. It correlates these data streams to determine the user's activity levels and life events such as exercising, sleeping, and working. This chronicle of life events, called 'Personicle', is stored in a database with attributes of user-id, location, time and life events. It also collects data from global sources like EventShop, which can provide aggregated environmental data from different sensors. In the context of this paper, EventShop can provide interpolated PM2.5 values inferred from sources like air pollution monitoring sites, satellite imagery and traffic [14]. All of this data is collected and sent for aggregation and matching.

Data aggregation
Ingested data has a wide range of granularities in both space and time. For example, in the context of this paper, emission inventory data was reported at county level once every three years; activity data was reported over a period of several years and at varying levels of spatial granularity; and pollution data was collected at station locations and reported as daily values. Personal EventShop takes in all this data and matches it to a common spatio-temporal scale. This data is sent to the Analysis Engine.

Analysis Engine
This part uses the aggregated data and performs the desired analyses on it. For example, these analyses can be co-occurrence pattern detection, personal situation recognition or life event detection. In this case, the aggregated data from the sources described in Section 3 are transformed to features as described in Section 4. The features with the highest discriminatory power are detected using the model described in Section 5 and then used to rank the asthma risk factors.

3. DATA
Our asthma risk analysis approach requires various personal and environmental features at the same granularity. However, the data is at different resolutions, such as point data and region data. Data sources are described in this section, and the methodology to obtain the features from the data is discussed in the section that follows.

Consolidated Human Activity Data
Consolidated Human Activity Database Master (CHAD) [13] is a collection of profile and activity information from 22 studies, and contains data for over 54,000 days combined from over 700 counties across the United States. Each personal profile contains information like zip-code, county, age, gender, race, history of asthma, cardiovascular illness, whether the person is a smoker or lives with a smoker, employment, education level, and income. Every person has a set of diary entries. Each diary entry is an activity record containing the activity (like walking, exercising, sleeping, leisure), the location (home, office, traveling), and the start and end times. Every diary entry also contains flags showing whether the subject was breathing heavily and whether the subject was smoking during each activity.

Emission Factors
The National Emission Inventory data was collected in 2011 by the EPA. It is an estimate of the PM2.5 contributions of all the sources of air pollution within the United States, in metric tonnes/year. Sources are categorized into classes like Agricultural, Industrial, Dust, Fuel, and Mobile.

Table 2: Emission Factors
Category        Emission Factor
Mobile          Aircraft, Marine Vessels, Locomotives, Equipment, Heavy Duty Vehicles, Light Duty Vehicles
Industrial      Agricultural, Mining, Oil & Gas Production, Storage & Transportation, Other
Dust            Construction, Paved Road Dust, Unpaved Road Dust
Fires           Agricultural Field Burning, Prescribed Fires, Wildfires
Fuel            Biomass, Coal, Natural Gas, Oil, Residential Wood, Other
Miscellaneous   Waste Disposal, Agriculture, Commercial Cooking

Historical Air Pollution Data
Historical air pollution data is the actual concentration of pollutants in the environment. This data is available as daily average values recorded at each of the individual monitoring stations across the United States. The stations also record the daily average weather conditions: temperature, pressure, wind speed, and wind direction.

In the next section, we describe the features extracted from these data sources, and how they are used.

Figure 1: 1(a)-1(c) Maps showing expected county level emissions from agriculture, road dust, and mining respectively. 1(d)-1(f) Maps showing average maximum, mean, and minimum PM2.5 concentration values for the month of June, for counties with data in CHAD.

4. FEATURE AND DATA ANALYSIS

4.1 Personal Life Events Features: $F_P$
We use the profile information provided in CHAD for people who had location information at the county level and had answered the question of whether they had asthma or not. The number of such people was 11,000, with 24,000 days of activity data. Of these, around 950 people reported having asthma, which is close to the asthma prevalence rate of 8.7% in the United States [9]. For analysis, we use the asthmatic subjects and an equal number of non-asthmatic subjects. The non-asthmatic subjects are randomly chosen to balance the data set.

Features extracted and their justifications, as mentioned in [9], are described in Table 1.
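The balancing step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the record layout (dictionaries with an `asthma` flag), the function name `balance_by_undersampling`, and the fixed random seed are assumptions made for the example, and the cohort sizes are the rounded figures quoted in the text.

```python
import random


def balance_by_undersampling(subjects, seed=0):
    """Keep every asthmatic subject plus an equal-sized random sample of the rest.

    `subjects` is a list of dicts with a boolean "asthma" key (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    positives = [s for s in subjects if s["asthma"]]
    negatives = [s for s in subjects if not s["asthma"]]
    # Randomly choose as many non-asthmatic subjects as there are asthmatic ones.
    sampled = rng.sample(negatives, k=min(len(positives), len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced


# Toy cohort using the rounded counts from the text: 11,000 people, about 950 with asthma.
cohort = [{"id": i, "asthma": i < 950} for i in range(11000)]
balanced = balance_by_undersampling(cohort)

# Share of asthmatic subjects in the full cohort (950 / 11000, roughly 8.6%).
prevalence = 950 / 11000
print(len(balanced), round(prevalence * 100, 1))
```

Undersampling the majority class keeps the two labels equally represented, at the cost of discarding most non-asthmatic subjects; the text does not say whether the random draw was repeated.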
Table 1: Extracted Personal Profile Features
Feature                                     Comment
Age                                         Children are more susceptible than adults
Gender                                      Females are more susceptible than males
Active & Passive Smoking                    Smokers are more susceptible
Occupation, Income, & Education             Lower income and lower education correlate with asthma
Hours of work & Employment Status           Measure of stress
Gas stove ownership, heating & fuel type    Exposure to smoke

Activity data is in the form of a set $D$ of daily diary entry tuples. Each tuple $x$ for the day $j$ contains the activity's start time, duration $d_{x,j}$, activity code $a_{x,j}$, location code $l_{x,j}$, and flags for whether the person was smoking ($s_{x,j}$) and whether the person was breathing heavily ($hb_{x,j}$) during the activity. First, locations are classified into a set $L$ of five non-exclusive categories:

\[ L = \{\text{work}, \text{travel}, \text{home}, \text{indoor}, \text{outdoor}\} \]

Activities are classified into a set $A$ of six non-exclusive categories:

\[ A = \{\text{sleep}, \text{work}, \text{exercise}, \text{walking}, \text{cycling}, \text{leisure}\} \]

For every person $p_i$ in the set of people $P$ and every day $j$, let the set of corresponding diary entries be $D^j_{p_i}$. For every person $p_i$ and every location category $L_k$, the average time spent daily at locations in that category, $t^i_{L_k}$, is computed for diary entries $x$ in $D^j_{p_i}$ over all $j$:

\[ t^i_{L_k} = \operatorname*{mean}_{j} \Big( \sum_{x} d_{x,j} \Big) \quad \{\forall j, x : x \in D^j_{p_i} \text{ and } l_{x,j} \in L_k\} \]

Similarly, for activities, the average time spent daily by $p_i$ performing activities in each activity category $A_k$, denoted by $t^i_{A_k}$, is computed. Also, the average amount of time in the entries where $p_i$ reports heavy breathing is computed as $t^i_{hb}$:

\[ t^i_{hb} = \operatorname*{mean}_{j} \Big( \sum_{x} d_{x,j} \Big) \quad \{\forall j, x : x \in D^j_{p_i} \text{ and } hb_{x,j} = 1\} \]

The average amount of time in the entries where $p_i$ reports smoking is calculated as $t^i_s$ in the same way.

The average count of the number of times $p_i$ reports heavy breathing is recorded as $n^i_{hb}$:

\[ n^i_{hb} = \operatorname*{mean}_{j} \Big( \sum_{x} hb_{x,j} \Big) \quad \{\forall j, x : x \in D^j_{p_i} \text{ and } hb_{x,j} = 1\} \]

Likewise, the average count of activities per day where the smoking flag was 1 is calculated as $n^i_s$. Figure 2 shows twenty five randomly chosen daily diaries from the data.

Figure 2: 25 randomly chosen diary entries from CHAD. Each coloured row on the y-axis represents a day's activities for one person. The color marks the activities, and the x-axis shows time in 15 minute increments.

4.2 Emission Factors Features: $F_E$
PM2.5 emission sources from the National Emission Inventory were used at county level. Different kinds of PM2.5 emissions have different health effects on people. The emission amount by each type of pollution source, like mobile, industrial, dust, fires, and fuel, was used as a feature. The various emission factors are listed in Table 2.

4.3 Air Pollution Features: $F_A$
Historical air pollution data records are in the form of daily average pollutant concentration values at the measuring station level. Since all of our analysis is at county level, we interpolate this point data to region data. This can be done using many techniques, ranging from pure chemical transportation models to enhanced data based techniques [14]. For this experiment, linear interpolation was used. The data used is from the years 2001 to 2014. The pollutants considered are: PM2.5, Ozone, Carbon Monoxide, Sulphur Dioxide, and Nitrogen Dioxide. The weather factors considered are: temperature, pressure, and wind speed.

For each pollutant and weather factor $f$, the following features are extracted:

\[ f^m_{max} = \operatorname*{mean}_{y} \Big( \max_{d} \big( v^{m,y}_{d,f} \big) \Big) \]
\[ f^m_{mean} = \operatorname*{mean}_{y} \Big( \operatorname*{mean}_{d} \big( v^{m,y}_{d,f} \big) \Big) \]
\[ f^m_{min} = \operatorname*{mean}_{y} \Big( \min_{d} \big( v^{m,y}_{d,f} \big) \Big), \]

where $f \in \{\text{PM2.5}, \text{SO}_2, \text{NO}_2, \text{O}_3, \text{CO}, \text{temperature}, \text{pressure}, \text{wind speed}\}$, $y$ is the year between 2001 and 2014, $m$ is the month, $d$ is the day of the month, and $v^{m,y}_{d,f}$ is the daily value of $f$ on that day.

5. MODELS
Consider a feature vector $x \in \mathbb{R}^d$ in the $d$-dimensional feature space, where $y^+$ is the label for subjects with asthma and $y^-$ is the label for subjects without asthma. Our feature ranking model is based on the gradient boosting regression tree [2]. Risk factors are ranked by analyzing the importance of features from this model.
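To make the idea of ranking features by their importance in a boosted tree model concrete, here is a small self-contained sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it uses depth-1 regression trees (stumps), squared-error loss, a fixed learning rate, and measures importance as the total squared-error reduction credited to each feature, none of which is specified in the text. The function names and the toy data (one informative column, one unrelated column) are hypothetical.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Best single-split regression tree on the residuals.

    Returns (gain, feature, threshold, left_value, right_value), or None if
    no feature has two distinct values to split on.
    """
    base = mean(residuals)
    base_sse = sum((r - base) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = mean(left), mean(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            gain = base_sse - sse  # squared-error reduction from this split
            if best is None or gain > best[0]:
                best = (gain, j, thr, lv, rv)
    return best


def rank_features(X, y, n_trees=50, learning_rate=0.1):
    """Boost stumps on squared loss; return normalized per-feature importance."""
    pred = [mean(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, j, thr, lv, rv = stump
        importance[j] += gain
        pred = [p + learning_rate * (lv if row[j] <= thr else rv)
                for row, p in zip(X, pred)]
    total = sum(importance) or 1.0
    return [imp / total for imp in importance]


# Toy data: column 0 separates the two labels, column 1 is unrelated.
X = [[0, 5], [1, 3], [2, 8], [3, 1], [4, 7], [5, 2], [6, 6], [7, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importance = rank_features(X, y)
ranking = sorted(range(len(importance)), key=lambda j: -importance[j])
print(ranking, importance)
```

Because each stump only compares a feature against a threshold, rescaling a feature does not change which splits are chosen, so features on different scales can be ranked together without normalization.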
The gradient boosting tree framework generates an ensemble of weak regression tree models and combines all the weak learners to produce a strong classifier. The training algorithm performs gradient descent in function space to minimize a differentiable loss function. Suppose $F(x)$ is the model; then we have

\[ F(x) = \sum_{i=1}^{M} \lambda_i f_i(x) \tag{1} \]

where $f_i(x)$ is the $i$th weak learner, $\lambda_i$ is the weight associated with that weak learner, and $M$ is the number of weak learners.

One of the advantages of using this model for ranking features in an asthma study is that no normalization of data is needed, and thus it handles categorical data mixed with discrete and continuous data well.

The model hyper-parameters for the gradient boosting classifier (depth and number of trees) are selected using five-fold cross validation. The depth was varied between 1, 2, and 3, and the number of trees was chosen from 50, 100, and 150. After choosing the best hyper-parameters, the learned model was then applied to the remaining subset.

A K nearest neighbor classifier was used to predict whether a person is asthmatic or not, using their features. The parameter K was, again, cross validated on the training data using five folds to select the best K.

Figure 3: Relative importance of features as determined by a gradient boosting tree.

6. RESULTS
In this section we present our results based on the features extracted from the data sources, showing the effectiveness of the identified data sources.

The most significant contribution of our paper is the determination of the relative importance of factors in causing asthma. The twenty most effective features out of around 400 extracted features, and their relative importance, are shown in Figure 3. The top five features correspond to personal factors such as physical exertion and work stress. Following that, environmental features like forest fires and industrial fuel burning show up. Other features that appear in the analysis are smoking habits, passive smoking, and personal exposure to smoke from domestic fuels and heating. Lastly, PM2.5 and SO$_2$ concentrations during summer (July) also appear on the list. These are all, independently, known to be important factors affecting asthma susceptibility. Our ordering suggests that personal factors like physical exertion and stress are strong indicators of asthma risk, followed by exposure to PM2.5 and smoke from the environment. The personal features obtained here are particularly useful because they don't need extensive diary keeping by the subjects. Most of them, like physical exercise time, work hours, time at home, and breath rate, can be easily obtained from wearable sensors and smart-phones. Other personal features, like smoking habits, fuel type, and occupation, can be obtained from one-time questions.

The effectiveness of our features is shown by the good performance of an asthma classifier trained on them. Four K-nearest neighbors classifiers are trained on various subsets of the features: one on just $F_P$, one on $F_P$ and $F_A$, one on $F_P$ and $F_E$, and one on all of $F_P$, $F_E$ and $F_A$. The Area Under the Curve (AUC) metric, which is commonly used in clinical studies [15], is used to evaluate the performance of the classifier. The performance is shown in Table 3. The table shows that asthma prediction based on both environmental features and personal features outperforms prediction using any subset of these features (AUC of 0.911 against 0.860 for the best subset). It may be noted that the recall of the system improves significantly on moving from $F_P + F_E$ to $F_P + F_E + F_A$ (0.807 to 0.927), at the cost of a small drop in precision (0.902 to 0.898).

Table 3: Classifier performance against feature set
Features              Precision   Recall   AUC
$F_P$                 0.818       0.789    0.807
$F_P + F_A$           0.846       0.864    0.853
$F_P + F_E$           0.902       0.807    0.860
$F_P + F_E + F_A$     0.898       0.927    0.911

7. CONCLUSIONS
There is no clear understanding of the causes of asthma and no definite cure, which has drawn increasing attention to it in healthcare studies. Existing studies are often performed on either personal data or environmental data. We proposed a framework on top of EventShop and Personal EventShop that integrates environmental data with personal profile and activity data, and extracts personal and environmental features that determine a person's susceptibility to asthma. A ranking of these features is given based on their potential in causing asthma. The features derived from our analysis agree with the current general consensus in the medical community about factors affecting asthma.

One of the directions for future work is to extend the framework to a personalized system for individual patients by taking in their daily activity data, through Internet connected devices, and environmental data about spatio-temporal factors causing asthma attacks. This can then be used to determine personalized high risk zones for the individual, and personalized risk factor profiles.

8. ACKNOWLEDGMENTS
We thank Xikui Wang for helping extract air pollution features.

9. REFERENCES
[1] A. A. Arif, G. L. Delclos, and C. Serra. Occupational exposures and asthma among nursing professionals. Occup Environ Med, 66(4):274–278, Apr. 2009.
[2] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Statist., 29(5):1189–1232, Oct. 2001.
[3] S. P. Galant, L. J. R. Crawford, T. Morphew, C. A. Jones, and S. Bassin. Predictive value of a cross-cultural asthma case-detection tool in an elementary school population. Pediatrics, 114(3):e307–316, Sept. 2004.
[4] M. Gao, V. K. Singh, and R. Jain. Eventshop: From Heterogeneous Web Streams to Personalized Situation Detection and Control. In Proceedings of the 4th Annual ACM Web Science Conference, WebSci '12, pages 105–108, New York, NY, USA, 2012. ACM.
[5] M. K. Garg, D.-J. Kim, D. S. Turaga, and B. Prabhakaran. Multimodal Analysis of Body Sensor Network Data Streams for Real-time Healthcare. In Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10, pages 469–478, New York, NY, USA, 2010. ACM.
[6] W. Ho, W. Hartley, L. Myers, M. Lin, Y. Lin, C. Lien, and R. Lin. Air pollution, weather, and associated risk factors related to asthma prevalence and attack rate. Environmental Research, 104(3):402–409, July 2007.
[7] C. Y. Hong, T. P. Ng, M. L. Wong, K. T. C. Koh, L. G. Goh, and S. L. Ling. Lifestyle and behavioural risk factors associated with asthma morbidity in adults. Qjm, 87(10):639–645, 1994.
[8] L. Jalali and R. Jain. Building Health Persona from Personal Data Streams. In Proceedings of the 1st ACM International Workshop on Personal Data Meets Distributed Multimedia, PDM '13, pages 19–26, New York, NY, USA, 2013. ACM.
[9] M. JE, A. LJ, and B. CM. National Surveillance of Asthma: United States, 2001-2010. Vital Health Statistics, Nov. 2012.
[10] Y. Kanani Sadat, F. Karimipour, and A. Kanani Sadat. Investigating the Relation Between Prevalence of Asthmatic Allergy with the Characteristics of the Environment Using Association Rule Mining. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XL-2/W3:169–174, Oct. 2014.
[11] C.-H. Lee, J. C.-Y. Chen, and V. S. Tseng. A novel data mining mechanism considering bio-signal and environmental data with applications on asthma monitoring. Computer Methods and Programs in Biomedicine, 101(1):44–61, Jan. 2011.
[12] S. Ram, W. Zhang, M. Williams, and Y. Pengetnze. Predicting Asthma-Related Emergency Department Visits Using Big Data. IEEE Journal of Biomedical and Health Informatics, PP(99):1–1, 2015.
[13] M. T, G. G, S. L, and L. Y. The National Exposure Research Laboratory's Consolidated Human Activity Database. Journal of Exposure Analysis and Environmental Epidemiology, 10(6):566–578, 2000.
[14] M. Tang, P. Agrawal, S. Ponpaichet, and R. Jain. Geospatial interpolation Analytics for Data Streams in EventShop. In IEEE International Conference on Multimedia and Expo (ICME), 2015.
[15] M. H. Zweig and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical chemistry, 39(4):561–577, 1993.