Classifying Web Search Queries to Identify High Revenue Generating Customers

Adan Ortiz-Cordova
329F IST Building, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected]

Bernard J. Jansen
329F IST Building, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected]

Traffic from search engines is important for most online businesses, with the majority of visitors to many websites being referred by search engines. Therefore, an understanding of this search engine traffic is critical to the success of these websites. Understanding search engine traffic means understanding the underlying intent of the query terms and the corresponding user behaviors of searchers submitting keywords. In this research, using 712,643 query keywords from a popular Spanish music website relying on contextual advertising as its business model, we use a k-means clustering algorithm to categorize the referral keywords with similar characteristics of onsite customer behavior, including attributes such as clickthrough rate and revenue. We identified 6 clusters of consumer keywords. Clusters range from a large number of users who are low impact to a small number of high impact users. We demonstrate how online businesses can leverage this segmentation clustering approach to provide a more tailored consumer experience. Implications are that businesses can effectively segment customers to develop better business models to increase advertising conversion rates.

Received July 7, 2011; revised January 6, 2012; accepted January 6, 2012
© 2012 ASIS&T • Published online 21 March 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.22640
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 63(7):1426–1441, 2012

Introduction

Many websites rely on search engines to drive substantial portions of their traffic. Major search engines such as Google, Bing, and Yahoo! use complex algorithms to determine the relevance of a page (Brin & Page, 1998). Websites that appear on the first page of the search results are likely to get more traffic because most users click on first-page results (Jansen & Spink, 2004). These search engines not only drive new visitors, but research has shown that repeat visitors use search engines as navigational tools (Jansen, Spink, & Pedersen, 2005). With search engines being the primary point of entry to the web for many people, the traffic from search engines is vitally important to websites. For online businesses, a visitor to their website could mean a sale, ad revenue, user registration, or exposure to branding.

In the context of web searching, the set of terms for which a user searches is called the query. If a user enters a query and then clicks on a result, these query terms are embedded within the URL that is passed from the search engine to the website. This URL is called the referral URL, and the query terms within the referral URL are called the referral keywords. The webpage pointed to by the link the user clicks is called the landing page. Both the referral URL and referral keywords provide important information to the website owner. Examples of such information include where traffic is coming from (e.g., which search engine), what topics searchers are most interested in, and how a particular landing page is indexed by the search engines. Therefore, it is important to understand and study the search keywords and search phrases that are bringing people from the search engines to the websites (Hackett & Parmanto, 2009). When analyzed appropriately, these referral keywords can provide insightful information about user behavior and user intent, from which website owners can build better business models or provide more relevant content to visitors.

Many websites measure the success of a visit by conversion rate, which is the number of visits that result in users performing the end goal, as defined by the website owner, divided by the total number of visits (Booth & Jansen, 2008). The end goal of a conversion varies depending on the type of website. For websites that sell products, a conversion would be one where a shopper turns into an actual buyer. For a website that relies on contextual advertising, a conversion is one that results in a click on an advertisement. Websites and online businesses are continually looking for ways to improve conversion rate, as it is reported that only about three percent of visits result in a conversion (Betts, 2001).

Contextual advertising is a successful business model for many websites in which they generate revenue by displaying ads that closely match the content of the site's pages (Broder, Fontoura, Josifovski, & Riedel, 2007). If a website owner can determine which types of referral keywords bring in high performing or low performing customers, based on conversion rates for example, the website owner can then optimize the landing pages of the website to increase conversion rates for these consumers through personalized content. This is an example of behavioral targeted advertising, where the advertisements are personalized for users based on their individual web search and browsing behaviors (Yan et al., 2009).

What if, by using the referral keyword, webmasters could predict the onsite behavior of potential consumers sent to the site from search engines? What if the website owner could tell which referral keywords are more likely to generate contextual advertising revenue and how much? What if the website owner could know which referral keywords do not make any revenue and somehow move those users off the site as quickly as possible? What if the website owner could know which referral keywords produce the highest bounce rates and enhance the site in a way that retains the visitor longer than one page view? These are some of the questions that motivate our research. To address these questions, we develop clusters of users based on their referral keywords and the associated behavioral characteristics and attributes on the website.

In the following sections, we first present a review of the literature. We then discuss our research objectives for clustering based on referral keywords. In the methodology section, we review the k-means clustering algorithm, along with the website and data used in this research. We then discuss our results and implications, including how these findings could be used by an online business to improve the consumer experience and the conversion rate on these websites.

Literature Review

The theoretical basis for this research is human information processing, which comprises the methods that people use to acquire, interpret, manipulate, store, retrieve, and classify information (Wilson, 2000). Web analytics is typically the method used for understanding human information processing on the Internet, as there is much user, searcher, and consumer information collected by logs and other means. The Web Analytics Association defines web analytics as the process of measuring, collecting, analyzing and reporting website usage to understand and optimize web usage (Burby, Brown, & WAA Standards Committee, 2007), and the methodological approach has been used in information science, by marketers, and by other researchers to study and gain greater insight into user information behavior (Penniman, 2008; Peters, 1993).

For this research, we are interested in a subset of human information processing, namely, information searching (Jansen & Rieh, 2010; Marchionini, 1995). We are specifically interested in the use of query terms on search engines as indicators of intent, as our assumption is that these query terms could be the basis for segmenting visitors (i.e., potential customers) to a website. Prior work would indicate that this is a valid assumption. For example, Broder (2002) proposed three broad user intent classifications (navigational, informational, and transactional) based on query terms. Using survey results, Broder reports that nearly 26% of queries are navigational and approximately 73% are informational or transactional, with an estimated 36% being transactional. (Note: The researcher placed some queries into multiple categories.) Then, based solely on log analysis, Broder reports that 48% of the queries were informational, 20% navigational and 30% transactional. (Note: We assume the missing 2% were unclassifiable or the result of rounding.)

In similar work, Rose and Levinson (2004) classified search engine queries using the categories of informational, navigational, and resource, along with hierarchical subcategories of each. In determining the user intent, the researchers investigated using just the searcher's query, the results the searcher clicked on, and subsequent queries. Rose and Levinson (2004) reported that approximately 62% of the queries are informational, 13% navigational, and 24% resource. The researchers report only small differences in results when using the additional information beyond the query. However, like that of Broder (2002), this research was based on logs from the search engine and not the landing page website.

Researchers have also examined automatically classifying intent, which is related to the research that we propose here. For example, Lee, Liu, and Cho (2005) automatically classified informational and navigational queries using 50 queries collected from computer science students at a U.S. university. Kang and Kim (2003) classified queries as either topic or homepage using several iterations of classification. The researchers report a classification rate of 91% using selected TREC topics (50 topic and 150 homepage finding) and portions of the WT10g test collection. Dai, Nie, Wang, Zhao, Wen, and Li (2006) examined classifying whether or not a web query has commercial intent, noting that 38% of search queries have commercial intent. Baeza-Yates, Calderón-Benavides, and González-Caro (2006) used supervised and unsupervised learning to classify 6,042 Web queries as either informational, not informational, or ambiguous. Jansen, Booth, and Spink (2008) provided a comprehensive automated multilevel analysis, reporting a 74% success rate in user intent classification using a decision tree approach. These approaches are similar to work by Özmutlu, Çavdur, and Özmutlu (2006) that focused on topical classification.

However, these prior works all focused on search engine data and not the corresponding user behavior on the landing page website, which could provide additional insights. For example, Nettleton, Calderon, and Baeza-Yates (2006) used 65,282 queries along with click stream data and clustered these queries based on various parameters to label queries as informational, navigational, or transactional. Fujii (2008) presented a method for identifying navigational queries by comparing them to the anchor text in webpages, using 127 informational and 168 navigational queries. The researcher reported that anchor text can be used for query classification.

trajectory and time spent at each page. Phippen, Sheppard, and Furnell (2004) state that companies can coordinate and audit website design by understanding user behavior using an array of web metrics.
Cao and colleagues (2009) used a set of previous queries from a user session as well as the webpages retrieved by these queries to topically classify queries based on a taxonomy of web topics. Kathuria, Jansen, Hafernik, and Spink (2010) used k-means clustering to automatically cluster web queries into eight different clusters, six informational, one transactional, and one navigational.

However, these prior works have focused solely on identifying user intent of query terms. In this research, we extend the line of inquiry by examining (and predicting) actual user behavior on a website by clustering referral keywords based on similar onsite behaviors. So, our research provides a linkage between the user intent work focused on query terms and the consumer behaviors on the destination website. Based on the prior work showing that different query terms are implicit indicators of intent, it would seem reasonable that these query terms also act as gauges of different user behaviors because the underlying intent may be different. In fact, there is prior work that suggests this linkage. With general web searching, researchers have developed different classifications depending on the users' browsing behaviors or the queries entered (Carmel, Crawford, & Chen, 1992; Jansen et al., 2008; Marchionini, 1995; Rozanski, Bollman, & Lipman, 2001).

Given that visitors to a website (who are first searchers on the search engine and may be viewed later as potential customers by an online business) arrive via different keywords, it is reasonable to assume they might exhibit different behaviors on the website. If so, there would be potential value in segmenting these visitors based on these keywords and behaviors. Such user-generated data can endow businesses with valuable information about understanding users better and indicate needed modifications or enhancements in web systems (Jansen, 2009, p. 2). In fact, Carmel, Crawford, and Chen (1992) classified users into three different categories using a verbal protocol analysis technique. Cheung, Kao, and Lee (1998) used web tools that analyzed web user data that allowed them to learn users' access patterns. Buchner, Mulvenna, Anand, and Hughes (1999) showed how mining web server traffic to discover patterns of user access could be used for the marketing and management of e-business and e-services. Rozanski, Bollman, and Lipman (2001) segmented Internet users into seven different categories by analyzing click-stream data and exploring session characteristics. However, this study only used four session variables (session length, time per page, category concentration, and site familiarity) to segment the users. Chen and Cooper (2001, 2002) used clustering and stochastic modeling to detect usage patterns in a web information system. Banerjee and Ghosh (2001) clustered users using a weblog on a website. The study found six different clusters using weighted longest common subsequences that took into account the

By analyzing web usage patterns to segment users, one might be able to modify or enhance web systems in a way that effectively caters to segmented Internet traffic. Websites could then offer services in a more personalized way to their users. In addition, segmentation of online visitors allows advertising networks to behaviorally target advertisements.

Unlike other forms of Internet advertising, such as sponsored search advertising or content advertising, behavioral targeting is the practice of displaying advertisements based on past user behaviors. Advertising has the potential to be much more effective when using information science concepts such as relevance (Saracevic, 1975). As Yan and fellow researchers (2009) note, users who click on the same advertisement exhibit similar behaviors on the web. Therefore, the clickthrough rate of an online advertisement can be significantly improved by segmenting users. Yan et al. (2009) also note that segmenting using short-term behaviors is more effective for behavioral advertising than using long-term behaviors. So, by segmenting visitors, not only can websites offer more personalized services, but also the clickthrough rate of the advertisements on those websites can be improved, leading to more revenue being generated.

Despite the research on web searching, which drives much of the traffic to websites, there is little published research that attempts to find out how effective various segments of traffic from search engines really are or how users are behaving after arriving at the site from a search engine search. Addressing this issue has significant ramifications for areas from information science (e.g., information relevance and usefulness) to marketing (e.g., advertising and search engine optimization).

In this study, we investigate whether or not one can segment customers to a website based on user behavior. We leverage keyword referral data to a Spanish music website and user behaviors collected via an analytics program.

Research Objective

Our research objective is to automatically classify a large set of referral keywords into unique clusters that are meaningful for an online business.

The motivation for this research objective is to demonstrate whether or not segmenting the market by referral keywords can provide actionable intelligence for online businesses. Market segmentation is the process of dividing a market along some similarity where the market segments have something in common (Thomas, 2007), and it is considered important for tailoring aspects of a business, such as marketing and advertising, to particular customer groups.

To investigate this research objective, we use a referral keyword log from an online business website. In addition to the referral keywords, we also collect online consumer behaviors associated with these keywords, such as page views, time on site, and revenue generated. Based on the attributes, we employed a k-means clustering algorithm (Kanungo et al., 2002) to segment clusters of customers based on referral keywords and the onsite behaviors associated with those keywords. K-means clustering is a categorizing and labeling algorithm based on means of groups of similar data points.

Research Design

We first present our data collection site.

Data Collection From BuenaMusica.com

For this research study, we collected data from BuenaMusica.com, a Spanish-based entertainment business. The website offers customers the ability to play songs, watch music videos, look up song lyrics, read artist biographies, check the latest artist news, and communicate with other users in chat rooms, as well as listen to streaming radio. The site is financially supported by revenue generated through advertisements of Google AdSense, which is an advertising service where the website (i.e., publisher) allows Google to post advertisements on the site in exchange for a portion of the advertising revenue Google receives. This form of contextual advertising is a primary revenue source for many websites. Figure 1 shows a sample of BuenaMusica.com's homepage, illustrating the site's interface and features during the data collection period.

FIG. 1. BuenaMusica.com homepage.

At the time of the study (June 1 to October 31, 2010), the Google search engine had indexed a total of 116,000 pages of the domain www.BuenaMusica.com. Alexa.com, a web traffic reporting company, had assigned BuenaMusica.com a worldwide traffic rank of 26,178. According to Alexa.com, the site is particularly popular in South America where it is in the top 1,000 visited sites in three countries. In Nicaragua, it is ranked 487, in Guatemala 919, and in Venezuela 929. In the United States, it has a 33,354 traffic rank, and in China, it has a 106,573 traffic rank. So, the site is a well-trafficked website from a multitude of countries and, therefore, it is a good candidate for our research. Further breakdown of worldwide traffic rank for the site is shown in Table 1.

TABLE 1. Worldwide traffic rank according to Alexa.com.

BuenaMusica.com worldwide traffic rank
Country                 Rank
Nicaragua               487
Guatemala               919
Venezuela               929
Honduras                1,038
Dominican Republic      1,567
Bolivia                 1,570
Ecuador                 2,619
Colombia                3,132
Peru                    3,428
Mexico                  4,278
Costa Rica              7,943
Argentina               16,013
Spain                   29,098
United States           33,354
China                   106,673

TABLE 2. Visitors by country breakdown according to Alexa.com.

Visitors by country
Country                 Percent of site traffic
Venezuela               22.1
Mexico                  19.0
United States           13.7
Colombia                7.9
Peru                    6.0
Dominican Republic      4.7
Guatemala               4.3
Spain                   3.7
Nicaragua               3.1
China                   2.7

The percentage of visitors by country according to Alexa.com can be seen in Table 2. According to Alexa.com, visitors from Venezuela, Mexico and the United States make up 54.8% of total site traffic.

Concerning demographics, Figure 2 provides a breakdown of the site's U.S. visitor demographics based on data from Quantcast.com, a widely used web traffic firm that provides web analytics data.

FIG. 2. U.S. demographics breakdown according to Quantcast.com.

As seen in Figure 2, the majority of users in the United States who visit BuenaMusica.com are Hispanics between the ages of 13 and 34. In addition, 69% of U.S. users do not have a college education and almost half make less than $30,000 a year. However, this is certainly tied to some degree with the age of the website users, since 36% of users are under the age of 18.

Figure 3 shows the global traffic frequency of users according to Quantcast.com.

From Figure 3, 76% of users are passers-by, meaning that they have a single visit over the course of a month. The second row of Figure 3 shows that 24% of users are regulars who visit the site more than once but less than 30 times a month. Less than 1% of users are addicts who visit the site more than 30 times per month. So, the traffic breakdown is fairly typical, in that there are a lot of occasional users, a fair portion of frequent users, and a small number of core users.

Data Collection and Preparation

The data used for this research study were collected using Google Analytics, an online website analytics tool, which is widely used in the industry. This web-based tool generates detailed statistics and reports about traffic and visitors to a website. Given the wide use of Google Analytics, we expect the procedures used in this research to be implementable by many online businesses and other websites. Google Analytics is integrated into a site by a page tag. A snippet of JavaScript code, known as the Google Analytics Tracking Code (GATC), is embedded on every page of the website. This code has a unique identification tag that identifies the website with the Google Analytics account holder. Whenever the page is loaded, the snippet of code runs, collects visitor data, and sends it to Google servers for processing and aggregation. The statistics collected range from the time a user spent on the site, to the number of page views, browser used, operating system of the computer, as well as screen resolution of the computer monitor. All of this information is available to the account holder via the Google Analytics interface.

Additionally, the analytics tool collects the referral information associated with the particular website. Referrals are page visits from different websites or search engines that direct traffic to a particular website via a hyperlink. A referral URL is the web address of the search engine or website that is directing traffic to another site. For example, say site A (e.g., http://www.twitter.com/buena_musica) has a link to site B (e.g., http://www.buenamusica.com). The traffic that site B receives from site A via the link is called referral traffic. The URL of site A where the link is placed is called the referral URL. In our example, the referral URL would be http://www.twitter.com/buena_musica. A similar process happens for traffic directed from web search engines.

Figure 4 features a screenshot of a typical Google search results page.

TABLE 3. Google Analytics referral keyword log with browsing behavior attributes.

Looking at Figure 4, the phrase inside the green box is the query ("buena musica") submitted by the searcher. The text inside the blue box is the URL of the results page (http://www.google.com/search?hl=en&q=buena+musica). Embedded in this URL are the terms of the query (e.g., buena+musica). When the user clicks on a result, the URL in the blue box gets passed to the website pointed to by the link that the user has clicked on (i.e., the specific webpage that is called the landing page). If the landing page has an analytics tracking tool, this tool can collect and analyze the information within the referral URL for aggregation and analysis. Referral keywords and the referral URL provide invaluable insight to webmasters and website owners. The referral URL provides information such as where traffic is originating, and the referral keywords provide insights into what the users are searching for that ultimately leads them to the website. Google Analytics is particularly useful for the
research presented here since it collects traffic source statistics including referral keywords along with user behaviors on the website associated with each referral keyword. Each referral keyword is collected with the following attributes:

• Visits: the number of visits to the site that the keyword generated in a given time period
• Pages per visit: the average number of pages viewed during a visit
• Average time on site: the average duration of a visit to the site
• Percentage of new visits: the percentage of visits by people who have never visited the site before (within a given period and based on IP address and cookie)
• Bounce rate: the percentage of single-page visits (i.e., visits where the user left the site from the landing page without browsing other pages)

Each of these attributes can be aligned to the referral keyword that brought the users to the website. Table 3 shows an example of a Google Analytics referral keyword log:

Keyword            Visits     Pages per visit   Average time on site   New visits   Bounce rate
buenamusica        773,030    5.81              685.18                 0.35         0.29
buenamusica.com    688,533    6.26              750.57                 0.32         0.26
Buenamusica        318,636    5.50              717.22                 0.32         0.28
Musica             203,509    6.37              561.75                 0.63         0.28

Content Advertising

BuenaMusica.com is supported by revenue generated by advertisements from Google AdSense, which is a service where website owners can offer advertisements based on the website's content. This service is implemented similarly to Google Analytics: snippets of code are placed on web pages and, when the pages are viewed, advertisements are displayed. Revenue is generated when a visitor clicks on an advertisement. The AdSense account publisher (e.g., BuenaMusica.com) makes a quantity of money based on the amount the advertiser is paying Google to service a particular advertisement. This quantity can range from a fraction of a penny per click to a couple dollars or more per click. The amount of money depends on the content of the page, the user's demographics, the past performance history of the site, and other factors.

Because both Analytics and AdSense are Google products, one can integrate them to share data. This integration is done in the Google Analytics web interface where the account holder can allow the Analytics account to receive AdSense data. When enabled, the Google Analytics program collects the ad revenue information of the site from the AdSense account, aligning these attributes with referral keywords.

For this research, we configured Google Analytics and AdSense applications to share data, allowing the referral keywords to have not only the website attributes but also the additional AdSense revenue data. The AdSense attributes are as follows:

• Revenue: the total amount of revenue generated for the website
• Ads clicked: the number of ads clicked
• Page impressions: the number of viewed pages where ads were displayed
• CTR (clickthrough rate): the ratio of the number of ads clicked on to the number of ads viewed
• eCPM (effective cost per thousand impressions): the estimated revenue from AdSense per thousand ad page views

Table 4 is an example of a referral keyword log with additional revenue attributes:

TABLE 4. Google Analytics referral keyword log with revenue and Google AdSense attributes.

Keyword            Revenue     Ads clicked   Page impressions   CTR      eCPM
buenamusica        1,138.95    37,433        4,116,684          0.0091   0.27
buenamusica.com    883.52      33,518        3,974,888          0.0084   0.22
Buenamusica        336.61      10,146        1,611,535          0.0062   0.20
Musica             687.77      30,224        1,170,304          0.0246   0.55

Note. CTR = clickthrough rate.

From Table 4, keyword is the referral keywords (i.e., the query that the user submitted to the search engine). Revenue is the amount of ad revenue generated by consumers who arrived at the site from a search engine using the referral keywords within a given period. Ads clicked is the number of times those consumers clicked on ads displayed on the site, which generates the revenue. Page impressions is the number of ads displayed to these consumers. CTR (clickthrough rate) is ads clicked divided by page impressions. eCPM is the cost per 1,000 ad displays to these consumers.

FIG. 3. Global traffic frequency breakdown according to Quantcast.com.

FIG. 4. Google search results page example.

Data Methodology

We now discuss our research data.

Data Collection

We collected data on referral keywords, visitor traffic, and advertising revenue data on BuenaMusica.com from June 1, 2010 through October 31, 2010. A total of 900,795 referral keyword records were collected during the 5-month data collection period. We extracted each month's data individually from the Google Analytics web interface in a tab delimited file format with 20,000 keywords in each batch, since Google Analytics limited the size of each export. We then imported these batches into a relational database for data normalization and aggregation. Some referral keywords were repeated within each batch. After normalization and aggregation, the total number of keywords used for data analysis was 712,643. Of the 900,795 keywords, 188,152 were duplicates of some sort. We discuss data normalization and aggregation in more detail in the data preparation section.

Data Preparation

Our first step in analysis was to normalize all the attribute values since some were ratios (e.g., bounce rate, CTR and eCPM) and some were absolute numbers (e.g., pages per visit, time on site, new visits, bounce rate, revenue, ads clicked). Additionally, high traffic keywords such as "buena musica" (i.e., branded keywords in the search engine marketing realm) or "musica" bring in thousands of visits to the site as opposed to other queries that only bring in a few visits. Because of the nature of the k-means clustering algorithm, clustering using ratios and absolute numbers would skew the results (i.e., comparing apples to oranges). To address this issue and get a more accurate representation of the keyword attributes, we clustered using attributes that were ratios or percentages. However, we wanted to use all of the attributes possible, so we created ratios using the attributes that were absolute numbers. The additional three ratios that we generated for this research are as follows:

• Average revenue: the average revenue the site makes from a visit based on a given referral keyword. We calculated this ratio using: total revenue / number of visits
• Average ads clicked: the average number of ads clicked based on a given referral keyword. We calculated this ratio using: ads clicked / number of visits
• Average impressions: the average number of ad impressions based on a given referral keyword. We calculated this ratio using: page impressions / number of visits

within those 30 minutes. If a user visited the site within 30 minutes and then searched using a term that brought him/her back to the site within those 30 minutes, a visit would not be recorded under that or any subsequent keywords.

For referral keywords that had zero visits, we calculated the three additional ratios using the following formula: IF (visits>0, value/visits, value). The formula states that if the number of visits is greater than zero, divide the value (revenue, ads clicked, and impressions) by the number of visits. Otherwise, simply copy the value to the respective averages field.
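The three derived ratios and the zero-visit rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the spreadsheet the study used: the `KeywordRow` type, its field names, and the helper names are hypothetical, and only the ratio definitions and the IF rule come from the text.

```python
from dataclasses import dataclass


@dataclass
class KeywordRow:
    """One aggregated referral keyword (hypothetical field names)."""
    keyword: str
    visits: int
    revenue: float
    ads_clicked: int
    page_impressions: int


def per_visit(value: float, visits: int) -> float:
    """Mirror of IF(visits > 0, value / visits, value)."""
    return value / visits if visits > 0 else value


def derived_ratios(row: KeywordRow) -> dict:
    """Compute the three per-visit ratios used as clustering attributes."""
    return {
        "average_revenue": per_visit(row.revenue, row.visits),
        "average_ads_clicked": per_visit(row.ads_clicked, row.visits),
        "average_impressions": per_visit(row.page_impressions, row.visits),
    }
```

For a keyword with zero recorded visits the raw values are copied through unchanged, which avoids a division by zero while keeping the revenue and click information attached to the keyword.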
Using these three additional ratios, we were able to cluster using all attributes without using raw attributes such as visits or ads clicked that would skew the results. At the same time, we did not lose any of the additional information provided by the data collection applications.

It is worth mentioning that some keywords in the logs showed that they bring in zero visits. This is because a visit is not recorded if the user has remained inactive for more than 30 minutes or has cleared the browser's cache

Once we had prepared the data set, the final k-means clustering was done with the following nine attributes associated with each referral keyword: pages per visit, average time on site, percentage of new visits, bounce rate, CTR, eCPM, average revenue, average ads clicked, average impressions.

Once the formula was applied to all keywords, the spreadsheet was imported into SPSS, a statistical analysis computer program. After the keywords were aggregated, we implemented k-means clustering on the entire data set.

Data Methodology and Analysis

Clustering of the data was done using the k-means clustering algorithm. This particular algorithm uses an unsupervised learning technique that makes it ideal for clustering big data sets. The objective of the algorithm is to segment n items (keyword attributes in our case) into k clusters where each item (i.e., keyword) belongs to the cluster that is of the nearest mean. Items in the same cluster are most similar to each other and most dissimilar to those in other clusters.

elbow method (Sugar & James, 2003), which is based on the amount of variance in the data that is explained by adding an additional cluster.

To check for the effect of data order on clustering results, we did a three-fold cross validation, ending with six clusters on each validation. Given this result, we believe our methodological approach to be valid.

Each of the six clusters represents a grouping of keywords or search key phrases that share commonality among the attributes specified. Each cluster is also dissimilar from the other clusters.
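One common way to quantify the variance explained by a clustering, which is the quantity an elbow criterion tracks as clusters are added, is one minus the ratio of the within-cluster sum of squares to the total sum of squares. The sketch below assumes that definition; the study does not state the exact statistic it used, and the function name and argument layout are hypothetical.

```python
def explained_variance(points, labels):
    """Return 1 - SSE_within / SSE_total for a given cluster assignment.

    points: list of equal-length numeric tuples (one per keyword)
    labels: list of cluster ids, one per point
    """
    dims = len(points[0])

    def mean(rows):
        # Component-wise mean of a list of points.
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sse(rows, center):
        # Sum of squared distances from each point to the given center.
        return sum((r[d] - center[d]) ** 2 for r in rows for d in range(dims))

    total = sse(points, mean(points))
    within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within += sse(members, mean(members))
    return 1.0 - within / total
```

Computing this value for each candidate number of clusters and looking for the point where the gain from one more cluster flattens out is the usual reading of an elbow plot.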
the attributes specified. Each cluster is also dissimilar to the other clusters.

Results

Table 5 shows the frequency of each cluster. Each row is a different cluster and the two columns are the frequency and percent of that cluster relative to the entire dataset.

Table 5 shows an uneven distribution of the cluster frequency. Cluster 1 is the biggest cluster with 83.3%, followed by cluster 2 with 11.3%. The next biggest cluster is cluster 5 with 3.9%, followed by cluster 4 with 1.2%. Cluster 3 and cluster 6 had 0.3% and less than 0.1%, respectively.

Table 6 shows the final cluster centers (i.e., means) and standard deviation (SD) for each cluster. Each column is a different cluster and each row is an attribute of that cluster. The final cluster table provides insightful information about user behavior based on web factors, giving us a consolidated snapshot of the cluster groupings.

We now discuss each cluster and the implications for online businesses by categorizing each cluster along two axes, onsite behavior and revenue generation. Onsite behavior addresses the engagement of the visitor while on the site (e.g., percentage of new visits, pages per visit, average time on site, bounce rate, CTR, average ads clicked, and average

The k-means clustering algorithm attempts to maximize the separation between the cluster means while minimizing the standard deviation within each cluster. The algorithm uses k centroids. Centroids are the points that serve as the centers of the clusters; they are initially chosen at random from the data. The Euclidean distance is determined by:

$$D_{ij} = \sum_{k=1}^{n} (x_{ki} - x_{kj})^2$$

where
$D_{ij}$ = distance between cases i and j
$x_{ki}$ = value of variable $X_k$ for case i
$x_{kj}$ = value of variable $X_k$ for case j

Below is a step-by-step breakdown of the algorithm:

1. Randomly choose k centroids and use them as initial centroids (centers).
2. For each item, locate the closest center and assign the item to the cluster that the nearest centroid belongs to.
3. Update the centroid of each cluster from the items in that cluster.
   a. The new updated centroid will be the mean (average) of all items that belong to that cluster.
4. Repeat steps 2 and 3 until no item switches clusters.
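The steps above can be sketched in plain Python as follows. This is an illustration only: the study ran k-means in SPSS, and the function names, the fixed random seed, the default cap of 20 iterations, and the toy data are assumptions made for this example. Distances use the sum of squared differences as written in the formula; taking its square root would not change which centroid is nearest.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """D_ij: sum over the variables of the squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def k_means(points: List[Point], k: int, max_iterations: int = 20,
            seed: int = 0) -> Tuple[List[int], List[Point]]:
    rng = random.Random(seed)
    # Step 1: randomly choose k items as the initial centroids.
    centroids: List[Point] = [tuple(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iterations):
        # Step 2: assign each item to the cluster of its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        # Step 4: stop once no item switches clusters.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the items assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Toy data: hypothetical (pages per visit, bounce rate) pairs in two groups.
data = [(1.0, 0.9), (2.0, 0.8), (1.5, 0.7), (9.0, 0.1), (10.0, 0.2), (9.5, 0.0)]
labels, centroids = k_means(data, k=2)
```

In this toy run the first three items end up in one cluster and the last three in the other. With real attributes on very different scales (seconds on site versus rates between 0 and 1), the attribute with the largest range would dominate the distance, so a practical run would need to consider scaling.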
For data analysis, we experimented with a minimum of two cluster groupings and a maximum of 10 cluster groupings. Early analysis and comparison of frequency numbers for each cluster showed that in all of those cluster groupings, there was always one cluster that had only two keywords. Upon examination, these two keywords were determined to be outliers because they had a very large average time on site (about 35,000 seconds). Considering that we were clustering over 700,000 keywords, we believed that removing these two outlier keywords would not affect our end results, so those two keywords were removed from our data to prevent skewing the clusters.

We re-ran the k-means clustering from 2 to 10 clusters (up to 20 iterations for each clustering attempt) and saw that the frequency distribution for each cluster was more realistic based on our known traffic patterns reported above. After analysis and comparison between these nine cluster groupings, the group of six clusters for the 712,643 referral keywords most accurately described the customer segmentation while showing the maximum differences between each cluster based on the

impressions), and the revenue generation addresses the business concern from the perspective of the website owner (e.g., eCPM, average revenue). For both onsite behavior and revenue generation, we classify each cluster as high, medium, or low relative to the other clusters.

TABLE 5. Frequency and percentages of six clusters.

Cluster   Frequency   Percent
1         593,532     83.3
2         80,583      11.3
3         2,069       0.3
4         8,399       1.2
5         27,715      3.9
6         343         ~0
Total     712,641     100.0

TABLE 6. Final cluster centers of each cluster. Values are mean (SD).

Attribute                     Cluster 1       Cluster 2        Cluster 3          Cluster 4          Cluster 5          Cluster 6
Pages per visit               1.59 (1.42)     5.61 (4.4)       34.59 (28.5)       18.32 (16.7)       9.32 (8.9)         71.74 (58.1)
Average time on site (sec.)   28.74 (56.20)   474.15 (168.7)   4,780.92 (899.1)   2,508.16 (475.7)   1,252.47 (280.3)   9,462.36 (2,463.6)
Percentage of new visits      0.81 (0.37)     0.74 (0.4)       0.64 (0.4)         0.66 (0.4)         0.69 (0.4)         0.66 (0.5)
Bounce rate                   0.67 (0.44)     0.11 (0.2)       0.01 (0.1)         0.03 (0.1)         0.05 (0.1)         0.004 (0.04)
CTR                           0.02 (0.13)     0.02 (0.1)       0.009 (0.04)       0.015 (0.07)       0.022 (0.09)       0.006 (0.03)
eCPM                          0.72 (10.26)    0.95 (9.0)       0.35 (1.7)         0.63 (6.0)         0.84 (6.4)         0.25 (1.0)
Average revenue ($)           0.001 (0.04)    0.004 (0.02)     0.009 (0.04)       0.007 (0.04)       0.005 (0.04)       0.014 (0.05)
Average ads clicked           0.035 (0.19)    0.11 (0.4)       0.18 (0.5)         0.17 (0.5)         0.14 (0.4)         0.26 (0.6)
Average impressions           1.6 (1.80)      5.11 (4.3)       31.74 (28.1)       16.65 (16.5)       8.4 (8.6)          63.96 (59.2)
Classification: engagement    Low             Low              High               Medium             Medium             High
Classification: revenue       Low             Medium           High               Medium             Low                High

Note. SD = standard deviation. CTR = clickthrough rate.

Cluster 1—Low Engagement, Low Revenue

This cluster was both low engagement and low revenue. With 593,532 keywords, cluster 1 accounted for 83.3% of the dataset. Of the visits in this cluster, 81.21% were new visits (i.e., first-time visitors within the preceding 30 days) and about 67% of visits were bounces (i.e., the visitor viewed only one page). For cluster 1, users viewed 1.59 pages per visit, by far the lowest of any cluster. Users in this cluster generally spent 28 seconds on the site per visit. The number of ad page impressions per visit was 1.60 and the ad clickthrough rate was 2.2%. In terms of revenue, this cluster generated $0.001 in revenue per visit, with users in this cluster clicking on the fewest ads. One thousand ad page views (eCPM) in this cluster generated $0.25. So, for engagement, those visiting the website in this cluster are typically new visitors, who spend very little time on the site and visit very little content. In terms of revenue, visitors in the cluster generated by far the lowest revenue.

Search phrases in cluster 1 typically comprised natural queries that are expressed in the form of a question, usually asking for some specific information. The information that queries looked for ranged from an artist’s nationality to the names of members of bands and who wrote specific songs. For example, there were queries that looked for specific information about an artist such as “what is the real name of {artist name},” “who is {artist name},” “where is {artist name} from.” The queries asking for an artist’s nationality were composed of “nationality {artist name}.” Queries asking for names of members of a specific band were composed of “names of the members of {band name}.” Cluster 1 also had queries that were in the form of requests such as “I want to know” followed by any of the aforementioned examples. In addition, there were queries that wanted to know who wrote specific songs such as “who wrote {song name}.” Also, there were search phrases that looked for lyrics of songs, and the query was composed of “lyrics of the song {song name}.” In addition, there were queries that actually seemed to be malformed URLs. These queries begin with “www.” followed by any combination of genre, song name, or video name with a space or typo somewhere in the middle of the text between the “www” and the “.com.” (Note: We provide the English translation as most of the referral keywords were in Spanish).

For the online business, these customers are the ones that generate most of the traffic but generate the lowest average revenue per visit. In some aspects, they are primarily a cost, as they use server cycles, access information, and generate little revenue. However, they represent a significant portion of the traffic to the website, which aids in website ranking. So, they do provide some indirect benefit.

Cluster 2—Low Engagement, Medium Revenue

Cluster 2 comprised low-engagement, medium-revenue users. There were 80,583 keywords in cluster 2, which accounted for 11.3% of the dataset. For cluster 2, users viewed 5.61 pages per visit, higher than cluster 1 but lower than that of the other clusters. Users in this cluster generally spent 474 seconds on the site per visit and 84 seconds per page view. Of the total, 74% of visits in this cluster were new visits, and about 11% of visits were bounces, which is a very low bounce rate. The number of ad page impressions per visit was 5.11 and the ad clickthrough rate was 2.76%, resulting in $0.004 in revenue per visit. One thousand ad page views (eCPM) in this cluster generated $0.95. In terms of engagement, visitors in this cluster were most like cluster 1. In terms of revenue, these visitors were most like cluster 4.

Search phrases in cluster 2 also consisted of queries that looked for information using broader terms than those in cluster 1. The queries looked for information on artists about their music productions, such as discographies, and less about their personal lives as in cluster 1. Queries in this cluster had a variety of combinations and were ordered in a variety of ways that included an artist name and the term biography or discography. For example, some of the queries were composed of “biography {artist name}” or “discography {artist name}.” In addition to wanting to know artists’ information,
Description: