
Deployable Machine Learning for Security Defense: Second International Workshop, MLHat 2021, Virtual Event, August 15, 2021, Proceedings (Communications in Computer and Information Science) PDF

163 Pages·2021·9.763 MB·English

Gang Wang, Arridhana Ciptadi, Ali Ahmadzadeh (Eds.)

Communications in Computer and Information Science 1482

Deployable Machine Learning for Security Defense
Second International Workshop, MLHat 2021
Virtual Event, August 15, 2021
Proceedings

Editorial Board Members
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Ashish Ghosh, Indian Statistical Institute, Kolkata, India
Raquel Oliveira Prates, Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil
Lizhu Zhou, Tsinghua University, Beijing, China

More information about this series at http://www.springer.com/series/7899

Editors
Gang Wang, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Arridhana Ciptadi, Truera Inc., Redwood City, CA, USA
Ali Ahmadzadeh, Blue Hexagon Inc., Sunnyvale, CA, USA

ISSN 1865-0929, ISSN 1865-0937 (electronic)
ISBN 978-3-030-87838-2, ISBN 978-3-030-87839-9 (eBook)
https://doi.org/10.1007/978-3-030-87839-9

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

In recent years, we have seen machine learning algorithms, particularly deep learning algorithms, revolutionizing many domains such as computer vision, speech, and natural language processing. In contrast, the impact of these new advances in machine learning is still limited in the domain of security defense. While there is research progress in applying machine learning for threat forensics, malware analysis, intrusion detection, and vulnerability discovery, there are still grand challenges to be addressed before a machine learning system can be deployed and operated in practice as a critical component of cyber defense. Major challenges include, but are not limited to, the scale of the problem (billions of known attacks), adaptability (hundreds of millions of new attacks every year), inference speed and efficiency (compute resource is constrained), adversarial attacks (highly motivated evasion, poisoning, and trojaning attacks), the surging demand for explainability (for threat investigation), and the need for integrating humans (e.g., SOC analysts) in the loop.

To address these challenges, we hosted the second International Workshop on Deployable Machine Learning for Security Defense (MLHat 2021). The workshop was co-located with 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021).
This workshop brought together academic researchers and industry practitioners to discuss the open challenges, potential solutions, and best practices to deploy machine learning at scale for security defense. The goal was to define new machine learning paradigms under various security application contexts and identify exciting new future research directions. The workshop had a strong industry presence to provide insights into the challenges in deploying and maintaining machine learning models and the much-needed discussion on the capabilities that state-of-the-art systems have failed to provide.

The workshop received seven complete submissions as "novel research papers". All of the submissions were single-blind. Each submission received three reviews from the Technical Program Committee members. In total, six full papers were selected and presented during the workshop.

August 2021
Gang Wang
Arridhana Ciptadi
Ali Ahmadzadeh

Organization

Organizing and Program Committee Chairs
Gang Wang, University of Illinois at Urbana-Champaign, USA
Arridhana Ciptadi, Truera, USA
Ali Ahmadzadeh, Blue Hexagon, USA

Program Committee
Sadia Afroz, Avast, USA
Siddharth Bhatia, National University of Singapore, Singapore
Wenbo Guo, Pennsylvania State University, USA
Zhou Li, UC Irvine, USA
Fabio Pierazzi, King's College London, UK
Alborz Rezazadeh, LG AI Research Lab, Canada
Gianluca Stringhini, Boston University, USA
Binghui Wang, Duke University, USA
Ting Wang, Pennsylvania State University, USA

Contents

Machine Learning for Security

STAN: Synthetic Network Traffic Generation with Generative Neural Models
Shengzhe Xu, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan

Machine Learning for Fraud Detection in E-Commerce: A Research Agenda
Niek Tax, Kees Jan de Vries, Mathijs de Jong, Nikoleta Dosoula, Bram van den Akker, Jon Smith, Olivier Thuong, and Lucas Bernardi

Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-trained Language Models
Guanqun Yang, Shay Dineen, Zhipeng Lin, and Xueqing Liu

Malware Attack and Defense

DEXRAY: A Simple, yet Effective Deep Learning Approach to Android Malware Detection Based on Image Representation of Bytecode
Nadia Daoudi, Jordan Samhi, Abdoul Kader Kabore, Kevin Allix, Tegawendé F. Bissyandé, and Jacques Klein

Attacks on Visualization-Based Malware Detection: Balancing Effectiveness and Executability
Hadjer Benkraouda, Jingyu Qian, Hung Quoc Tran, and Berkay Kaplan

A Survey on Common Threats in npm and PyPi Registries
Berkay Kaplan and Jingyu Qian

Author Index

Machine Learning for Security

STAN: Synthetic Network Traffic Generation with Generative Neural Models

Shengzhe Xu¹, Manish Marwah², Martin Arlitt², and Naren Ramakrishnan¹
¹ Department of Computer Science, Virginia Tech, Arlington, VA, USA
[email protected], [email protected]
² Micro Focus, Santa Clara, CA, USA
{manish.marwah,martin.arlitt}@microfocus.com

© Springer Nature Switzerland AG 2021
G. Wang et al. (Eds.): MLHat 2021, CCIS 1482, pp. 3–29, 2021.
https://doi.org/10.1007/978-3-030-87839-9_1

Abstract. Deep learning models have achieved great success in recent years but progress in some domains like cybersecurity is stymied due to a paucity of realistic datasets. Organizations are reluctant to share such data, even internally, due to privacy reasons. An alternative is to use synthetically generated data but existing methods are limited in their ability to capture complex dependency structures, between attributes and across time. This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets for subsequent downstream applications. Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time. It integrates convolutional neural layers with mixture density neural layers and softmax layers, and models both continuous and discrete variables. We evaluate the performance of STAN in terms of quality of data generated, by training it on both a simulated dataset and a real network traffic data set. Finally, to answer the question "can real network traffic data be substituted with synthetic data to train models of comparable accuracy?", we train two anomaly detection models based on self-supervision. The results show only a small decline in accuracy of models trained solely on synthetic data. While current results are encouraging in terms of quality of data generated and absence of any obvious data leakage from training data, in the future we plan to further validate this fact by conducting privacy attacks on the generated data. Other future work includes validating capture of long term dependencies and making model training more efficient.

1 Introduction

Cybersecurity has become a key concern for both private and public organizations, given the prevalence of cyber-threats and attacks. In fact, malicious cyber-activity cost the U.S. economy between $57 billion and $109 billion in 2016 [33], and worldwide yearly spending on cybersecurity reached $1.5 trillion in 2018 [29].
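
The STAN abstract mentions pairing mixture density layers (for continuous attributes) with softmax layers (for discrete attributes). As a rough illustration of what such output heads are trained on, the following NumPy sketch computes the two corresponding negative log-likelihood terms. It is not the authors' implementation: the function names, the choice of one-dimensional Gaussian components, and the toy shapes are all assumptions made for this example.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def gaussian_mixture_nll(y, mix_logits, means, log_stds):
    """Per-example negative log-likelihood of a continuous attribute.

    Hypothetical mixture density head: y has shape (batch,), while
    mix_logits, means and log_stds have shape (batch, K) for K components.
    """
    log_weights = log_softmax(mix_logits)
    z = (y[:, None] - means) / np.exp(log_stds)
    log_normal = -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    joint = log_weights + log_normal
    # Log-sum-exp over the K components.
    m = joint.max(axis=-1, keepdims=True)
    log_prob = m[:, 0] + np.log(np.exp(joint - m).sum(axis=-1))
    return -log_prob


def categorical_nll(labels, logits):
    """Per-example cross-entropy of a discrete attribute (softmax head).

    labels has shape (batch,) with integer class ids; logits is (batch, C).
    """
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(labels.shape[0]), labels]


# Toy usage: random head outputs for a batch of 4 records, with one
# continuous attribute (3 mixture components) and one discrete attribute
# (5 classes). The values are placeholders, not real traffic data.
rng = np.random.default_rng(0)
continuous_target = rng.normal(size=4)
discrete_target = rng.integers(0, 5, size=4)
loss = (
    gaussian_mixture_nll(
        continuous_target,
        rng.normal(size=(4, 3)),
        rng.normal(size=(4, 3)),
        rng.normal(size=(4, 3)),
    )
    + categorical_nll(discrete_target, rng.normal(size=(4, 5)))
).mean()
```

In a model of this kind, a shared network body would produce the head inputs, and summing the two terms gives one training loss over mixed continuous and discrete attributes. How STAN itself combines and conditions these terms is described in the paper, not here.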
