Scalable Data-Intensive Processing for Science on Clouds: A-Brain and Z-CloudFlow
Lessons Learned and Future Directions

Gabriel Antoniu, Inria
Joint work with Radu Tudoran, Benoit Da Mota, Alexandru Costan, Elena Apostol, Bertrand Thirion (co-PI for A-Brain), Ji Liu, Luis Pineda, Esther Pacitti, Patrick Valduriez (co-PI for Z-CloudFlow) and the Microsoft Azure team from MSR ATL Europe

EIT Digital Future Cloud Symposium, Rennes, 19-20 October 2015

Inria Teams Involved in Cloud-Related Projects of the MSR-Inria Joint Centre
• Teams
  – KERDATA: Data Storage and Processing
  – PARIETAL: Neuroimaging
  – ZENITH: Scientific Data Management
• Inria centres shown on the map
  – INRIA Lille Nord Europe
  – INRIA Paris Rocquencourt
  – INRIA Nancy Grand Est
  – INRIA Rennes Bretagne Atlantique
  – INRIA Saclay Île-de-France
  – INRIA Grenoble Rhône-Alpes
  – INRIA Bordeaux Sud-Ouest
  – INRIA Sophia Antipolis Méditerranée

KerData’s Focus: How to efficiently store and share data at large scale for next-generation, data-intensive applications?
• Scientific challenges
  • Massive data
  • Geographically distributed
  • Fine-grain access (MB) for reading and writing
  • High concurrency
  • Without locking
• Major goal: high throughput under heavy concurrency
• Our contribution
  – Design and implementation of distributed algorithms
  – Validation with real apps on real platforms with real users

Motivating Application: A-Brain
Detect risk factors for brain diseases
Finding associations: p(brain image, genetic data), 10^6 × 10^6
• Genetic data
  – DNA array (SNP/CNV)
  – gene expression data
  – others...
• Brain image
  – Anatomical MRI
  – Functional MRI
  – Diffusion MRI
>2000 subjects
IEEE Cluster’15, Chicago, USA, 10 September 2015

Approach: A-Brain as Map-Reduce Processing

Challenges: Overview
High-Performance Big Data Management Across Cloud Data Centers
• Multi-site MapReduce
• Scaling the processing
• Enabling scientific discovery
• Enabling large-scale scientific processing
• Data management across sites
• Optimize inter-site transfers
• High-performance streaming
• Configurable cost-performance tradeoffs
• Cloud-provided Transfers Service
• Streaming across cloud sites

Data Management on Public Clouds
[Diagram: Cloud Compute Nodes, Cloud-provided storage service]
Computation-to-data latency is high!

TomusBlobs: Leverage Virtual Disks
• Collocating computation and data in PaaS clouds:
• Federate virtual disks of compute nodes
• Self-configuration, automatic deployment and scaling of the data management system
• Apply to MapReduce and Workflow processing

Leveraging TomusBlobs for MapReduce Processing
[Diagram: Client, Azure Queues, Map tasks, Reduce tasks]
• New MapReduce prototype (no Hadoop at that point on Azure)
• Relies on versioning to support high throughput under heavy concurrency, leveraging BlobSeer (KerData, Inria, Rennes)
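The queue-driven MapReduce layout on the last slide can be sketched roughly as follows. This is a minimal single-process illustration, not the prototype itself: `LocalStore`, `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names, Python's in-process `queue.Queue` stands in for Azure Queues, and a lock-protected dictionary stands in for the storage layer built from the compute nodes' virtual disks. The versioning-based concurrency control mentioned on the slide is not reproduced here.

```python
import queue
import threading
from collections import defaultdict


class LocalStore:
    """Hypothetical stand-in for storage federated from node-local disks.

    A plain lock is used for simplicity; this is an assumption of the
    sketch, not the versioning scheme mentioned on the slide.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data[key]

    def keys(self, prefix):
        with self._lock:
            return sorted(k for k in self._data if k.startswith(prefix))


def _run_workers(target, n):
    # Start n worker threads and wait for all of them to finish.
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_mapreduce(inputs, map_fn, reduce_fn, n_mappers=3, n_reducers=2):
    store = LocalStore()
    map_tasks = queue.Queue()  # in-process stand-in for a task queue
    for i, chunk in enumerate(inputs):
        store.put(f"in/{i}", chunk)
        map_tasks.put(i)

    def mapper():
        while True:
            try:
                i = map_tasks.get_nowait()
            except queue.Empty:
                return
            # Partition intermediate pairs by target reducer.
            parts = defaultdict(list)
            for k, v in map_fn(store.get(f"in/{i}")):
                parts[hash(k) % n_reducers].append((k, v))
            for r in range(n_reducers):
                store.put(f"mid/{r}/{i}", parts[r])

    _run_workers(mapper, n_mappers)

    reduce_tasks = queue.Queue()
    for r in range(n_reducers):
        reduce_tasks.put(r)

    def reducer():
        while True:
            try:
                r = reduce_tasks.get_nowait()
            except queue.Empty:
                return
            # Group all values written for this partition by key.
            groups = defaultdict(list)
            for key in store.keys(f"mid/{r}/"):
                for k, v in store.get(key):
                    groups[k].append(v)
            store.put(f"out/{r}", {k: reduce_fn(k, vs) for k, vs in groups.items()})

    _run_workers(reducer, n_reducers)

    # Merge the per-reducer outputs into one result.
    result = {}
    for key in store.keys("out/"):
        result.update(store.get(key))
    return result


def wc_map(text):
    return [(word, 1) for word in text.split()]


def wc_reduce(word, counts):
    return sum(counts)


counts = run_mapreduce(["a b a", "b c", "a"], wc_map, wc_reduce)
# counts == {"a": 3, "b": 2, "c": 1}
```

In this sketch the mappers and reducers only ever exchange data through the store, and only ever receive work through queues, which mirrors the separation between the Client, the queues and the Map/Reduce workers in the slide's diagram.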