Mastering Azure Analytics: Architecting in the Cloud with Azure Data Lake, HDInsight, and Spark

412 Pages · 2019 · 31.3467 MB · English

Description:

Table of Contents

Foreword vii

Preface ix

1. Enterprise Analytics Fundamentals 1

The Analytics Data Pipeline 1

Data Lakes 2

Lambda Architecture 3

Kappa Architecture 5

Choosing Between Lambda and Kappa 6

The Azure Analytics Pipeline 6

Introducing the Analytics Scenarios 9

Example Code and Example Data Sets 11

What You Will Need 11

Broadband Internet Connectivity 11

Azure Subscription 11

Visual Studio 2015 with Update 1 11

Azure SDK 2.8 or Later 15

Summary 16

2. Getting Data into Azure 17

Ingest Loading Layer 17

Bulk Data Loading 19

Disk Shipping 19

End User Tools 35

Network-Oriented Approaches 52

Stream Loading 74

Stream Loading with Event Hubs 75

Summary 76

3. Storing Ingested Data in Azure 77

File-Oriented Storage 77

Blob Storage 79

Azure Data Lake Store 84

HDFS 90

Queue-Oriented Storage 94

Blue Yonder Scenario: Smart Buildings 95

Event Hubs 96

IoT Hub 111

Summary 122

4. Real-Time Processing in Azure 123

Stream Processing 123

Consuming Messages from Event Hubs 125

Tuple-at-a-Time Processing in Azure 129

Introducing HDInsight 129

Storm on HDInsight 129

EventProcessorHost 170

Azure Machine Learning 174

Summary 174

5. Real-Time Micro-Batch Processing in Azure 175

Micro-Batch Processing in Azure 175

Spark Streaming on HDInsight 175

Storm on HDInsight 192

Azure Stream Analytics 199

Summary 206

6. Batch Processing in Azure 207

Batch Processing with MapReduce on HDInsight 209

Apache Hadoop MapReduce 210

Batch Processing with Hive on HDInsight 213

Internal and External Tables 214

Partitioning Tables 214

Views 215

Indexes 215

Databases 216

Using Hive on HDInsight 216

Storage on HDInsight 218

Batch Processing Blue Yonder Airports Data 219

Creating an External Table 220

Creating an Internal Table 225

Batch Processing with Pig on HDInsight 228

Batch Processing with Spark on HDInsight 229

Batch Processing Blue Yonder Airports Data 232

Creating an External Table 233

Batch Processing with SQL Data Warehouse 237

Using SQL Data Warehouse 240

Batch Processing Blue Yonder Airports Data 240

Storing the Credentials to Azure Storage 241

Batch Processing with Data Lake Analytics 247

Using Data Lake Analytics 249

Batch Processing Blue Yonder Airports Data 250

Processing with U-SQL 250

Batch Processing with Azure Batch 258

Orchestrating Batch Processing Pipelines with Azure Data Factory 259

Summary 260

7. Interactive Querying in Azure 261

Interactive Querying with Azure SQL Data Warehouse 263

Partitions and Distributions 263

Indexes 265

Interactive Exploration of the Blue Yonder Airports Data 266

Interactive Querying with Hive and Tez 269

Indexes 271

Partitions 271

Interactive Exploration of the Blue Yonder Airports Data 271

Interactive Querying with Spark SQL 278

Indexes 278

Partitions 278

Interactive Exploration of the Blue Yonder Airports Data 279

Interactive Querying with U-SQL 283

Interactive Exploration of the Blue Yonder Airports Data 283

Summary 285

8. Hot and Cold Path Serving Layer in Azure 287

Azure Redis Cache 290

Redis in the Speed Serving Layer 291

Document DB 296

Document DB in the Speed Serving Layer 299

Document DB in the Batch Serving Layer 302

SQL Database 303

SQL Database in the Speed Serving Layer 305

SQL Database in the Batch Serving Layer 311

SQL Data Warehouse 311

HBase on HDInsight 312

Azure Search 317

Summary 318

9. Intelligence and Machine Learning 319

Azure Machine Learning 322

R Server on HDInsight 324

SQL R Services 325

Microsoft Cognitive Services 326

Summary 338

10. Managing Metadata in Azure 339

Managing Metadata with Azure Data Catalog 339

Data Catalog in the Blue Yonder Airports Scenario 342

Add an Azure Data Lake Store Asset 344

Add Azure Storage Blobs 347

Add a SQL Data Warehouse 352

Summary 355

11. Protecting Your Data in Azure 357

Identity and Access Management 357

Data Protection 359

Auditing 361

Summary 362

12. Performing Analytics 363

Analytics with Power BI 363

Real-Time Power BI in the Blue Yonder Scenario 365

Batch Analytics Reporting with Power BI in the Blue Yonder Scenario 374

A Look Ahead 378

Real Time 378

Lower Batch Latencies 379

IoT 379

Security 379

More Linux 379
