Table of Contents
Foreword vii
Preface ix
1. Enterprise Analytics Fundamentals 1
The Analytics Data Pipeline 1
Data Lakes 2
Lambda Architecture 3
Kappa Architecture 5
Choosing Between Lambda and Kappa 6
The Azure Analytics Pipeline 6
Introducing the Analytics Scenarios 9
Example Code and Example Data Sets 11
What You Will Need 11
Broadband Internet Connectivity 11
Azure Subscription 11
Visual Studio 2015 with Update 1 11
Azure SDK 2.8 or Later 15
Summary 16
2. Getting Data into Azure 17
Ingest Loading Layer 17
Bulk Data Loading 19
Disk Shipping 19
End User Tools 35
Network-Oriented Approaches 52
Stream Loading 74
Stream Loading with Event Hubs 75
Summary 76
3. Storing Ingested Data in Azure 77
File-Oriented Storage 77
Blob Storage 79
Azure Data Lake Store 84
HDFS 90
Queue-Oriented Storage 94
Blue Yonder Scenario: Smart Buildings 95
Event Hubs 96
IoT Hub 111
Summary 122
4. Real-Time Processing in Azure 123
Stream Processing 123
Consuming Messages from Event Hubs 125
Tuple-at-a-Time Processing in Azure 129
Introducing HDInsight 129
Storm on HDInsight 129
EventProcessorHost 170
Azure Machine Learning 174
Summary 174
5. Real-Time Micro-Batch Processing in Azure 175
Micro-Batch Processing in Azure 175
Spark Streaming on HDInsight 175
Storm on HDInsight 192
Azure Stream Analytics 199
Summary 206
6. Batch Processing in Azure 207
Batch Processing with MapReduce on HDInsight 209
Apache Hadoop MapReduce 210
Batch Processing with Hive on HDInsight 213
Internal and External Tables 214
Partitioning Tables 214
Views 215
Indexes 215
Databases 216
Using Hive on HDInsight 216
Storage on HDInsight 218
Batch Processing Blue Yonder Airports Data 219
Creating an External Table 220
Creating an Internal Table 225
Batch Processing with Pig on HDInsight 228
Batch Processing with Spark on HDInsight 229
Batch Processing Blue Yonder Airports Data 232
Creating an External Table 233
Batch Processing with SQL Data Warehouse 237
Using SQL Data Warehouse 240
Batch Processing Blue Yonder Airports Data 240
Storing the Credentials to Azure Storage 241
Batch Processing with Data Lake Analytics 247
Using Data Lake Analytics 249
Batch Processing Blue Yonder Airports Data 250
Processing with U-SQL 250
Batch Processing with Azure Batch 258
Orchestrating Batch Processing Pipelines with Azure Data Factory 259
Summary 260
7. Interactive Querying in Azure 261
Interactive Querying with Azure SQL Data Warehouse 263
Partitions and Distributions 263
Indexes 265
Interactive Exploration of the Blue Yonder Airports Data 266
Interactive Querying with Hive and Tez 269
Indexes 271
Partitions 271
Interactive Exploration of the Blue Yonder Airports Data 271
Interactive Querying with Spark SQL 278
Indexes 278
Partitions 278
Interactive Exploration of the Blue Yonder Airports Data 279
Interactive Querying with U-SQL 283
Interactive Exploration of the Blue Yonder Airports Data 283
Summary 285
8. Hot and Cold Path Serving Layer in Azure 287
Azure Redis Cache 290
Redis in the Speed Serving Layer 291
Document DB 296
Document DB in the Speed Serving Layer 299
Document DB in the Batch Serving Layer 302
SQL Database 303
SQL Database in the Speed Serving Layer 305
SQL Database in the Batch Serving Layer 311
SQL Data Warehouse 311
HBase on HDInsight 312
Azure Search 317
Summary 318
9. Intelligence and Machine Learning 319
Azure Machine Learning 322
R Server on HDInsight 324
SQL R Services 325
Microsoft Cognitive Services 326
Summary 338
10. Managing Metadata in Azure 339
Managing Metadata with Azure Data Catalog 339
Data Catalog in the Blue Yonder Airports Scenario 342
Add an Azure Data Lake Store Asset 344
Add Azure Storage Blobs 347
Add a SQL Data Warehouse 352
Summary 355
11. Protecting Your Data in Azure 357
Identity and Access Management 357
Data Protection 359
Auditing 361
Summary 362
12. Performing Analytics 363
Analytics with Power BI 363
Real-Time Power BI in the Blue Yonder Scenario 365
Batch Analytics Reporting with Power BI in the Blue Yonder Scenario 374
A Look Ahead 378
Real Time 378
Lower Batch Latencies 379
IoT 379
Security 379
More Linux 379