A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
Xing Guodong, Senior Product Manager @ Microsoft
WeChat | LinkedIn
A Changing Microsoft
Microsoft Application and Service Group (ASG)
ASG Data Team: Big Data Platform, Data Analytics
What Are We Trying to Solve?
What problems do we need to solve in data quality monitoring?
Data Flow
- Kafka as data bus: scalable pub/sub for NRT data streams
- Sources: services, devices, applications
- Consumers: interactive analytics, batch processing, streaming processing
Fast-Growing Real-Time Data
- 1.3 million events per second ingress at peak
- ~1 trillion events per day processed at peak
- 3.5 petabytes processed per day
- 100 thousand unique devices and machines
- 1,300 production Kafka brokers
- 1 sec 99th-percentile latency
Data Quality Assurance Upstream and Downstream of Kafka
- Pipeline: Producer → Kafka HLC → Destination
- 100K QPS, 300 Gb per hour
- Data == Money; Lost Data == Lost Money
How It Works: An Overview
How It Works
3 audit granularities:
- File level
- Batch level
- Record level
Metadata
{
  "Action": "Produced or Uploaded",
  "ActionTimeStamp": "action date and time (UTC)",
  "Environment": "environment (cluster) name",
  "Machine": "computer name",
  "StreamID": "type of data (sheep, ducks, etc.)",
  "SourceID": "e.g. file name",
  "BatchID": "a hash of data in this batch",
  "NumBytes": "size in bytes",
  "NumRecords": "number of records in the batch",
  "DestinationID": "destination ID"
}
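The schema above can be sketched as a small builder in Python. Note this is an illustrative sketch, not the real client: the function name `make_audit` and the choice of MD5 for BatchID are assumptions (the talk only says BatchID is "a hash of data in this batch").

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit(action, environment, machine, stream_id,
               source_id, records, destination_id):
    """Build one audit record matching the metadata schema above.

    `records` is the list of raw record bytes in this batch; BatchID is
    derived here as an MD5 of the batch contents (a hypothetical choice).
    """
    payload = b"".join(records)
    return {
        "Action": action,                      # "Produced" or "Uploaded"
        "ActionTimeStamp": datetime.now(timezone.utc).isoformat(),
        "Environment": environment,            # cluster name
        "Machine": machine,                    # computer name
        "StreamID": stream_id,                 # type of data
        "SourceID": source_id,                 # e.g. file name
        "BatchID": hashlib.md5(payload).hexdigest(),
        "NumBytes": len(payload),
        "NumRecords": len(records),
        "DestinationID": destination_id,
    }

audit = make_audit("Produced", "cluster-1", "host-42", "ducks",
                   "File 1", [b"rec1", b"rec2", b"rec3"], "dest-1")
print(json.dumps(audit, indent=2))
```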
How It Works – Data and Audit Flow
- A producer in the data center writes File 1 into the Kafka + HLC pipeline under audit, e.g. records 1–3 in one batch and records 4–5 in another.
- Each batch emits a "Produced" audit: "File 1", BatchID (e.g. abc123 / def456), size (e.g. 24 bytes / 40 bytes), record count (3 / 5), timestamp.
- When a batch reaches Destination 1, an "Uploaded" audit is emitted with the same fields (e.g. 24 bytes, 3 records).
- The audit system then reconciles: Produced: file 1: 3 records, Produced: file 1: 5 records, vs. Uploaded: file 1: 3 records.
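The reconciliation step above can be sketched in a few lines of Python. This is a simplified model of what the audit system computes (the real pipeline runs on Spark); the record shapes mirror the audit metadata schema.

```python
from collections import defaultdict

# Audits as they would arrive from the example above: File 1 was
# produced in two batches (3 + 5 records) but only one batch (3
# records) has been uploaded to the destination so far.
audits = [
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 3},
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 5},
    {"SourceID": "File 1", "Action": "Uploaded", "NumRecords": 3},
]

# Sum record counts per file and per action.
totals = defaultdict(lambda: {"Produced": 0, "Uploaded": 0})
for a in audits:
    totals[a["SourceID"]][a["Action"]] += a["NumRecords"]

# Flag files whose destination count falls short of production.
incomplete = {f: t for f, t in totals.items()
              if t["Uploaded"] < t["Produced"]}
print(incomplete)  # File 1: 8 produced, only 3 uploaded so far
```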
Kibana Chart: Data Latency
Kibana Chart: Data Completeness
3 lines:
- Green: how many records were produced
- Blue: how many reached destination #1
- Green: how many reached destination #2
Code to Send an Audit
1. Create a client object
2. Prepare the audit object
3. Lastly, send it: client.SendBondObject(audit);
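The three steps can be sketched as follows. The real client is a Bond-serializing .NET client exposing SendBondObject; the `AuditClient` class, its `send` method, and the endpoint URL below are hypothetical stand-ins used only to show the shape of the flow.

```python
import json

class AuditClient:
    """Hypothetical stand-in for the real Bond audit client."""

    def __init__(self, endpoint):
        self.endpoint = endpoint  # front-end web service URL (illustrative)
        self.sent = []

    def send(self, audit):
        # The real client serializes the audit as a Bond object and
        # posts it to the front end; here we just record the payload.
        self.sent.append(json.dumps(audit))

# 1. Create a client object
client = AuditClient("https://audit-frontend.example/api")
# 2. Prepare the audit object
audit = {"Action": "Produced", "SourceID": "File 1", "NumRecords": 3}
# 3. Lastly, send it
client.send(audit)
```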
APIs for Querying Statistics
Design Overview
Design Goals for the Data Monitoring System
- Monitor the completeness and latency of streaming data
- Cover multi-producer, multi-stage, multi-destination data flows in the pipeline
- In near real time
- Provide diagnostic info: which DC, machine, or event/file has the problem
- Highly reliable: 99.9% uptime, scale out
- Trustworthy audit data
- SLA, QPS, scale…
System Design: Transient Storage (Kafka)
- Producer → Kafka HLC → Destination; "Produced" and "Uploaded" audits go through the Front End Web Service into transient storage (Kafka)
- Kafka ensures high availability
- We don't want ingress/egress to get stuck sending audit information
System Design: Audit Data Processing Pipeline (Spark)
- Consumes audits from transient storage (Kafka)
- Pre-aggregates data into 1-minute chunks
- Keeps track of duplicates and increments, handles late-arriving and out-of-order data, and is fault tolerant
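The core of that Spark stage can be sketched in plain Python: bucket audits into 1-minute windows keyed by stream and action, and drop duplicate BatchIDs, which is also what makes late and out-of-order arrivals safe to count. The function and field shapes are illustrative, not the production code.

```python
from collections import defaultdict

def aggregate(audits):
    """Pre-aggregate audit records into 1-minute chunks.

    Deduplicates on (StreamID, Action, BatchID), so replayed or
    duplicated audits are counted once, and a late-arriving audit
    simply lands in its own (earlier) minute bucket.
    """
    seen = set()                # (stream, action, batch) already counted
    windows = defaultdict(int)  # (minute, stream, action) -> record count
    for a in audits:
        key = (a["StreamID"], a["Action"], a["BatchID"])
        if key in seen:         # duplicate delivery: skip
            continue
        seen.add(key)
        minute = a["Timestamp"] - a["Timestamp"] % 60  # floor to minute
        windows[(minute, a["StreamID"], a["Action"])] += a["NumRecords"]
    return windows

audits = [
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "abc",
     "Timestamp": 65, "NumRecords": 3},
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "abc",
     "Timestamp": 65, "NumRecords": 3},   # duplicate: ignored
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "def",
     "Timestamp": 10, "NumRecords": 5},   # late/out-of-order: earlier bucket
]
w = aggregate(audits)
```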
System Design: Aggregated Data Storage (ElasticSearch)
- Stores pre-aggregated data for reporting through DAS
- Allows for NRT charts using Kibana
System Design: Data Analysis Web Service
- Final reporting endpoint for consumers
- Does the destination have complete data for that time?
- Which files are missing?
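The two questions this endpoint answers reduce to a set difference over the audited window. A minimal sketch, assuming the service can list the SourceIDs produced and uploaded in a window (the function name and shapes are illustrative, not the real API):

```python
def missing_files(produced, uploaded):
    """Which produced files never reached the destination in this window?

    `produced` / `uploaded`: iterables of SourceIDs (e.g. file names)
    extracted from "Produced" and "Uploaded" audits for the window.
    """
    return sorted(set(produced) - set(uploaded))

# Example window: three files produced, one never uploaded.
produced = ["File 1", "File 2", "File 3"]
uploaded = ["File 1", "File 3"]

gap = missing_files(produced, uploaded)
complete = not gap   # destination has complete data iff nothing is missing
```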
High Availability
Components: Front End Web Service, transient storage (Kafka), audit data processing pipeline (Spark), aggregated data storage (ElasticSearch), Data Analysis Web Service
High Availability
- Active-active disaster recovery across DC1 and DC2: each DC runs its own transient storage (Kafka), audit processing pipeline (Spark), and aggregated data storage (ElasticSearch) behind the Front End and Data Analysis Web Services
- Monitoring for each key component
Trustworthy Quality Monitoring
- Audit for audit: the audit pipeline itself (Front End Web Service → Kafka → Spark → ElasticSearch) is audited the same way as the data under audit
- Handles late arrival, out-of-order processing, and duplication
Problem Diagnosis
When loss happens, rich diagnostic info is needed for ad hoc queries
Problem Diagnosis: Raw Audit Storage (Cosmos)
- Store raw audits for diagnostics, alongside the transient storage (Kafka) and the Spark/ElasticSearch pipeline
Goals Recap
- Monitor the completeness and latency of streaming data
- Cover multi-producer, multi-stage, multi-destination data flows in the pipeline
- In near real time
- Provide diagnostic info: which DC, machine, or event/file has the problem
- Highly reliable: 99.9% uptime, scale out
- Trustworthy audit data
- SLA, QPS, scale…
Footprint & SLA
- 40 executors
- 6 machines for Spark HA (3+3)
- 16 Kafka machines (8+8)
- 10 ElasticSearch machines (5+5)
- 72 machines total across 2 DCs
- Linear scale out
- 99.9% availability
- 1 min E2E latency
- 50K QPS
Versions
- Kafka under audit: 0.8.1.1
- Audit pipeline: Kafka 0.8.1.1, Spark 1.6.1, ElasticSearch 1.7.0
- To be open sourced!
The Team
Join Us
WeChat | LinkedIn