1
A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
邢国东 (Guodong Xing), Microsoft. WeChat, LinkedIn
2
A Changing Microsoft
3
Microsoft Applications and Services Group (ASG)
4
ASG Data Team: big data platform, data analytics
5
What problem are we solving? What do we need to solve in data quality monitoring?
6
Data flow: Kafka as the data bus, a scalable pub/sub for NRT data streams. Services, devices, and applications publish events into Kafka; streaming processing, batch processing, and interactive analytics consume them from the same bus.
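To make the pub/sub idea concrete, here is a minimal Scala sketch using the standard Kafka consumer API; the broker address, topic name, and group IDs are made up for illustration, and the deck's Kafka 0.8.1.1 (see the versions slide) actually predates this newer consumer client.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Two independent consumer groups (one for streaming processing, one for batch
// processing) read the same topic on the data bus without interfering with each
// other; that is what "scalable pub/sub" means here.
def subscribe(groupId: String): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka-broker-1:9092")  // hypothetical broker
  props.put("group.id", groupId)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("device-events"))  // hypothetical topic
  consumer.poll(1000).asScala.foreach(r => println(s"[$groupId] ${r.value()}"))
  consumer.close()
}

subscribe("streaming-processing")
subscribe("batch-processing")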
7
Fast-growing real-time data:
1.3 million events per second ingress at peak
~1 trillion events per day processed at peak
3.5 petabytes processed per day
100 thousand unique devices and machines
1,300 production Kafka brokers
1 sec 99th-percentile latency
8
Data quality guarantees upstream and downstream of Kafka: producers feed Kafka + HLC, which fans out to destinations, at 100K QPS and 300 Gb per hour. Data == Money, so Lost Data == Lost Money.
9
How it works: an overview
10
How it works: 3 audit granularities: file level, batch level, record level
11
Metadata:
{
  "Action": "Produced or Uploaded",
  "ActionTimeStamp": "action date and time (UTC)",
  "Environment": "environment (cluster) name",
  "Machine": "computer name",
  "StreamID": "type of data (sheeps, ducks, etc.)",
  "SourceID": "e.g. file name",
  "BatchID": "a hash of data in this batch",
  "NumBytes": "size in bytes",
  "NumRecords": "number of records in the batch",
  "DestinationID": "destination ID"
}
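One natural way to model such an audit message in code is a simple record type; the following Scala case class mirrors the fields above but is only an illustration, not the system's actual schema.

// Illustrative model of one audit message, mirroring the metadata fields above.
case class AuditRecord(
  action: String,           // "Produced" or "Uploaded"
  actionTimeStamp: String,  // action date and time (UTC)
  environment: String,      // environment (cluster) name
  machine: String,          // computer name
  streamID: String,         // type of data (sheeps, ducks, etc.)
  sourceID: String,         // e.g. file name
  batchID: String,          // a hash of the data in this batch
  numBytes: Long,           // size in bytes
  numRecords: Long,         // number of records in the batch
  destinationID: String     // destination ID
)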
12
How it works: data and audit flow. In a data center, a producer writing File 1 through the Kafka + HLC pipeline under audit emits a "Produced" audit for each batch, for example 3 records / 24 bytes with BatchID=abc123 and 5 records / 40 bytes with BatchID=def456, each with a timestamp. When Destination 1 receives data it emits an "Uploaded" audit (3 records, 24 bytes, timestamp, BatchID). The audit system therefore sees: Produced: file 1: 3 records; Produced: file 1: 5 records; Uploaded: file 1: 3 records, and can compare what was produced with what each destination actually received.
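The view the audit system builds from this example can be reproduced with a toy grouping; the Audit type and the BatchID on the uploaded side are illustrative assumptions, not the real pipeline's data model.

// Toy version of the audit system's view of File 1 in the example above.
case class Audit(action: String, sourceID: String, batchID: String, numRecords: Long)

val audits = Seq(
  Audit("Produced", "File 1", "abc123", 3),
  Audit("Produced", "File 1", "def456", 5),
  Audit("Uploaded", "File 1", "abc123", 3)   // uploaded batch ID assumed for illustration
)

// Group by file and action, which is exactly the summary shown on the slide.
audits.groupBy(a => (a.sourceID, a.action)).toSeq.sortBy(_._1).foreach {
  case ((file, action), as) =>
    as.foreach(a => println(s"$action: $file: ${a.numRecords} records (batch ${a.batchID})"))
}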
13
Kibana chart of data latency
14
Kibana chart of data completeness, with 3 lines. Green: how many records were produced. Blue: how many reached destination #1. A third line: how many reached destination #2.
15
Code for sending an audit:
Create a client object
Prepare the audit object
Lastly, send it: client.SendBondObject(audit);
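A sketch of those three steps in Scala follows. The AuditClient class, its endpoint, and the sendBondObject method are placeholders for whatever the (to-be-open-sourced) client library actually exposes, and the audit reuses the illustrative AuditRecord model from the metadata slide.

// Hypothetical client wrapper; names and endpoint are placeholders, not the real API.
class AuditClient(frontEndUrl: String) {
  def sendBondObject(audit: AuditRecord): Unit = {
    // The real client would Bond-serialize the audit and POST it to the
    // Front End Web Service, which buffers it in the transient Kafka cluster.
    println(s"POST $frontEndUrl -> $audit")
  }
}

// 1. Create a client object (endpoint made up for illustration).
val client = new AuditClient("https://audit-frontend.example.com/api/audits")

// 2. Prepare the audit object.
val audit = AuditRecord(
  action = "Produced", actionTimeStamp = "2016-10-01T12:00:00Z",
  environment = "prod-cluster-1", machine = "machine-042",
  streamID = "ducks", sourceID = "file-0001.log",
  batchID = "abc123", numBytes = 24, numRecords = 3, destinationID = "dest-1")

// 3. Lastly, send it.
client.sendBondObject(audit)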
16
APIs for querying statistics
17
Design overview
18
Design goals for the data monitoring system:
Monitor the completeness and latency of streaming data
Handle multi-producer, multi-stage, multi-destination data flows in the pipeline
Work in near real time
Provide diagnostics: which DC, machine, event/file has the problem
Extremely reliable: 99.9% online
Scale out
Trustworthy audit data
Meet SLA, QPS, and scale requirements
19
System design: transient storage (Kafka). Producers send "Produced" audits and destinations send "Uploaded" audits to a Front End Web Service, which buffers them in a transient Kafka cluster. Kafka ensures high availability, and we don't want ingress/egress to get stuck sending audit information.
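A minimal sketch of that buffering step, assuming the Front End Web Service hands each received audit to an asynchronous Kafka producer so the caller is never blocked; the class, broker address, and topic name are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

// Hypothetical handler inside the Front End Web Service: it only enqueues the audit
// into the transient Kafka cluster and returns immediately, so data ingress/egress
// never gets stuck on the audit path.
class AuditIngress(brokers: String) {
  private val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  def accept(auditJson: String): Unit = {
    // send() is asynchronous; the callback only logs failures and never blocks the caller.
    producer.send(new ProducerRecord("audits", auditJson), new Callback {
      override def onCompletion(md: RecordMetadata, e: Exception): Unit =
        if (e != null) System.err.println(s"audit enqueue failed: ${e.getMessage}")
    })
  }
}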
20
System design: audit data processing pipeline (Spark). The Spark pipeline reads audits from the transient Kafka cluster and pre-aggregates the data into 1-minute chunks. It keeps track of duplicates and increments, handles late-arriving and out-of-order data, and is fault tolerant.
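The following is a minimal sketch of what such a 1-minute pre-aggregation job could look like with the Spark 1.6-era DStream API and the direct Kafka connector; the topic name, broker address, message format, and the crude dedup-by-BatchID are all assumptions standing in for the real pipeline's logic.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object AuditAggregator {
  // Stand-in parser: assumes each audit arrives as "streamID,action,batchID,numRecords".
  // Real messages are richer (see the metadata slide); this keeps the sketch self-contained.
  def parseAudit(line: String): ((String, String, String), Long) = {
    val Array(stream, action, batch, n) = line.split(",")
    ((stream, action, batch), n.toLong)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("audit-aggregator"), Seconds(10))
    ssc.checkpoint("/tmp/audit-checkpoints")  // required for windowed/stateful operations

    val kafkaParams = Map("metadata.broker.list" -> "kafka-broker-1:9092")  // hypothetical broker
    val audits = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("audits")).map(_._2).map(parseAudit)

    // Keep a single count per (streamID, action, batchID) within the window as a crude
    // dedup, then roll the record counts up into 1-minute chunks per stream and action.
    val perMinute = audits
      .reduceByKeyAndWindow((a: Long, b: Long) => a, Minutes(1), Minutes(1))
      .map { case ((stream, action, _), n) => ((stream, action), n) }
      .reduceByKey(_ + _)

    perMinute.print()  // the real pipeline writes these aggregates to ElasticSearch
    ssc.start()
    ssc.awaitTermination()
  }
}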
21
System design: aggregated data storage (ElasticSearch). ElasticSearch stores the pre-aggregated data for reporting through DAS (the Data Analysis Web Service) and allows for NRT charts using Kibana.
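As a sketch of that sink, the elasticsearch-hadoop connector can write each micro-batch of 1-minute aggregates from the Spark job into an index that Kibana charts; the index name and document shape here are assumptions, and the connector also needs es.nodes configured on the SparkConf.

import org.apache.spark.streaming.dstream.DStream
import org.elasticsearch.spark._  // elasticsearch-hadoop: adds saveToEs to RDDs

// Illustrative sink for the 1-minute aggregates produced by the Spark job above.
def writeAggregates(perMinute: DStream[((String, String), Long)]): Unit =
  perMinute.foreachRDD { (rdd, time) =>
    rdd.map { case ((stream, action), records) =>
      Map(
        "streamID"   -> stream,
        "action"     -> action,            // "Produced" or "Uploaded"
        "numRecords" -> records,
        "minute"     -> time.milliseconds  // window timestamp, used by Kibana for NRT charts
      )
    }.saveToEs("audit-aggregates/minute")  // hypothetical index/type
  }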
22
System design: Data Analysis Web Service. The Data Analysis Web Service is the final reporting endpoint for consumers. It answers questions such as: does a destination have complete data for that time? Which files are missing?
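A toy sketch of the completeness check such an endpoint might run is shown below; the data structures, and the idea that per-file produced/uploaded counts come from the aggregated storage, are assumptions for illustration.

// Toy completeness check over per-file record counts for one destination and time range.
case class Gap(file: String, produced: Long, uploaded: Long)

def findGaps(produced: Map[String, Long], uploaded: Map[String, Long]): Seq[Gap] =
  produced.toSeq.collect {
    case (file, p) if uploaded.getOrElse(file, 0L) < p =>
      Gap(file, p, uploaded.getOrElse(file, 0L))
  }

// Example: file-2 is missing entirely and file-1 is short by 5 records.
val gaps = findGaps(
  produced = Map("file-1" -> 8L, "file-2" -> 4L),
  uploaded = Map("file-1" -> 3L))

gaps.foreach(g => println(s"${g.file}: produced=${g.produced}, uploaded=${g.uploaded}"))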
23
High availability: the key components are the Front End Web Service, transient storage (Kafka), the audit data processing pipeline (Spark), aggregated data storage (ElasticSearch), and the Data Analysis Web Service.
24
High availability: the full stack (Front End Web Service, transient Kafka, Spark pipeline, ElasticSearch, Data Analysis Web Service) is deployed in both DC1 and DC2 in an active-active disaster recovery setup, with monitoring for each key component.
25
Trustworthy quality monitoring: "audit for audit". The monitoring pipeline itself (producers, Kafka + HLC, destinations, through to ElasticSearch) is audited, and late arrival, out-of-order processing, and duplication are handled explicitly so that the reported numbers can be trusted.
26
Diagnosing problems: when loss happens, rich diagnostic information is needed for ad hoc queries.
27
Diagnosing problems: raw audits are also stored in raw audit storage (Cosmos) for diagnostics.
28
Goals revisited:
Monitor the completeness and latency of streaming data
Handle multi-producer, multi-stage, multi-destination data flows in the pipeline
Work in near real time
Provide diagnostics: which DC, machine, event/file has the problem
Extremely reliable: 99.9% online
Scale out
Trustworthy audit data
Meet SLA, QPS, and scale requirements
29
Footprint & SLA 40 executors 6 machines for Spark HA (3+3)
16 Kafka machines (8+8) 10 ElasticSearch machines (5+5) Total 72 machines across 2 DCs Linear scale out 99.9% availability 1 min E2E latency 50K QPS
30
Versions. Kafka under audit: 0.8.1.1. Audit pipeline: Kafka 0.8.1.1, Spark 1.6.1, ElasticSearch 1.7.0. To be open sourced!
31
The team
32
Join us! WeChat, LinkedIn