基于Kafka和Spark的实时数据质量监控平台

Similar presentations


Presentation on theme: "基于Kafka和Spark的实时数据质量监控平台"— Presentation transcript:

1 基于Kafka和Spark的实时数据质量监控平台
邢国东 Microsoft 微信 LinkedIn

2 改变中的微软

3 微软应用与服务集团(ASG) Microsoft Application and Service Group

4 ASG数据团队 大数据平台 数据分析

5 我们要解决什么问题 数据质量监控中我们要解决什么问题?

6 数据流 Kafka as data bus Scalable pub/sub for NRT data streams Services
Devices Interactive analytics Applications Kafka as data bus Scalable pub/sub for NRT data streams Batch Processing Streaming Processing

7 快速增长的实时数据 1.3 million ~1 trillion 3.5 petabytes 100 thousand 1,300
EVENTS PER SECOND INGRESS AT PEAK ~1 trillion EVENTS PER DAY PROCESSED AT PEAK 3.5 petabytes PROCESSED PER DAY 100 thousand UNIQUE DEVICES AND MACHINES 1,300 PRODUCTION KAFKA BROKERS 1 Sec 99th PERCENTILE LATENCY

8 Kafka上下游的数据质量保证 Destination Destination Lost Data == Lost Money
Producer Producer Producer Kafka HLC Producer Destination Producer Kafka HLC Producer Destination Producer Kafka HLC 100K QPS, 300 Gb per hour Producer Lost Data == Lost Money Data == Money Producer

9 工作原理简介

10 工作原理 3 个审计粒度 文件层级(file) 批次层级(batch) 记录层级 (record level)

11 Metadata { “Action” : “Produced or Uploaded”,
“ActionTimeStamp” : “action date and time (UTC)”, “Environment” : “environment (cluster) name”, “Machine” : “computer name”, “StreamID” : “type of data (sheeps, ducks, etc.)”, “SourceID” : “e.g. file name”, “BatchID” : “a hash of data in this batch”, “NumBytes” : “size in bytes”, “NumRecords” : “number of records in the batch”, “DestinationID” : “destination ID” }

12 工作原理 – 数据与审计流 Data Center Kafka + HLC under audit File 1: Record1
Producer Data Center Kafka + HLC under audit Producer File 1: Record4 Record5 Record1 Record2 Record3 Destination 1 Uploaded 24 bytes 3 records Timestamp BatchID Destination 1 Produced 40 bytes 5 records Timestamp “File 1” BatchID=def456 Produced 24 bytes 3 records Timestamp “File 1” BatchID=abc123 Audit system Produced: file 1: 3 records Produced: file 1: 5 records Uploaded: file 1: 3 records

13 数据时延的Kibana图表

14 数据完整性Kibana图表 3 lines Green how many records produced
Blue: how many reached destination #1 Green: how many reached destination #2

15 发送Audit的代码 client.SendBondObject(audit); Create a client object Lastly
Prepare audit object

16 查询统计信息的APIs

17 设计概述

18 数据监控系统设计需要达成的目标 监控streaming数据的完整性和时延
数据pipeline中,Multi-producer, multi-stage, multi-destination数据流 In near real time 提供诊断信息:哪个DC, 机器, event/file发生问题 超级稳定 99.9% 在线 Scale out 审计数据可信 SLA QPS Scale..

19 Transient storage (Kafka)
系统设计 Destination Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Kafka ensures high-availability We don’t want ingress/egress to stuck sending audit information

20 系统设计 Pre-aggregates data to 1-minute chunks
Destination Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Audit data processing pipeline (Spark) Pre-aggregates data to 1-minute chunks Keeps track of duplicates and increments, handles late arrival, out-of-order data and fault tolerant

21 系统设计 Stores pre-aggregated data for reporting through DAS
Destination Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Stores pre-aggregated data for reporting through DAS Allows for NRT charts using Kibana

22 系统设计 Final reporting endpoint for consumers
Destination Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Data Analysis Web Service Final reporting endpoint for consumers Does destination have complete data for that time? Which files are missing?

23 高可靠性 Front End Web Service Transient storage (Kafka)
Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Data Analysis Web Service

24 高可靠性 DC1 DC2 Active-Active disaster recovery
Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Front End Web Service Data Analysis Web Service Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) DC2 Active-Active disaster recovery Monitor for each key component

25 可信的质量监控 ElasticSearch Audit for audit Destination Producer Kafka HLC
“Produced” audits “Uploaded” audits Destination Late arrival, out of order processing, duplication Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Data Analysis Web Service

26 问题的诊断 Destination Producer Kafka HLC “Produced” audits Front End Web Service “Uploaded” audits Transient storage (Kafka) Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Data Analysis Web Service When loss happens, rich diagnostics info are needed for ad hoc queries

27 问题的诊断 Store raw audits for diagnostics Destination Producer Kafka HLC
“Produced” audits Front End Web Service “Uploaded” audits Audit data processing pipeline (Spark) Aggregated data storage (ElasticSearch) Data Analysis Web Service Transient storage (Kafka) Raw audit storage (Cosmos) Store raw audits for diagnostics

28 目标回顾 监控streaming数据的完整性和时延
数据pipeline中,Multi-producer, multi-stage, multi-destination数据流 In near real time 提供诊断信息:哪个DC, 机器, event/file发生问题 超级稳定 99.9% 在线 Scale out 审计数据可信 SLA QPS Scale..

29 Footprint & SLA 40 executors 6 machines for Spark HA (3+3)
16 Kafka machines (8+8) 10 ElasticSearch machines (5+5) Total 72 machines across 2 DCs Linear scale out 99.9% availability 1 min E2E latency 50K QPS

30 版本 Kafka under audit: 0.8.1.1 Audit pipeline: Kafka 0.8.1.1
Spark 1.6.1 ElasticSearch 1.7.0 To be open sourced!

31 团队

32 欢迎加入我们 微信 LinkedIn


Download ppt "基于Kafka和Spark的实时数据质量监控平台"

Similar presentations


Ads by Google