1
A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
邢国东 (Guodong Xing), Microsoft. WeChat, LinkedIn
2
A Changing Microsoft
3
Microsoft Applications and Services Group (ASG)
4
ASG Data Team: big data platform, data analytics
5
What problem are we solving? What do we need to solve in data quality monitoring?
6
Data flow: Kafka as the data bus, a scalable pub/sub for NRT data streams. Services, devices, and applications publish events into Kafka; streaming processing, batch processing, and interactive analytics consume them from the same bus.
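To make the pub/sub idea concrete, here is a minimal Scala sketch using the standard Kafka consumer API; the broker address, topic name, and group IDs are made up for illustration, and the deck's Kafka 0.8.1.1 (see the versions slide) actually predates this newer consumer client.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Two independent consumer groups (one for streaming processing, one for batch
// processing) read the same topic on the data bus without interfering with each
// other; that is what "scalable pub/sub" means here.
def subscribe(groupId: String): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka-broker-1:9092")  // hypothetical broker
  props.put("group.id", groupId)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("device-events"))  // hypothetical topic
  consumer.poll(1000).asScala.foreach(r => println(s"[$groupId] ${r.value()}"))
  consumer.close()
}

subscribe("streaming-processing")
subscribe("batch-processing")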
7
Fast-growing real-time data:
1.3 million events per second ingress at peak
~1 trillion events per day processed at peak
3.5 petabytes processed per day
100 thousand unique devices and machines
1,300 production Kafka brokers
1 sec 99th-percentile latency
8
Data quality guarantees upstream and downstream of Kafka: producers feed Kafka + HLC, which fans out to destinations, at 100K QPS and 300 Gb per hour. Data == Money, so Lost Data == Lost Money.
9
How it works: an overview
10
How it works: 3 audit granularities: file level, batch level, record level
11
Metadata:
{
  "Action": "Produced or Uploaded",
  "ActionTimeStamp": "action date and time (UTC)",
  "Environment": "environment (cluster) name",
  "Machine": "computer name",
  "StreamID": "type of data (sheeps, ducks, etc.)",
  "SourceID": "e.g. file name",
  "BatchID": "a hash of data in this batch",
  "NumBytes": "size in bytes",
  "NumRecords": "number of records in the batch",
  "DestinationID": "destination ID"
}
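One natural way to model such an audit message in code is a simple record type; the following Scala case class mirrors the fields above but is only an illustration, not the system's actual schema.

// Illustrative model of one audit message, mirroring the metadata fields above.
case class AuditRecord(
  action: String,           // "Produced" or "Uploaded"
  actionTimeStamp: String,  // action date and time (UTC)
  environment: String,      // environment (cluster) name
  machine: String,          // computer name
  streamID: String,         // type of data (sheeps, ducks, etc.)
  sourceID: String,         // e.g. file name
  batchID: String,          // a hash of the data in this batch
  numBytes: Long,           // size in bytes
  numRecords: Long,         // number of records in the batch
  destinationID: String     // destination ID
)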
12
How it works: data and audit flow. In a data center, a producer writing File 1 through the Kafka + HLC pipeline under audit emits a "Produced" audit for each batch, for example 3 records / 24 bytes with BatchID=abc123 and 5 records / 40 bytes with BatchID=def456, each with a timestamp. When Destination 1 receives data it emits an "Uploaded" audit (3 records, 24 bytes, timestamp, BatchID). The audit system therefore sees: Produced: file 1: 3 records; Produced: file 1: 5 records; Uploaded: file 1: 3 records, and can compare what was produced with what each destination actually received.
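The view the audit system builds from this example can be reproduced with a toy grouping; the Audit type and the BatchID on the uploaded side are illustrative assumptions, not the real pipeline's data model.

// Toy version of the audit system's view of File 1 in the example above.
case class Audit(action: String, sourceID: String, batchID: String, numRecords: Long)

val audits = Seq(
  Audit("Produced", "File 1", "abc123", 3),
  Audit("Produced", "File 1", "def456", 5),
  Audit("Uploaded", "File 1", "abc123", 3)   // uploaded batch ID assumed for illustration
)

// Group by file and action, which is exactly the summary shown on the slide.
audits.groupBy(a => (a.sourceID, a.action)).toSeq.sortBy(_._1).foreach {
  case ((file, action), as) =>
    as.foreach(a => println(s"$action: $file: ${a.numRecords} records (batch ${a.batchID})"))
}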
13
Kibana chart of data latency
14
Kibana chart of data completeness, with 3 lines. Green: how many records were produced. Blue: how many reached destination #1. A third line: how many reached destination #2.
15
Code for sending an audit:
Create a client object
Prepare the audit object
Lastly, send it: client.SendBondObject(audit);
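A sketch of those three steps in Scala follows. The AuditClient class, its endpoint, and the sendBondObject method are placeholders for whatever the (to-be-open-sourced) client library actually exposes, and the audit reuses the illustrative AuditRecord model from the metadata slide.

// Hypothetical client wrapper; names and endpoint are placeholders, not the real API.
class AuditClient(frontEndUrl: String) {
  def sendBondObject(audit: AuditRecord): Unit = {
    // The real client would Bond-serialize the audit and POST it to the
    // Front End Web Service, which buffers it in the transient Kafka cluster.
    println(s"POST $frontEndUrl -> $audit")
  }
}

// 1. Create a client object (endpoint made up for illustration).
val client = new AuditClient("https://audit-frontend.example.com/api/audits")

// 2. Prepare the audit object.
val audit = AuditRecord(
  action = "Produced", actionTimeStamp = "2016-10-01T12:00:00Z",
  environment = "prod-cluster-1", machine = "machine-042",
  streamID = "ducks", sourceID = "file-0001.log",
  batchID = "abc123", numBytes = 24, numRecords = 3, destinationID = "dest-1")

// 3. Lastly, send it.
client.sendBondObject(audit)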
16
APIs for querying statistics
17
Design overview
18
Design goals for the data monitoring system:
Monitor the completeness and latency of streaming data
Handle multi-producer, multi-stage, multi-destination data flows in the pipeline
Work in near real time
Provide diagnostics: which DC, machine, event/file has the problem
Extremely reliable: 99.9% online
Scale out
Trustworthy audit data
Meet SLA, QPS, and scale requirements
19
System design: transient storage (Kafka). Producers send "Produced" audits and destinations send "Uploaded" audits to a Front End Web Service, which buffers them in a transient Kafka cluster. Kafka ensures high availability, and we don't want ingress/egress to get stuck sending audit information.
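A minimal sketch of that buffering step, assuming the Front End Web Service hands each received audit to an asynchronous Kafka producer so the caller is never blocked; the class, broker address, and topic name are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

// Hypothetical handler inside the Front End Web Service: it only enqueues the audit
// into the transient Kafka cluster and returns immediately, so data ingress/egress
// never gets stuck on the audit path.
class AuditIngress(brokers: String) {
  private val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  def accept(auditJson: String): Unit = {
    // send() is asynchronous; the callback only logs failures and never blocks the caller.
    producer.send(new ProducerRecord("audits", auditJson), new Callback {
      override def onCompletion(md: RecordMetadata, e: Exception): Unit =
        if (e != null) System.err.println(s"audit enqueue failed: ${e.getMessage}")
    })
  }
}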
20
System design: audit data processing pipeline (Spark). The Spark pipeline reads audits from the transient Kafka cluster and pre-aggregates the data into 1-minute chunks. It keeps track of duplicates and increments, handles late-arriving and out-of-order data, and is fault tolerant.
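The following is a minimal sketch of what such a 1-minute pre-aggregation job could look like with the Spark 1.6-era DStream API and the direct Kafka connector; the topic name, broker address, message format, and the crude dedup-by-BatchID are all assumptions standing in for the real pipeline's logic.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object AuditAggregator {
  // Stand-in parser: assumes each audit arrives as "streamID,action,batchID,numRecords".
  // Real messages are richer (see the metadata slide); this keeps the sketch self-contained.
  def parseAudit(line: String): ((String, String, String), Long) = {
    val Array(stream, action, batch, n) = line.split(",")
    ((stream, action, batch), n.toLong)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("audit-aggregator"), Seconds(10))
    ssc.checkpoint("/tmp/audit-checkpoints")  // required for windowed/stateful operations

    val kafkaParams = Map("metadata.broker.list" -> "kafka-broker-1:9092")  // hypothetical broker
    val audits = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("audits")).map(_._2).map(parseAudit)

    // Keep a single count per (streamID, action, batchID) within the window as a crude
    // dedup, then roll the record counts up into 1-minute chunks per stream and action.
    val perMinute = audits
      .reduceByKeyAndWindow((a: Long, b: Long) => a, Minutes(1), Minutes(1))
      .map { case ((stream, action, _), n) => ((stream, action), n) }
      .reduceByKey(_ + _)

    perMinute.print()  // the real pipeline writes these aggregates to ElasticSearch
    ssc.start()
    ssc.awaitTermination()
  }
}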
21
System design: aggregated data storage (ElasticSearch). ElasticSearch stores the pre-aggregated data for reporting through DAS (the Data Analysis Web Service) and allows for NRT charts using Kibana.
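As a sketch of that sink, the elasticsearch-hadoop connector can write each micro-batch of 1-minute aggregates from the Spark job into an index that Kibana charts; the index name and document shape here are assumptions, and the connector also needs es.nodes configured on the SparkConf.

import org.apache.spark.streaming.dstream.DStream
import org.elasticsearch.spark._  // elasticsearch-hadoop: adds saveToEs to RDDs

// Illustrative sink for the 1-minute aggregates produced by the Spark job above.
def writeAggregates(perMinute: DStream[((String, String), Long)]): Unit =
  perMinute.foreachRDD { (rdd, time) =>
    rdd.map { case ((stream, action), records) =>
      Map(
        "streamID"   -> stream,
        "action"     -> action,            // "Produced" or "Uploaded"
        "numRecords" -> records,
        "minute"     -> time.milliseconds  // window timestamp, used by Kibana for NRT charts
      )
    }.saveToEs("audit-aggregates/minute")  // hypothetical index/type
  }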
22
System design: Data Analysis Web Service. The Data Analysis Web Service is the final reporting endpoint for consumers. It answers questions such as: does a destination have complete data for that time? Which files are missing?
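A toy sketch of the completeness check such an endpoint might run is shown below; the data structures, and the idea that per-file produced/uploaded counts come from the aggregated storage, are assumptions for illustration.

// Toy completeness check over per-file record counts for one destination and time range.
case class Gap(file: String, produced: Long, uploaded: Long)

def findGaps(produced: Map[String, Long], uploaded: Map[String, Long]): Seq[Gap] =
  produced.toSeq.collect {
    case (file, p) if uploaded.getOrElse(file, 0L) < p =>
      Gap(file, p, uploaded.getOrElse(file, 0L))
  }

// Example: file-2 is missing entirely and file-1 is short by 5 records.
val gaps = findGaps(
  produced = Map("file-1" -> 8L, "file-2" -> 4L),
  uploaded = Map("file-1" -> 3L))

gaps.foreach(g => println(s"${g.file}: produced=${g.produced}, uploaded=${g.uploaded}"))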
23
High availability: the key components are the Front End Web Service, transient storage (Kafka), the audit data processing pipeline (Spark), aggregated data storage (ElasticSearch), and the Data Analysis Web Service.
24
High availability: the full stack (Front End Web Service, transient Kafka, Spark pipeline, ElasticSearch, Data Analysis Web Service) is deployed in both DC1 and DC2 in an active-active disaster recovery setup, with monitoring for each key component.
25
Trustworthy quality monitoring: "audit for audit". The monitoring pipeline itself (producers, Kafka + HLC, destinations, through to ElasticSearch) is audited, and late arrival, out-of-order processing, and duplication are handled explicitly so that the reported numbers can be trusted.
26
Diagnosing problems: when loss happens, rich diagnostic information is needed for ad hoc queries.
27
Diagnosing problems: raw audits are also stored in raw audit storage (Cosmos) for diagnostics.
28
Goals revisited:
Monitor the completeness and latency of streaming data
Handle multi-producer, multi-stage, multi-destination data flows in the pipeline
Work in near real time
Provide diagnostics: which DC, machine, event/file has the problem
Extremely reliable: 99.9% online
Scale out
Trustworthy audit data
Meet SLA, QPS, and scale requirements
29
Footprint & SLA 40 executors 6 machines for Spark HA (3+3)
16 Kafka machines (8+8) 10 ElasticSearch machines (5+5) Total 72 machines across 2 DCs Linear scale out 99.9% availability 1 min E2E latency 50K QPS
30
Versions. Kafka under audit: 0.8.1.1. Audit pipeline: Kafka 0.8.1.1, Spark 1.6.1, ElasticSearch 1.7.0. To be open sourced!
31
The team
32
Join us! WeChat, LinkedIn