A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark
Xing Guodong, Senior Product Manager @ Microsoft
WeChat | LinkedIn
A Changing Microsoft
Microsoft Application and Service Group (ASG)
ASG Data Team: Big Data Platform, Data Analytics
What Are We Trying to Solve?
What problems do we need to solve in data quality monitoring?
Data Flow
- Kafka as data bus: scalable pub/sub for NRT data streams
- Sources: services, devices, applications
- Consumers: interactive analytics, batch processing, streaming processing
Fast-Growing Real-Time Data
- 1.3 million events per second ingress at peak
- ~1 trillion events per day processed at peak
- 3.5 petabytes processed per day
- 100 thousand unique devices and machines
- 1,300 production Kafka brokers
- 1 sec 99th-percentile latency
Data Quality Assurance Upstream and Downstream of Kafka
- Pipeline: Producer → Kafka HLC → Destination
- 100K QPS, 300 Gb per hour
- Data == Money; Lost Data == Lost Money
How It Works: An Overview
How It Works
3 audit granularities:
- File level
- Batch level
- Record level
Metadata
{
  "Action": "Produced or Uploaded",
  "ActionTimeStamp": "action date and time (UTC)",
  "Environment": "environment (cluster) name",
  "Machine": "computer name",
  "StreamID": "type of data (sheep, ducks, etc.)",
  "SourceID": "e.g. file name",
  "BatchID": "a hash of data in this batch",
  "NumBytes": "size in bytes",
  "NumRecords": "number of records in the batch",
  "DestinationID": "destination ID"
}
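The schema above can be sketched as a small builder in Python. Note this is an illustrative sketch, not the real client: the function name `make_audit` and the choice of MD5 for BatchID are assumptions (the talk only says BatchID is "a hash of data in this batch").

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit(action, environment, machine, stream_id,
               source_id, records, destination_id):
    """Build one audit record matching the metadata schema above.

    `records` is the list of raw record bytes in this batch; BatchID is
    derived here as an MD5 of the batch contents (a hypothetical choice).
    """
    payload = b"".join(records)
    return {
        "Action": action,                      # "Produced" or "Uploaded"
        "ActionTimeStamp": datetime.now(timezone.utc).isoformat(),
        "Environment": environment,            # cluster name
        "Machine": machine,                    # computer name
        "StreamID": stream_id,                 # type of data
        "SourceID": source_id,                 # e.g. file name
        "BatchID": hashlib.md5(payload).hexdigest(),
        "NumBytes": len(payload),
        "NumRecords": len(records),
        "DestinationID": destination_id,
    }

audit = make_audit("Produced", "cluster-1", "host-42", "ducks",
                   "File 1", [b"rec1", b"rec2", b"rec3"], "dest-1")
print(json.dumps(audit, indent=2))
```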
How It Works – Data and Audit Flow
- A producer in the data center writes File 1 into the Kafka + HLC pipeline under audit, e.g. records 1–3 in one batch and records 4–5 in another.
- Each batch emits a "Produced" audit: "File 1", BatchID (e.g. abc123 / def456), size (e.g. 24 bytes / 40 bytes), record count (3 / 5), timestamp.
- When a batch reaches Destination 1, an "Uploaded" audit is emitted with the same fields (e.g. 24 bytes, 3 records).
- The audit system then reconciles: Produced: file 1: 3 records, Produced: file 1: 5 records, vs. Uploaded: file 1: 3 records.
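The reconciliation step above can be sketched in a few lines of Python. This is a simplified model of what the audit system computes (the real pipeline runs on Spark); the record shapes mirror the audit metadata schema.

```python
from collections import defaultdict

# Audits as they would arrive from the example above: File 1 was
# produced in two batches (3 + 5 records) but only one batch (3
# records) has been uploaded to the destination so far.
audits = [
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 3},
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 5},
    {"SourceID": "File 1", "Action": "Uploaded", "NumRecords": 3},
]

# Sum record counts per file and per action.
totals = defaultdict(lambda: {"Produced": 0, "Uploaded": 0})
for a in audits:
    totals[a["SourceID"]][a["Action"]] += a["NumRecords"]

# Flag files whose destination count falls short of production.
incomplete = {f: t for f, t in totals.items()
              if t["Uploaded"] < t["Produced"]}
print(incomplete)  # File 1: 8 produced, only 3 uploaded so far
```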
Kibana Chart: Data Latency
Kibana Chart: Data Completeness
3 lines:
- Green: how many records were produced
- Blue: how many reached destination #1
- Green: how many reached destination #2
Code to Send an Audit
1. Create a client object
2. Prepare the audit object
3. Lastly, send it: client.SendBondObject(audit);
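The three steps can be sketched as follows. The real client is a Bond-serializing .NET client exposing SendBondObject; the `AuditClient` class, its `send` method, and the endpoint URL below are hypothetical stand-ins used only to show the shape of the flow.

```python
import json

class AuditClient:
    """Hypothetical stand-in for the real Bond audit client."""

    def __init__(self, endpoint):
        self.endpoint = endpoint  # front-end web service URL (illustrative)
        self.sent = []

    def send(self, audit):
        # The real client serializes the audit as a Bond object and
        # posts it to the front end; here we just record the payload.
        self.sent.append(json.dumps(audit))

# 1. Create a client object
client = AuditClient("https://audit-frontend.example/api")
# 2. Prepare the audit object
audit = {"Action": "Produced", "SourceID": "File 1", "NumRecords": 3}
# 3. Lastly, send it
client.send(audit)
```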
APIs for Querying Statistics
Design Overview
Design Goals for the Data Monitoring System
- Monitor the completeness and latency of streaming data
- Cover multi-producer, multi-stage, multi-destination data flows in the pipeline
- In near real time
- Provide diagnostic info: which DC, machine, or event/file has the problem
- Highly reliable: 99.9% uptime, scale out
- Trustworthy audit data
- SLA, QPS, scale…
System Design: Transient Storage (Kafka)
- Producer → Kafka HLC → Destination; "Produced" and "Uploaded" audits go through the Front End Web Service into transient storage (Kafka)
- Kafka ensures high availability
- We don't want ingress/egress to get stuck sending audit information
System Design: Audit Data Processing Pipeline (Spark)
- Consumes audits from transient storage (Kafka)
- Pre-aggregates data into 1-minute chunks
- Keeps track of duplicates and increments, handles late-arriving and out-of-order data, and is fault tolerant
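The core of that Spark stage can be sketched in plain Python: bucket audits into 1-minute windows keyed by stream and action, and drop duplicate BatchIDs, which is also what makes late and out-of-order arrivals safe to count. The function and field shapes are illustrative, not the production code.

```python
from collections import defaultdict

def aggregate(audits):
    """Pre-aggregate audit records into 1-minute chunks.

    Deduplicates on (StreamID, Action, BatchID), so replayed or
    duplicated audits are counted once, and a late-arriving audit
    simply lands in its own (earlier) minute bucket.
    """
    seen = set()                # (stream, action, batch) already counted
    windows = defaultdict(int)  # (minute, stream, action) -> record count
    for a in audits:
        key = (a["StreamID"], a["Action"], a["BatchID"])
        if key in seen:         # duplicate delivery: skip
            continue
        seen.add(key)
        minute = a["Timestamp"] - a["Timestamp"] % 60  # floor to minute
        windows[(minute, a["StreamID"], a["Action"])] += a["NumRecords"]
    return windows

audits = [
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "abc",
     "Timestamp": 65, "NumRecords": 3},
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "abc",
     "Timestamp": 65, "NumRecords": 3},   # duplicate: ignored
    {"StreamID": "ducks", "Action": "Produced", "BatchID": "def",
     "Timestamp": 10, "NumRecords": 5},   # late/out-of-order: earlier bucket
]
w = aggregate(audits)
```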
System Design: Aggregated Data Storage (ElasticSearch)
- Stores pre-aggregated data for reporting through DAS
- Allows for NRT charts using Kibana
System Design: Data Analysis Web Service
- Final reporting endpoint for consumers
- Does the destination have complete data for that time?
- Which files are missing?
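The two questions this endpoint answers reduce to a set difference over the audited window. A minimal sketch, assuming the service can list the SourceIDs produced and uploaded in a window (the function name and shapes are illustrative, not the real API):

```python
def missing_files(produced, uploaded):
    """Which produced files never reached the destination in this window?

    `produced` / `uploaded`: iterables of SourceIDs (e.g. file names)
    extracted from "Produced" and "Uploaded" audits for the window.
    """
    return sorted(set(produced) - set(uploaded))

# Example window: three files produced, one never uploaded.
produced = ["File 1", "File 2", "File 3"]
uploaded = ["File 1", "File 3"]

gap = missing_files(produced, uploaded)
complete = not gap   # destination has complete data iff nothing is missing
```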
High Availability
Components: Front End Web Service, transient storage (Kafka), audit data processing pipeline (Spark), aggregated data storage (ElasticSearch), Data Analysis Web Service
High Availability
- Active-active disaster recovery across DC1 and DC2: each DC runs its own transient storage (Kafka), audit processing pipeline (Spark), and aggregated data storage (ElasticSearch) behind the Front End and Data Analysis Web Services
- Monitoring for each key component
Trustworthy Quality Monitoring
- Audit for audit: the audit pipeline itself (Front End Web Service → Kafka → Spark → ElasticSearch) is audited the same way as the data under audit
- Handles late arrival, out-of-order processing, and duplication
Problem Diagnosis
When loss happens, rich diagnostic info is needed for ad hoc queries
Problem Diagnosis: Raw Audit Storage (Cosmos)
- Store raw audits for diagnostics, alongside the transient storage (Kafka) and the Spark/ElasticSearch pipeline
Goals Recap
- Monitor the completeness and latency of streaming data
- Cover multi-producer, multi-stage, multi-destination data flows in the pipeline
- In near real time
- Provide diagnostic info: which DC, machine, or event/file has the problem
- Highly reliable: 99.9% uptime, scale out
- Trustworthy audit data
- SLA, QPS, scale…
Footprint & SLA
- 40 executors
- 6 machines for Spark HA (3+3)
- 16 Kafka machines (8+8)
- 10 ElasticSearch machines (5+5)
- 72 machines total across 2 DCs
- Linear scale out
- 99.9% availability
- 1 min E2E latency
- 50K QPS
Versions
- Kafka under audit: 0.8.1.1
- Audit pipeline: Kafka 0.8.1.1, Spark 1.6.1, ElasticSearch 1.7.0
- To be open sourced!
The Team
Join Us
WeChat | LinkedIn