1 Hadoop 與 HBase 之架設及應用 Cloud, Hadoop and HBase Hadoop 與 HBase 之架設及應用 Cloud, Hadoop and HBase Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung.

Slides:

Advertisements

Similar presentations

MMN Lab 未來教室與雲端化學習 Yueh-Min Huang Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan

Advertisements

第一组 Java 与云计算. Contents 云计算简介一二云计算实例三云计算在教育中的应用四.

13-1 人工智慧 13-2 雲端運算 13-3 感測網路與物聯網 13-4 生物資訊 13-5 計算機萬能嗎？

云计算辅助教学风云录黎加厚上海师范大学教育技术系 2010年8月9日.

云计算及安全 ——Cloud Computing & Cloud Security

How to prepare yourself for the upcoming Cloud Era

DATE: 14/10/2009 陳威宇格網技術組雲端運算相關應用 (Based on Hadoop)

Big Data Ecosystem – Hadoop Distribution

Foundations of Computer Science

第八讲基于Hadoop的数据仓库Hive （PPT版本号：2016年4月6日版本）

教育雲端科技的現況與未來發展臺北市政府教育局聘任督學韓長澤.

网格及其应用的一些相关技术高能所计算中心于传松

巨量資料平台： Hadoop的生態系.

CHAPTER 7 認識Hadoop.

台灣雲端運算應用實驗中心研發計畫計畫期間：自98年7月1日至99年6月30日止執行單位名稱：財團法人資訊工業策進會國立中山大學.

操作系统结构.

HADOOP的高能物理分析平台孙功星高能物理研究所/计算中心

为教师开展大数据课程教学提供全方位、一站式服务

基于hadoop的数据仓库技术.

大数据在医疗行业的应用.

Introduction to MapReduce

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

CHT Project Progress Report

當企鵝龍遇上小飛象 DRBL-Hadoop Jazz Wang Yao-Tsung Wang

YARN & MapReduce 2.0 Boyu Diao

高级软件工程云计算主讲：李祥 QQ: 年12月.

分布式系统中的关键概念及Hadoop的起源、架构、搭建

第2章大数据处理架构Hadoop （PPT版本号：2017年2月版本）

形式语言与网络计算环境构建 1.

異質計算教學課程內容「異質計算」種子教師研習營洪士灝國立台灣大學資訊工程學系

Introduction to Cloud Computing Services and its Applications

王耀聰陳威宇國家高速網路與計算中心(NCHC)

作業系統補充：雲端運算.

基于Hadoop的数据仓库Hive.

線上英檢測驗系統 Copyright © 2012 Cengage Learning Asia Pte. Ltd.,

Operating System Concepts 作業系統原理 CHAPTER 2 系統結構 (System Structures)

中国散裂中子源小角谱仪的实验数据格式与处理算法报告人：张晟恺中国科学院高能物理研究所 SCE 年8月18日

CHAPTER 6 認識MapReduce.

开源云计算系统简介电子工业出版社刘鹏主编《云计算》教材配套课件11.

Cloud Computing(雲端運算) 技術的現況與應用

斯巴達帶大家上雲端.

崑山科技大學曾龍資訊工程系系主任數位生活研究所所長雲端運算與資通安全研發中心主任

國立屏東高級工業職業學校雲端網路及雲端開系統介紹

SAP 架構及基本操作 SAP前端軟體安裝與登入 Logical View of the SAP System SAP登入 IDES

信息产业导论期末汇报汇报人：刁梦鸽学号：时间：2012年5月31日.

软件工程基础云计算概论刘驰.

第4章(1) 空间数据库 —数据库理论基础北京建筑工程学院王文宇.

大数据介绍及应用案例分享 2016年7月华信咨询设计研究院有限公司.

服務於中國研究的網絡基礎設施 A Cyberinfrastructure for Historical China Studies

Microsoft SQL Server 2008 報表服務_設計

TinyOS 石万兵 2019/4/6 mice.

資料結構 Data Structures Fall 2006， 95學年第一學期 Instructor : 陳宗正.

資料庫靜宜大學資管系楊子青.

高性能计算与天文技术联合实验室智能与计算学部天津大学

高正宗 System Consultant Manager

Unit 05 雲端分散式Hadoop實驗 -I M. S. Jian

中国科学技术大学计算机系陈香兰 2013Fall 第七讲存储器管理中国科学技术大学计算机系陈香兰 2013Fall.

虚拟仪器 virtual instrument

中国科学技术大学计算机系陈香兰 Fall 2013 第三讲线程中国科学技术大学计算机系陈香兰 Fall 2013.

Common Qs Regarding Earnings

SAP 架構及基本操作 SAP前端軟體安裝與登入 Logical View of the SAP System SAP登入 IDES

百万亿次超级计算机诞生记姓名 Xiangyu Ye 职务微软中国技术中心资深HPC顾问公司微软中国

第 18 章雲端計算.

11 Overview Cloud Computing 2012 NTHU. CS Che-Rung Lee

MGT 213 System Management Server的昨天，今天和明天

Introduction to Cloud Computing Services and its Applications

OrientX暑期工作总结及计划 XML Group

Experimental Analysis of Distributed Graph Systems

Section 1 Basic concepts of web page

Presentation transcript:

1 Hadoop 與 HBase 之架設及應用 Cloud, Hadoop and HBase Hadoop 與 HBase 之架設及應用 Cloud, Hadoop and HBase Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

2 Course Information 課程資訊講師介紹： – 國網中心王耀聰副研究員 / 交大電控碩士 – 所有投影片、參考資料與操作步驟均在網路上 – 由於雲端資訊變動太快，愛護地球，請減少不必要之講義列印。礙於缺乏實機操作環境，故以影片展示與單機操作為主 – 若有興趣實機操作，請參考國網中心雲端運算課程錄影 – – – 若需要實驗環境，可至國網中心雲端運算實驗叢集申請帳號 – Hadoop 相關問題討論： –

3 淺談雲端運算趨勢與關鍵技術 The trend of cloud computing and its core technologies 淺談雲端運算趨勢與關鍵技術 Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

4 什麼是雲端運算啊？ What is Cloud Computing ? 當紅「雲端運算」你瞭解了嗎？雲端產業 8 分鐘就上手

5 National Definition of Cloud Computing 美國國家標準局 NIST 給雲端運算所下的定義 National Definition of Cloud Computing 美國國家標準局 NIST 給雲端運算所下的定義 5 Characteristics 五大基礎特徵 4 Deployment Models 四個佈署模型 3 Service Models 三個服務模式 1. On-demand self-service. 隨需自助服務隨需自助服務 2. Broad network access 隨時隨地用任何網路裝置存取隨時隨地用任何網路裝置存取 3. Resource pooling 多人共享資源池多人共享資源池 4. Rapid elasticity 4. Rapid elasticity快速重新佈署靈活度快速重新佈署靈活度 5. Measured Service 可被監控與量測的服務可被監控與量測的服務

6 2 perspectives : Services vs Technologies 您想聽的是「雲端服務」還是「雲端技術」 ? 2 perspectives : Services vs Technologies 您想聽的是「雲端服務」還是「雲端技術」 ? Cloud computing hype spurs confusion, Gartner says 淺談雲端運算 (Cloud Computing) 雲端服務雲端技術

7 Source: The wisdom of Clouds (Crowds) 雲端序曲：雲端的智慧始終來自於群眾的智慧雲端序曲：雲端的智慧始終來自於群眾的智慧 2006 年 8 月 9 日 Google 執行長施密特（ Eric Schmidt ）於 SES'06 會議中首次使用「雲端運算（ Cloud Computing ）」來形容無所不在的網路服務 2006 年 8 月 9 日 Google 執行長施密特（ Eric Schmidt ）於 SES'06 會議中首次使用「雲端運算（ Cloud Computing ）」來形容無所不在的網路服務 2006 年 8 月 24 日 Amazon 以 Elastic Compute Cloud 命名其虛擬運算資源服務 2006 年 8 月 24 日 Amazon 以 Elastic Compute Cloud 命名其虛擬運算資源服務

8 行動版隨時存取 Mobile Cloud Service 行動版隨時存取 Mobile Cloud Service 網路版多人共享 Share Service Software 網路版多人共享 Share Service Software 單機版個人使用 Personal Software 單機版個人使用 Personal Software 實體Physical實體Physical Mobile Mail Web Mail 信箱Mailbox信箱Mailbox Mobile TV Web TV Ex. Youtube Web TV Ex. Youtube 電視盒 Setop Box 電視盒電視TV電視TV M-OfficeM-Office Google Docs OfficeOffice打字機 Typer Writer 打字機 Flash Wengo SkypeSkype數位電話PBX數位電話PBX電話Telephone電話Telephone 微網誌 Twitter 部落格 Blog 電子佈告欄BBS電子佈告欄BBS佈告欄 Bullet Borad 佈告欄 Evolution of Cloud Services 雲端服務只是軟體演化史的必然趨勢雲端服務只是軟體演化史的必然趨勢

9 Key Driving Forces of Cloud Computing 雲端運算的關鍵驅動力雲端運算的關鍵驅動力隨需行動服務 Mobile Service 隨需行動服務降低經營成本 Cost Down 降低經營成本因應資料爆炸 Data Explore 因應資料爆炸資料往雲擺減少資料傳輸資料往雲擺減少資料傳輸租賃取代買斷動態隨需付費租賃取代買斷動態隨需付費用任何連網裝置都可以存取資料用任何連網裝置都可以存取資料雲端

Data Explore Top 1 : Human Genomics – 7000 PB / Year Top 2 : Digital Photos – 1000 PB+/ Year Top 3 : (no Spam) – 300 PB+ / Year 2007 Data Explore Top 1 : Human Genomics – 7000 PB / Year Top 2 : Digital Photos – 1000 PB+/ Year Top 3 : (no Spam) – 300 PB+ / Year Source: Source:

11 「笨蛋！重點在經濟」（ "It's the economy, stupid" ）卡維爾（ James Carville ）自創這句標語，促使柯林頓當上美國第 42 屆總統。年「笨蛋！重點還是在經濟」（ "It's STILL the economy, stupid" ）卻讓小布希嘲笑是幼稚的總統。年雲端時代，谷歌會說：「笨蛋！重點在資料」（ "It's the data, stupid" ）誰掌握了你的資料，就有機會掌握你的荷包想想看，電腦、手機掉了，您心疼的是甚麽呢？年

12 Reference Cloud Architecture 雲端運算的參考架構雲端運算的參考架構 User-Level Middleware Core Middleware User-Level System Level Ia a S P aa S S aa S 硬體設施 Hardware Infrastructure: Computer, Storage, Network 虛擬化 Virtualization VM, VM management and Deployment 虛擬化 Virtualization VM, VM management and Deployment 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 應用軟體 Application Social Computing, Enterprise, ISV,… 應用軟體 Application Social Computing, Enterprise, ISV,…

13 Open Source to build Private Cloud 建構私有雲端的自由軟體建構私有雲端的自由軟體 Xen, KVM, VirtualBox, QEMU, OpenVZ,... Xen, KVM, VirtualBox, QEMU, OpenVZ,... OpenNebula, Enomaly, Eucalyptus, OpenQRM,... OpenNebula, Enomaly, Eucalyptus, OpenQRM,... Hadoop (MapReduce), Sector/Sphere, AppScale Hadoop (MapReduce), Sector/Sphere, AppScale eyeOS, Nutch, ICAS, X-RIME,... eyeOS, Nutch, ICAS, X-RIME,... 硬體設施 Hardware Infrastructure: Computer, Storage, Network 虛擬化 Virtualization VM, VM management and Deployment 虛擬化 Virtualization VM, VM management and Deployment 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 應用軟體 Application Social Computing, Enterprise, ISV,… 應用軟體 Application Social Computing, Enterprise, ISV,…

14 IaaS : Virtualization Virtualization PaaS : Big Data PaaS : Big Data 模組化基礎建設模組化基礎建設無所不在的運算無所不在的運算儲存等級記憶體儲存等級記憶體情境感知運算情境感知運算社交分析社交分析次世代分析次世代分析多媒體內容多媒體內容社交溝通協作社交溝通協作平板行動應用平板行動應用雲端運算雲端運算評價排行榜評價排行榜即時搜尋即時搜尋社交網路社交網路智慧裝置智慧裝置大量資訊分析大量資訊分析雲端運算雲端運算 SaaS : Web 2.0 SaaS : Web 2.0 雲端

15 PaaS : Big Data PaaS : Big Data SaaS : Web 2.0 SaaS : Web 2.0 IaaS : Virtualization Virtualization SaaS : Web 2.0 SaaS : Web 2.0 Two Type of Cloud Architecture ? 雲端架構的兩大陣營 ? Two Type of Cloud Architecture ? 雲端架構的兩大陣營 ? 想盡辦法誘你用計算跟網路 Computing Intensive 想盡辦法誘你提供資料作分析 Data Intensive

16 Building PaaS with Open Source 用自由軟體打造 PaaS 雲端服務 Building PaaS with Open Source 用自由軟體打造 PaaS 雲端服務 Xen, KVM, VirtualBox, QEMU, OpenVZ,... Xen, KVM, VirtualBox, QEMU, OpenVZ,... OpenNebula, Enomaly, Eucalyptus, OpenQRM,... OpenNebula, Enomaly, Eucalyptus, OpenQRM,... Hadoop (MapReduce), Sector/Sphere, AppScale Hadoop (MapReduce), Sector/Sphere, AppScale eyeOS, Nutch, ICAS, X-RIME,... eyeOS, Nutch, ICAS, X-RIME,... 硬體設施 Hardware Infrastructure: Computer, Storage, Network 虛擬化 Virtualization VM, VM management and Deployment 虛擬化 Virtualization VM, VM management and Deployment 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 控制管理 Control Qos Neqotiation, Ddmission Control, Pricing, SLA Management, Metering… 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 程式語言 Programming Web 2.0 介面, Mashups, Workflows, … 應用軟體 Application Social Computing, Enterprise, ISV,… 應用軟體 Application Social Computing, Enterprise, ISV,…

17 Three Core Technologies of Google.... Google 的三大關鍵技術.... Google 在一些會議分享他們的三大關鍵技術 Google shared their design of web-search engine – SOSP 2003 : – “The Google File System” – – OSDI 2004 : – “MapReduce : Simplifed Data Processing on Large Cluster” – – OSDI 2006 : – “Bigtable: A Distributed Storage System for Structured Data” –

18 Open Source Mapping of Google Core Technologies Google 三大關鍵技術對應的自由軟體 Hadoop Distributed File System (HDFS) Sector Distributed File System Hadoop Distributed File System (HDFS) Sector Distributed File System Hadoop MapReduce API Sphere MapReduce API,... Hadoop MapReduce API Sphere MapReduce API,... HBase, Hypertable Cassandra,.... HBase, Hypertable Cassandra,.... Google File System To store petabytes of data Google File System To store petabytes of data MapReduce To parallel process data MapReduce BigTable A huge key-value datastore BigTable 更多不同語言的 MapReduce API 實作：其他值得觀察的分散式檔案系統：  IBM GPFS -  Lustre -  C eph -

19 HadoopHadoop Hadoop 是 Apache Top Level 開發專案 Hadoop is Apache Top Level Project 目前主要由 Yahoo! 資助、開發與運用 Major sponsor is Yahoo! 創始者是 Doug Cutting ，參考 Google Filesystem Developed by Doug Cutting, Reference from Google Filesystem 以 Java 開發，提供 HDFS 與 MapReduce API 。 Written by Java, it provides HDFS and MapReduce API 2006 年使用在 Yahoo 內部服務中 Used in Yahoo since year 2006 已佈署於上千個節點。 It had been deploy to nodes in Yahoo 處理 Petabyte 等級資料量。 Design to process dataset in Petabyte Facebook, Last.fm, Joost, Twitter are also powered by Hadoop

20 Sector / Sphere 由美國資料探勘中心研發的自由軟體專案。 Developed by National Center for Data Mining, USA 採用 C/C++ 語言撰寫，因此效能較 Hadoop 更好。 Written by C/C++, so performance is better than Hadoop 提供「類似」 Google File System 與 MapReduce 的機制 Provide file system similar to Google File System and MapReduce API 基於 UDT 高效率網路協定來加速資料傳輸效率 Based on UDT which enhance the network performance Open Cloud Testbed 有提供測試環境，並開發 MalStone 效能評比軟體 Open Cloud Consortium provide Open Cloud Testbed and develop MalStone toolkit for benchmark

21 Hadoop in production run.... 商業運轉中的 Hadoop 應用.... September 30, 2008 Scaling Hadoop to 4000 nodes at Yahoo!

22 Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

23

24 Features of Hadoop... Hadoop 這套軟體的特色是... 海量 Vast Amounts of Data – 擁有儲存與處理大量資料的能力 – Capability to STORE and PROCESS vast amounts of data. 經濟 Cost Efficiency – 可以用在由一般 PC 所架設的叢集環境內 – Based on large clusters built of commodity hardware. 效率 Parallel Performance – 透過分散式檔案系統的幫助，以致得到快速的回應 – With the help of HDFS, Hadoop have better performance. 可靠 Robustness – 當某節點發生錯誤，能即時自動取得備份資料及佈署運算資源 – Robustness to add and remove computing and storage resource without shutdown entire system.

25 Founder of Hadoop – Doug Cutting Hadoop 這套軟體的創辦人 Doug Cutting Doug Cutting Talks About The Founding Of Hadoop

26 History of Hadoop … 2002~2004 Hadoop 這套軟體的歷史源起 ~2004 Lucene – – 用 Java 設計的高效能文件索引引擎 API – a high-performance, full-featured text search engine library written entirely in Java. – 索引文件中的每一字，讓搜尋的效率比傳統逐字比較還要高的多 – Lucene create an inverse index of every word in different documents. It enhance performance of text searching.

27 History of Hadoop … 2002~2004 Hadoop 這套軟體的歷史源起 ~2004 Nutch – –Nutch 是基於開放原始碼所開發的網站搜尋引擎 – Nutch is open source web-search software. – 利用 Lucene 函式庫開發 – It builds on Lucene and Solr, adding web- specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

28 Three Gifts from Google.... 來自 Google 的三個禮物.... Nutch 後來遇到儲存大量網站資料的瓶頸 Nutch encounter storage issue Google 在一些會議分享他們的三大關鍵技術 Google shared their design of web-search engine – SOSP 2003 : “The Google File System” – – OSDI 2004 : “MapReduce : Simplifed Data Processing on Large Cluster” – – OSDI 2006 : “Bigtable: A Distributed Storage System for Structured Data” –

29 History of Hadoop … 2004 ~ Now Hadoop 這套軟體的歷史源起 ~ Now Dong Cutting reference from Google's publication Added DFS & MapReduce implement to Nutch According to user feedback on the mail list of Nutch.... Hadoop became separated project since Nutch 0.8 Nutch DFS → Hadoop Distributed File System (HDFS) Yahoo hire Dong Cutting to build a team of web search engine at year – Only 14 team members (engineers, clusters, users, etc.) Doung Cutting joined Cloudera at year 2009.

30 Ticket Hadoop 這套軟體的起源紀錄年二月一日

31 Who Use Hadoop ?? 有哪些公司在用 Hadoop 這套軟體 ?? Yahoo is the key contributor currently. IBM and Google teach Hadoop in universities … The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth) – from

32 Hadoop in production run.... 商業運轉中的 Hadoop 應用.... February 19, 2008 Yahoo! Launches World's Largest Hadoop Production Application

33 Hadoop in production run.... 商業運轉中的 Hadoop 應用.... September 30, 2008 Scaling Hadoop to 4000 nodes at Yahoo!

34 Comparison between Google and Hadoop Google 與 Hadoop 的比較表

35 Why should we learn Hadoop ? 為何需要學習 Hadoop ?? 1. Data Explore 資訊大爆炸資訊大爆炸 3. Looking for Jobs 好找工作 !! 3. Looking for Jobs 好找工作 !! 2. Data Mining Tool 方便作資料探勘的工作方便作資料探勘的工作

36 Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

37 Two Key Elements of Operating System 作業系統兩大關鍵組成元素

38 Terminologies of Hadoop Hadoop 文件中的專業術語

39 Two Key Roles of HDFS HDFS 軟體架構的兩種關鍵角色名稱節點 NameNode 資料節點 DataNode Master Node Manage NameSpace of HDFS Control Permission of Read and Write Define the policy of Replication Audit and Record the NameSpace Single Point of Failure Worker Nodes Perform operation of Read and Write Execute the request of Replication Multiple Nodes

40 Two Key Roles of Job Scheduler 程序排程的兩種關鍵角色 JobTrackerTaskTracker Master Node Receive Jobs from Hadoop Clients Assigned Tasks to TaskTrackers Define Job Queuing Policy, Priority and Error Handling Single Point of Failure Worker Nodes Excute Mapper and Reducer Tasks Save Results and report task status Multiple Nodes

41 Different Roles of Hadoop Architecture Hadoop 軟體架構中的不同角色

42 Distributed Operating System of Hadoop Hadoop 建構成一個分散式作業系統 42 Linuux Java Linuux Java Linuux Java Data Task Data Task Data Task Namenode JobTracker Hadoop Node1 Node2Node3

43 About Hadoop Client... 不在雲裡的 Hadoop Client

44 What we learn today ? WHENWHEN WHOWHO WHATWHAT HOWHOW WHYWHY Hadoop 是 2004 年從 Nutch 分裂出來的專案 !! Hadoop became separate project since year 2004 !! Hadoop 是 2004 年從 Nutch 分裂出來的專案 !! Hadoop became separate project since year 2004 !! 始祖是 Doug Cutting ， Apache 社群支持， Yahoo 贊助 From Doug Cutting to Apache Community, Yahoo and more ! 始祖是 Doug Cutting ， Apache 社群支持， Yahoo 贊助 From Doug Cutting to Apache Community, Yahoo and more ! Hadoop 是運算海量資料的軟體平台 !! hadoop is a software platform to process vast amount of data!! Hadoop 是運算海量資料的軟體平台 !! hadoop is a software platform to process vast amount of data!! 建構在大型的個人電腦叢集之上 Install on large clusters built of commodity hardware !! 建構在大型的個人電腦叢集之上資料大爆炸、資料探勘、找工作 Data Explore, Data Mining, Jobs !! 資料大爆炸、資料探勘、找工作

45 Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

46 What is HDFS ?? 什麼是 HDFS ?? H adoop D istributed F ile S ystem – 實現類似 Google File System 分散式檔案系統 – Reference from Google File System. – 一個易於擴充的分散式檔案系統，目的為對大量資料進行分析 – A scalable distributed file system for large data analysis. – 運作於廉價的普通硬體上，又可以提供容錯功能 – based on commodity hardware with high fault-tolerant. – 給大量的用戶提供總體性能較高的服務 – It have better overall performance to serve large amount of users.

47 Features of HDFS... HDFS 的特色是... 硬體錯誤容忍能力 Fault Tolerance – 硬體錯誤是正常而非異常 – Failure is the norm rather than exception – 自動恢復或故障排除 – automatic recovery or report failure 串流式的資料存取 Streaming data access – 批次處理多於用戶交互處理 – Batch processing rather than interactive user access. – 高 Throughput 而非低 Latency – High aggregate data bandwidth (throughput)

48 Features of HDFS... HDFS 的特色是... 大規模資料集 Large data sets and files – 支援 Petabytes 等級的磁碟空間 – Support Petabytes size 一致性模型 Coherency Model – 一次寫入，多次存取 Write-once-read-many – 簡化一致性處理問題 This assumption simplifies coherency 在地運算 Data Locality – 到資料的節點上計算 > 將資料從遠端複製過來計算 – “move compute to data” > “move data to compute” 異質平台移植性 Heterogeneous – 即使硬體不同也可移植、擴充 – HDFS could be deployed on different hardware

Parallel Computing using NFS storage 使用 NFS 進行平行運算 NFS Server RAM NFS Server Disk NFS Server Bridge NFS Server CPU NFS Server NIC NFS Client NIC NFS Client RAM NFS Client Bridge NFS Client CPU Disk I/O Network I/O Bus I/O (2) Bus I/O (1)

Parallel Computing using HDFS 使用 HDFS 進行平行運算 TaskTracker RAM TaskTracker Bridge TaskTracker CPU NameNode RAM DataNode Local Disk JobTracker Bridge JobTracker CPU JobTracker NIC Disk I/O x N Node Network I/O Bus I/O (2) Bus I/O (1) TaskTracker NIC

51 How HDFS manage data... HDFS 如何管理資料...

52 Datanodes (the slaves) How does HDFS work... HDFS 如何運作... name:/users/joeYahoo/myFile - copies:2, blocks:{1,3} name:/users/bobYahoo/someData.gzip, copies:3, blocks:{2,4,5} Namenode (the master) Client Metadata I/O Path and Filename – Replication, blocks

53 file1 (1,3) file2 (2,4,5) Namenode Map tasks Reduce tasks JobTracker TT ask for task Block 1 TT Increase reliability and read bandwidth – robustness ： read replication while found any failure – High read bandwith ： distribute read （ but increase write bottlenet ） TT TaskTracker About Data locality... HDFS 如何達成在地運算...

54 About Fault Tolerance... HDFS 如何達成容錯機制... 資料完整性 Data integrity – checked with CRC32 – 用副本取代出錯資料 – Replcae corrupt block with replication one Heartbeat – Datanode send heartbeat to Namenode Metadata – FSImage 、 Editlog 為核心印象檔及日誌檔 – FSImage – core file system mapping image – Editlog – like. SQL transaction log – 多份儲存，當名稱節點故障時可以手動復原 – Multiple backups of FSImage and Editlog – Manually recovery while NameNode Fault 資料崩毀 Data Corrupt 網路或資料節點失效 Network Fault DataNode Fault 名稱節點錯誤 NameNode Fault

55 Coherency Model and Performance of HDFS HDFS 的一致性機制與效能... 檔案一致性機制 Coherency model of files – 刪除檔案＼新增寫入檔案＼讀取檔案皆由名稱節點負責 – NameNode handle the operation of write, read and delete. 巨量空間及效能機制 Large Data Set and Performance – 預設每個區塊大小以 64MB 為單位 – By default, the block size is 64MB – 大區塊可提高存取效率 – Bigger block size will enhance read performance – 檔案有可能大過一顆磁碟 – Single file stored on HDFS might be larger than single physical disk of DataNode. – 區塊均勻散佈各節點以分散讀取流量 – Fully distributed blocks increase throughput of reading.

56 POSIX like HDFS commands 與 POSIX 相似的操作指令...

Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

58 Divide and Conquer Algorithms 分而治之演算法 Example 4: The way to climb 5 steps stair within 2 steps each time. 眼前有五階樓梯，每次可踏上一階或踏上兩階，那麼爬完五階共有幾種踏法？ Ex : (1,1,1,1,1) or (1,2,1,1) Example 1: Example 2: Example 3:

59 What is MapReduce ?? 什麼是 MapReduce ?? MapReduce 是 Google 申請的軟體專利，主要用來處理大量資料 MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. 啟發自函數編程中常用的 map 與 reduce 函數。 The framework is inspired by map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as their original forms – Map(...) : N → N Ex. [ 1,2,3,4 ] – (*2) -> [ 2,4,6,8 ] – Reduce(...): N → 1 [ 1,2,3,4 ] - (sum) -> 10 Logical view of MapReduce Map(k1, v1) -> list(k2, v2) Reduce(k2, list (v2)) -> list(k3, v3) Source:

60 Google's MapReduce Diagram Google 的 MapReduce 圖解

61 Google's MapReduce in Parallel Google 的 MapReduce 平行版圖解

62 merge split 0 split 1 split 2 input HDFS JobTracker 跟 NameNode 取得需要運算的 blocks map JobTracker 選數個 TaskTracker 來作 Map 運算，產生些中間檔案 sort/copy JobTracker 將中間檔案整合排序後，複製到需要的 TaskTracker 去 reduce JobTracker 派遣 TaskTracker 作 reduce part0 part1 output HDFS reduce 完後通知 JobTracker 與 Namenode 以產生 output How does MapReduce work in Hadoop Hadoop MapReduce 運作流程

63 I am a tiger, you are also a tiger I am a tiger you are also a tiger a,2 also,1 am,1 are,1 I,1 tiger,2 you,1 reduce JobTracker 再選一個 TaskTracker 作 reduce JobTracker 先選了三個 Tracker 做 map map I,1 you,1 tiger,1 a,1 also,1 are,1 am,1 sort & shuffle Map 結束後， hadoop 進行中間資料的重組與排序 I,1 you (1) tiger(1,1) a (1,1) also (1) are (1) am,1 MapReduce by Example (1) MapReduce 運作實例 (1)

64 ` MapReduce by Example (2) MapReduce 運作實例 (2) a b c d sqrt(a + b) sqrt(c + d) ? // A[0][1] = // A[0][1] = // A[0][2] = // A[1][0] = // A[1][1] = // A[1][2] = // A[2][0] = // A[2][1] = // A[2][2] = 1.0 ma p Input File ma p (0,1.0) (0,0.0) (0,3.0) (1,3.2) (1,0.8) (1,32.0) (2,1.0) (2,14.0) (2,1.0) (0,{1.0,0.0,3.0}) (1,{3.2,0.8,32.0}) (2,{1.0,14.0,1.0}) sort / merge (0,sqrt( )) (1,sqrt( )) (2,sqrt( )) reduce

65 MapReduce is suitable to.... MapReduce 合適用於.... Text tokenization Indexing and Search Data mining machine learning … 大規模資料集 Large Data Set 可拆解 Parallelization

Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

67 可以跟資料庫結合嘛？ Can Hadoop work with Databases ? 總不能全部都重新設計吧？如何與舊系統相容？ Can Hadoop work with existing software ? Hadoop 只支援用 Java 開發嘛？ Is Hadoop only support Java ? 開發者們有聽到大家的需求..... Yes, we hear the feedback of developers...

68 Is Hadoop only support Java ? Although the Hadoop framework is implemented in Java TM, Map/Reduce applications need not be written in Java. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map/Reduce applications (non JNI TM based).

69 Hadoop Pipes (C++, Python) Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. The C++ interface is "swigable" so that interfaces can be generated for python and other scripting languages. For more detail, check the API Document of org.apache.hadoop.mapred.pipes org.apache.hadoop.mapred.pipes You can also find example code at hadoop-*/src/examples/pipes About the pipes C++ WordCount example code:

70 Hadoop Streaming Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables (e.g. Unix shell utilities) as the mapper and/or the reducer. It's useful when you need to run existing program written in shell script, perl script or even PHP. Note: both the mapper and the reducer are executables that read the input from STDIN (line by line) and emit the output to STDOUT. For more detail, check the official document of Hadoop Streaming Hadoop Streaming

71 Running Hadoop Streaming hadoop jar hadoop-streaming.jar hadoop jar hadoop-streaming.jar -help 10/08/11 00:20:00 ERROR streaming.StreamJob: Missing required option -input Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \ $HADOOP_HOME/hadoop-streaming.jar [options] Options: -input DFS input file(s) for the Map step -output DFS output directory for the Reduce step -mapper The streaming command to run -combiner Combiner has to be a Java class -reducer The streaming command to run -file File/dir to be shipped in the Job jar file -dfs |local Optional. Override DFS configuration -jt |local Optional. Override JobTracker configuration -additionalconfspec specfile Optional. -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional. -outputformat TextOutputFormat(default)|JavaClassName Optional. … More …

72 Hadoop Streaming with shell commands (1) hadoop:~$ hadoop fs -rmr input output hadoop:~$ hadoop fs -put /etc/hadoop/conf input hadoop:~$ hadoop jar hadoop-streaming.jar -input input -output output -mapper /bin/cat -reducer /usr/bin/wc

73 Hadoop Streaming with shell commands (2) hadoop:~$ echo "sed -e \"s/ /\n/g\" | grep." > streamingMapper.sh hadoop:~$ echo "uniq -c | awk '{print \$2 \"\t\" \$1}'" > streamingReducer.sh hadoop:~$ chmod a+x streamingMapper.sh hadoop:~$ chmod a+x streamingReducer.sh hadoop:~$ hadoop fs -put /etc/hadoop/conf input hadoop:~$ hadoop jar hadoop-streaming.jar -input input -output output -mapper streamingMapper.sh -reducer streamingReducer.sh -file streamingMapper.sh -file streamingReducer.sh

74 There are serveral Hadoop subprojects Hadoop Common: The common utilities that support the other Hadoop subprojects. HDFS: A distributed file system that provides high throughput access to application data. MapReduce: A software framework for distributed processing of large data sets on compute clusters.

75 Other Hadoop related projects Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications.

76 Hadoop Ecosystem Hadoop Core (Hadoop Common) Avro ZooKeeperHDFSMapReduce HBaseHiveChukwaPig Source: Hadoop: The Definitive Guide

77 Avro Avro is a data serialization system. It provides: – Rich data structures. – A compact, fast, binary data format. – A container file, to store persistent data. – Remote procedure call (RPC). – Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages. For more detail, please check the official document:

78 Zoo Keeper ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

79 Pig Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: – Ease of programming – Optimization opportunities – Extensibility

80 Hive Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. Hive QL is based on SQL and enables users familiar with SQL to query this data.

81 Chukwa Chukwa is an open source data collection system for monitoring large distributed systems. built on top of HDFS and Map/Reduce framework includes a ﬂexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

82 Mahout Mahout is a scalable machine learning libraries. implemented on top of Apache Hadoop using the map/reduce paradigm. Mahout currently has – Collaborative Filtering – User and Item based recommenders – K-Means, Fuzzy K-Means clustering – Mean Shift clustering – More...

83 Hadoop 與 HBase 簡易安裝（單機模式） Hadoop4Win : an Easy Way to install Hadoop and HBase on Windows Hadoop 與 HBase 簡易安裝（單機模式） Hadoop4Win : an Easy Way to install Hadoop and HBase on Windows Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang

84

85 Questions? Slides - Questions? Jazz Wang Yao-Tsung Wang Jazz Wang Yao-Tsung Wang