Hadoop MapReduce
http://hadoop.apache.org/common/docs/r0.17.0/mapred_tutorial.html

Hadoop Map-Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the Map-Reduce framework and the Distributed FileSystem run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The Map-Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Although the Hadoop framework is implemented in Java™, Map-Reduce applications need not be written in Java.
Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications to process vast amounts of data in HDFS. It:
- provides highly reliable computation
- provides a fault-tolerance mechanism
- reduces the network bandwidth needed for data transfer
- provides load balancing
MapReduce Application Areas in Cloud Computing
- analysis and statistics over massive data sets
- sorting and consolidation of massive data sets
- analysis of web access logs
- …
Hadoop MapReduce Architecture
Job / Task: the basic units of work in MapReduce; each job is carried out as Map Tasks and Reduce Tasks.
JobTracker: manages task scheduling and collects the progress of every job, so as to coordinate the execution of all jobs on the system.
TaskTracker: executes the various tasks and reports execution progress back to the JobTracker.
Job Submission
- Validate the job's input and output: check that an output directory is specified and whether it already exists (the job fails if it does).
- Compute the input splits for the job's input files.
- Copy the job's jar file (the packaged Java executable) and its configuration into the MapReduce system directory on HDFS.
- Submit the job to the JobTracker queue and monitor its status.
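As a minimal sketch of what the job client does, using the classic org.apache.hadoop.mapred API (the pass-through IdentityMapper/IdentityReducer job here is illustrative, not part of the lab):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Illustrative pass-through job: copies its input to the output directory.
public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitSketch.class);
    conf.setJobName("submit-sketch");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input must exist on HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output must not exist yet

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // submitJob() hands the job to the JobTracker queue and returns a
    // handle for monitoring; JobClient.runJob() (used by the WordCount
    // example later) submits and then blocks, printing progress.
    JobClient client = new JobClient(conf);
    RunningJob running = client.submitJob(conf);
    System.out.println("Tracking URL: " + running.getTrackingURL());
  }
}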
Job Scheduling
By default, the FIFO scheduler runs jobs in submission order.
Fair scheduler: a multi-user scheduling policy.
- Gives every user a fair share of the cluster capacity over time. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
- Supports preemption: if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity.
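As a sketch: on later classic Hadoop releases that bundle the Fair Scheduler contrib module, the JobTracker is switched away from FIFO via mapred-site.xml; the property below is taken from the Fair Scheduler documentation (the contrib jar must also be on the JobTracker's classpath):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>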
Job Input
Input Split: a fragment of the input data produced by splitting it.
- Splits have a fixed (but configurable) size.
- Each split produces one Map Task.
- Split placement takes data locality optimization into account.
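A sketch of adjusting split granularity from the job configuration; mapred.min.split.size is the classic property name, and these are illustrative lines one could add to the WordCount driver shown later in this deck:

// Raise the minimum split size so each Map Task processes more data;
// by default, splits track the HDFS block size.
conf.setLong("mapred.min.split.size", 64L * 1024 * 1024); // 64 MB
conf.setNumMapTasks(10); // only a hint; the actual count follows the splits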
Map Phase
After the job's input files have been divided into splits, the Map phase transforms them into intermediate data.
The intermediate map output is partitioned and sorted, then staged on the TaskTracker's local disk.
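The partition step decides which Reduce Task receives each intermediate key; by default a hash of the key is used. A sketch of a custom partitioner in the classic mapred API (the even/odd routing rule and the types are purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative: send even keys to one reducer and odd keys to another.
public class ParityPartitioner implements Partitioner<IntWritable, Text> {
  public void configure(JobConf job) { } // no per-job setup needed

  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Must return a partition number in [0, numPartitions).
    return (key.get() & 1) % numPartitions;
  }
}
// Enabled with: conf.setPartitionerClass(ParityPartitioner.class);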
Combine Phase
A Map Task can run a combine step before emitting its intermediate data, which:
- saves part of the bandwidth needed to transfer the data
- speeds up writing to disk
- saves disk space
Example (the combine function here is max):
Map Task 1 emits (12, 30), (12, 10), (12, 15), (12, 19); Combine 1 outputs (12, 30).
Map Task 2 emits (12, 22), (12, 18), (12, 9); Combine 2 outputs (12, 22).
Without combiners, the Reduce Task receives (12, [30, 10, 15, 19, 22, 18, 9]); with combiners, it receives only (12, [30, 22]).
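A sketch of the max-style combiner from the example above, written as a classic-API reducer (in the WordCount example later in this deck, the reducer itself doubles as the combiner instead):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.*;

// Keep only the maximum value per key, matching the (12, 30) / (12, 22)
// example. Because max is associative and commutative, applying it
// per map task does not change the final result.
public class MaxCombiner extends MapReduceBase
    implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
  public void reduce(IntWritable key, Iterator<IntWritable> values,
                     OutputCollector<IntWritable, IntWritable> output,
                     Reporter reporter) throws IOException {
    int max = Integer.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new IntWritable(max));
  }
}
// Enabled with: conf.setCombinerClass(MaxCombiner.class);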
Reduce Phase
Shuffle: the process of transferring map outputs to the reducers' inputs.
- Intermediate data is downloaded from multiple Map Tasks concurrently.
- Wait times can be tuned to the cluster's network speed.
- A memory buffer stages the downloaded intermediate data.
Reduce: processes the intermediate data and stores the final result in HDFS.
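As a sketch, two classic reduce-side knobs (illustrative lines for a job driver; mapred.reduce.parallel.copies is the classic property for the number of concurrent map-output fetches, default 5):

conf.setNumReduceTasks(2);                        // how many Reduce Tasks to run
conf.setInt("mapred.reduce.parallel.copies", 10); // concurrent shuffle downloads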
MapReduce Workflow
Divide and Conquer
Divide: the Job is split into many Map Tasks. Conquer/combine: Reduce Tasks merge the map outputs into the Final Result. There can be more Reduce Tasks than Map Tasks.
Architecture
Map tasks and Reduce tasks are the HDFS clients. In a typical cluster, the JobTracker (JT) runs alongside the NameNode (NN) on the master, and each worker node runs a TaskTracker (TT) alongside a DataNode (DN).
Legend: JT: JobTracker, TT: TaskTracker, NN: NameNode, DN: DataNode
Work Flow
- A MapReduce job usually splits the input data-set into independent chunks.
- The chunks are processed by the map tasks.
- The framework shuffles the outputs of the maps, which are then input to the reduce tasks.
- The reduce tasks combine the outputs of the maps and store the final output in the file system.
[Figure: files in the native file system are copied into HDFS and split into input chunks; the map tasks produce intermediate output, which the shuffle turns into reduce input; the reduce tasks write the final output back to HDFS.]
Failure Handling
Task failure: most commonly the user's program throws a runtime exception while a Map or Reduce Task is executing.
TaskTracker failure: a TaskTracker crashes or runs too slowly.
JobTracker failure: there is currently no mechanism for handling JobTracker failure; such failures are rare, because the probability of that particular machine failing is low.
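Failed tasks are simply re-executed, up to a per-task attempt limit. A sketch of the classic JobConf setters (four attempts is the usual default; illustrative lines for a job driver):

conf.setMaxMapAttempts(4);    // give up on the job after 4 failed tries of a map task
conf.setMaxReduceAttempts(4); // likewise for reduce tasks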
Running simple Java programs on a Hadoop cluster: see Pro Hadoop (pro-hadoop.PDF), Chapter 10, p. 329.
WordCount Example
WordCount Example
Environment: a cluster with one NameNode and two DataNodes.
(1) First, upload the text files to be word-counted to the HDFS file system.
Explanation:
- Go to the home directory.
- Create a directory named input and add two text files, file1.txt and file2.txt.
- Go to the Hadoop home directory and copy the input directory to HDFS.
- List the files on HDFS.
WordCount Example
Steps:
mkdir input
cd input
echo "hello world bye world" > file1.txt
echo "hello hadoop bye hadoop" > file2.txt
cd /opt/hadoop
bin/hadoop dfs -copyFromLocal ~/input input
bin/hadoop dfs -ls input
WordCount Example
(2) Edit the WordCount.java program. It can be downloaded from:
http://trac.nche.org.tw/cloud/attachment/wiki/jazz/Hadoop_Lab6/WordCount.java?format=raw
WordCount Example (1/3)

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
WordCount Example (2/3)

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sum the counts collected for each word.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
WordCount Example (3/3)

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // the reducer doubles as the combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submit and block until the job completes
  }
}
WordCount Example
(3) Package and compile WordCount.java into wordcount.jar.
Steps:
mkdir MyJava
javac -classpath hadoop-*-core.jar -d MyJava WordCount.java
jar -cvf wordcount.jar -C MyJava .
bin/hadoop jar wordcount.jar WordCount input/ output/
WordCount Example
(4) View the execution results:
bin/hadoop dfs -cat output/part-00000
The output is as follows:
bye 2
hadoop 2
hello 2
world 2
Reference: http://www.slideshare.net/waue/hadoop-map-reduce-3019713