Presentation is loading. Please wait.

Presentation is loading. Please wait.

Hadoop与数据分析 淘宝数据平台及产品部基础研发组 周敏 日期:2010-05-26 1.

Similar presentations


Presentation on theme: "Hadoop与数据分析 淘宝数据平台及产品部基础研发组 周敏 日期:2010-05-26 1."— Presentation transcript:

1 Hadoop与数据分析 淘宝数据平台及产品部基础研发组 周敏 日期: 1

2 Outline Hadoop基本概念 Hadoop的应用范围 Hadoop底层实现原理 Hive与数据分析 Hadoop集群管理
常见问题及解决方案 2

3 关于打扑克的哲学

4 打扑克与MapReduce 分牌 各自齐牌 再次理牌 搞定 交换 Input split shuffle output

5 统计单词数 a 1 the 1 weather 1 is 1 good 1 The weather good 1 is good a 1
today 1 is 1 good 1 Today is good guy 1 guy 1 is 4 is 1 this 1 guy 1 is 1 a 1 good 1 man 1 man 2 This guy is a good man the 1 man 1 this 1 the 1 today 1 good 1 man 1 is 1 this 1 Good man is good weather 1 today 1 weather 1

6 流量计算 6 6

7 趋势分析 7 7 7

8 用户推荐 8 8 8

9 分布式索引 9 9 9

10 Hadoop生态系统 Hadoop 核心 并行数据分析语言Pig 列存储NoSQL数据库 Hbase 分布式协调器Zookeeper
Hadoop Common 分布式文件系统HDFS MapReduce框架 并行数据分析语言Pig 列存储NoSQL数据库 Hbase 分布式协调器Zookeeper 数据仓库Hive(使用SQL) Hadoop日志分析工具Chukwa

11 Hadoop实现 Hadoop Cluster Data Results MAP Reduce DFS Block 1
Data data data data data Results Data data data data

12

13 作业执行流程

14 Hadoop案例(1) // MapClass1中的map方法
public void map(LongWritable Key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String strLine = value.toString(); String[] strList = strLine.split("\""); String mid = strList[3]; String sid = strList[4]; String timestr = strList[0]; try{ timestr = timestr.substring(0,10); }catch(Exception e){return;} timestr += "0000"; // 省略数十行 output.collect(new Text(mid + “\”” + “sid\”” + timestr , ...); }

15 Hadoop案例(2) public static class Reducer1 extends MapReduceBase implements Reducer<Text, Text, Text, Text> { private Text word = new Text(); private Text str = new Text(); public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String[] t = key.toString().split("\""); word.set(t[0]);// str.set(t[1]); output.collect(word,str);//uid kind }//reduce }//Reduce0b

16 Hadoop案例(3) public static class MapClass2 extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private Text word = new Text(); private Text str = new Text(); public void map(LongWritable Key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String strLine = value.toString(); String[] strList = strLine.split("\\s+"); word.set(strList[0]); str.set(strList[1]); output.collect(word,str); }

17 Hadoop案例(4) public static class Reducer2 extends MapReduceBase implements Reducer<Text, Text, Text, Text> { private Text word = new Text(); private Text str = new Text(); public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { while(values.hasNext()) { String t = values.next().toString(); // 省略数十行代码 } output.collect(new Text(mid + “\”” + sid + “\””) , ...)

18 Thinking in MapReduce(1)
B C A Filter Co-group A B D Group Filter B C D Function C Aggregate

19 Thinking in MapReduce(2)

20 SELECT COUNT(DISTINCT mid) FROM log_table
Hive的魔力 Magics of Hive: SELECT COUNT(DISTINCT mid) FROM log_table

21 为什么淘宝采用Hadoop? webalizer awstat 般若 Atpanel时代 Hadoop时代 日志最高达250GB/天
最高达约50道作业 每天运行20小时以上 Hadoop时代 当前日志470GB/天 当前366道作业 平均6~7小时完成 21

22 还有谁在用Hadoop? 雅虎北京全球软件研发中心 中国移动研究院 英特尔研究院 金山软件 百度 腾讯 新浪 搜狐 IBM Facebook
Amazon Yahoo!

23 Web站点的典型Hadoop架构 Web Servers Log Collection Servers Filers
Data Warehousing on a Cluster Oracle RAC Federated MySQL 23

24 淘宝Hadoop与Hive的使用 Scheduler Thrift Server Rich Client Client Program
Web Server CLI/GUI MetaStore Server Web Mysql JobClient

25 调试 标准输出,标准出错 Web显示(50030, 50060, 50070) NameNode,JobTracker, DataNode, TaskTracker日志 本地重现: Local Runner DistributedCache中放入调试代码

26 Profiling 目的:查性能瓶颈,内存泄漏,线程死锁等
工具: jmap, jstat, hprof,jconsole, jprofiler mat,jstack 对JobTracker的Profile 对各slave节点TaskTracker的Profile 对各slave节点某Child进程的Profile(可能存 在单点执行速度过慢)

27 监控 目的:监控集群或单个节点I/O, 内存及CPU 工具: Ganglia

28 如何减少数据搬动? 28 28 28

29 数据倾斜 29 29 29

30


Download ppt "Hadoop与数据分析 淘宝数据平台及产品部基础研发组 周敏 日期:2010-05-26 1."

Similar presentations


Ads by Google