EMC Big Data Management and Analytics: Isilon + Hadoop
Mao Yongquan, EMC Big Data Technical Consultant
Phone: 13808006657
Email: mmao@isilon.com
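Today's agenda: the Virtual Observatory; introduction to Isilon; the big data opportunity and Hadoop; Hadoop's technical challenges and the EMC solution; Q&A.
Here's what we're going to cover in today's session. (Walk through the agenda.)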
China Virtual Observatory: business directions
Building a cloud for astronomy science and technology.
Open data sharing services: fast access to domestic and international data resources, with support for filtering and fusing massive heterogeneous data.
Data analysis and mining environment: a processing and mining environment for research users that handles massive, high-dimensional, complex data and supports astronomy research projects.
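EMC at a glance
Ranked No. 152 in the Fortune 500.
Ranked No. 2 among Fortune's World's Most Admired Companies in the computer category.
Market capitalization: $59 billion.
50,000 employees across 83 countries.
No. 1 in the market for storage, backup, big data, information security, and virtualization.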
Disruptive IT trends and opportunities: mobile, cloud computing, big data, social, trust.
EMC's focus: mobile, cloud computing, big data, social, trust.
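EMC cloud computing and big data platform
[Diagram labels] Business applications and big data workloads: Java, Greenplum, SAP, VMware. Infrastructure: VMware, Ionix, VPLEX. Primary storage: VMAX, VNX, Isilon, Atmos. Backup and archive: Data Domain, Avamar, NetWorker.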
EMC Isilon typical architecture
Host-side systems and the multi-protocol application layer: file protocols (NFS, CIFS, HTTP, FTP), HDFS for Hadoop, and a RESTful API (GET, PUT, POST, DELETE) for object access.
Standard network layer: Gigabit and 10 Gigabit Ethernet.
Isilon clustered storage.
Intra-cluster communication: InfiniBand layer.

Note to Presenter: View in Slide Show mode for animation. This slide provides an overview of the Isilon scale-out NAS architecture. Isilon is multi-protocol, supporting NFS, CIFS, HTTP, FTP, HDFS for Hadoop and data analytics, and REST for object and cloud computing requirements. At the client/application layer, the Isilon NAS architecture supports a wide range of operating system environments, as shown here. At the Ethernet level, the Isilon OneFS operating system supports key industry-standard protocols, including NFS, CIFS, and HDFS (Hadoop Distributed File System), and provides great interoperability for business applications as well as your data analytics activities. OneFS is a single file system, single volume architecture, which makes it extremely easy to manage regardless of the number of nodes in the storage cluster. Isilon storage systems scale from a minimum of three nodes up to 144 nodes, all of which are connected by an InfiniBand communications layer.
EMC Isilon overview: the value delivered to customers
Massive scalability: scales to more than 20 PB in a single file system.
World-record performance: more than 100 GB/s of throughput and 1.6 million SPECsfs operations.
Unmatched efficiency: more than 80% storage utilization, with automated storage tiering.
Enterprise data protection: efficient backup and recovery, reliable disaster recovery, WORM data retention, and N+1 to N+4 redundancy.
Management simplicity: a single file system, a single volume, and a global namespace.
Operational flexibility: integrated support for multiple industry-standard protocols, including NFS, SMB, HTTP, FTP, iSCSI, and HDFS.

Our core innovations and value to customers include:
Scalability from a 3-node cluster to 144 nodes and to over 15 PB in a single file system and single volume, eliminating the management of multiple containers.
Performance that scales with capacity, delivering over 100 GB/s of throughput with 1.6M I/O operations per second.
Unmatched efficiency, with over 80% utilization rates and automated storage tiering with Isilon SmartPools.
Enterprise data protection, with efficient data backup and protection, reliable disaster recovery, and WORM data retention.
Management simplicity that makes it easy to scale capacity and performance without incurring an increase in OPEX, all within a single file system, single volume, and global namespace.
Operational flexibility that includes integrated support for a wide range of industry-standard protocols, including NFS, SMB, HTTP, FTP, iSCSI, and now HDFS (Hadoop Distributed File System), announced on January 31, 2012.
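As a rough illustration of how a redundancy level such as N+1 to N+4 relates to a utilization figure, the sketch below uses a generic N+M protection model: N data units plus M protection units per stripe. The function name and the stripe widths are assumptions chosen for illustration; this is not a description of OneFS internals.

```python
def protection_efficiency(data_units, protection_units):
    """Fraction of raw capacity available for data in a generic N+M layout."""
    if data_units <= 0 or protection_units < 0:
        raise ValueError("need at least one data unit and no negative protection")
    return data_units / (data_units + protection_units)


# Hypothetical stripe widths: wider stripes lose less capacity to protection.
for n, m in [(3, 1), (16, 2), (16, 4)]:
    print(f"{n}+{m}: {protection_efficiency(n, m):.0%} of raw capacity holds data")
```

Under this simple model, a given protection level costs proportionally less as the stripe gets wider, which is one way a larger cluster could reach a higher utilization rate.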
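Isilon OneFS product architecture
OneFS: EMC Isilon's patented operating system, responsible for I/O scheduling and cluster management.
SmartConnect™: load balancing and failover for application access.
SmartPools™: automated tiering.
SnapshotIQ™: local data protection and recovery.
SyncIQ™: data replication between storage systems (local or remote) to ensure business continuity.
SmartQuotas™: reporting on and managing storage resource usage, with thin provisioning.
SmartDedupe: data deduplication.
SmartLock™: WORM technology.
InsightIQ™: storage performance reports and usage trend analysis.
HDFS support: for Hadoop big data applications.
Isilon for vCenter: management for virtualized applications.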
Powerful yet simple: scale-out. Expansion completes in 60 seconds, with no downtime.
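Taking efficiency to a new level: AutoBalance
While the system is online and in production, AutoBalance migrates content to new storage nodes.
Automatically balancing data across nodes reduces the cost, complexity, and risk of scaling storage.
No manual intervention, no reconfiguration, and no changes to server or client mount points or applications.
Eliminates "hot spots".
[Animation: nodes shown as empty and full become balanced.]

A bit more detail on the AutoBalance feature: the EMC Isilon AutoBalance feature migrates content to new storage nodes while the system is online and in production. This eliminates "hot spots" and reduces the cost, complexity, and risk of scaling storage. Note to Presenter: build for full effect.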
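The balancing idea can be sketched with a toy model. It assumes equal-sized nodes and an even spread as the target; the function name and the capacities are hypothetical, and this is not how AutoBalance is implemented.

```python
def rebalance(node_usage_tb, new_nodes=1):
    """Spread the data evenly across existing nodes plus empty new ones.

    Returns (per-node usage after balancing, total data moved in TB).
    """
    if new_nodes < 0:
        raise ValueError("new_nodes must not be negative")
    nodes = list(node_usage_tb) + [0.0] * new_nodes
    target = sum(nodes) / len(nodes)
    # Only data above the target on a node has to move.
    moved = sum(max(0.0, used - target) for used in nodes)
    return [target] * len(nodes), moved


# Three hypothetical nodes holding 30 TB each, plus one empty new node.
balanced, moved_tb = rebalance([30.0, 30.0, 30.0], new_nodes=1)
print(balanced, moved_tb)
```

In this model only a quarter of the stored data has to move when a fourth node joins, and no single node ends up fuller than the others.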
Automated data tiering: new data is placed on the SSD/SAS disk pool; old data moves to the SATA disk pool.
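A tiering rule of this kind can be sketched as a simple age check. The pool names, the 30-day threshold, and the function are assumptions made for illustration; they do not reflect the SmartPools policy syntax.

```python
from datetime import datetime, timedelta

# Hypothetical pool names, chosen only for this example.
FAST_POOL = "ssd_sas_diskpool"
CAPACITY_POOL = "sata_diskpool"


def choose_pool(last_modified, now, max_age=timedelta(days=30)):
    """Return the pool a file would live in under a simple age rule."""
    return FAST_POOL if now - last_modified <= max_age else CAPACITY_POOL


now = datetime(2012, 6, 1)
print(choose_pool(datetime(2012, 5, 25), now))  # recently modified file
print(choose_pool(datetime(2011, 12, 1), now))  # file untouched for months
```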
The industry's only scale-out storage solution with embedded HDFS
Compute: MapReduce.
Storage: HDFS, natively embedded and simple to manage.
Big data and analytics: the major opportunity in EMC Hadoop solutions
Unlocking the full value of big data.
<To kick off the presentation: welcome the audience and thank them for joining us.>
"Discover: 'Big Data' is more extreme than volume" (Gartner)
"Total data: 'bigger' than big data" (451 Group)
"Big data is less about size and more about freedom" (TechCrunch)
"Big data! It is real, it is delivered in real time, and it is changing your world" (IDC)

<This slide gives you the opportunity to remind the audience that "big data" is a huge topic of interest today (and with good reason).> I'm sure you've seen some of the articles in the press about "Big Data". It seems as if everyone is talking about it. Some of you are probably living it today. There's lots of interest in it, but many aren't exactly sure what they should be doing about it. Big Data has been recognized the world over for the potential impact it can have. Gartner has said that enterprises that embrace Big Data will outperform their peers financially by 20%. <click>
The era of big data has arrived
Make no mistake about it: the era of Big Data is here. Now, let's look at a few industry examples of how "Big Data" can impact businesses.
Hadoop and big data
Now let's look at "Hadoop" and its role in Big Data analytics.
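Hadoop emerges
Created 6-7 years ago.
A software platform designed to analyze massive volumes of unstructured data.
Two core components: the Hadoop Distributed File System (HDFS) for storage, and MapReduce for compute.
Now a top-level Apache project backed by a large open source development community.

Hadoop was developed 5-6 years ago specifically to address the need for "Big Data analytics". At the time, development of Hadoop was being driven by the big Internet companies like Yahoo! and Google, who were amassing a huge amount of unstructured data and needed a new way to analyze it because traditional approaches couldn't handle this new "Big Data" challenge. The development of Hadoop was pioneered by Doug Cutting, a former Yahoo! engineer. Hadoop consists of 2 key elements: the Hadoop Distributed File System (HDFS), which handles the storage component of the system, and MapReduce, which handles the "compute" function. Today, Hadoop is an open source initiative, very similar to Linux, backed by a large open source development community who collaborate on "Apache Hadoop". As with Linux, there are a number of approved or authorized Apache Hadoop distributions, including EMC Greenplum's "Greenplum HD". <You may also want to note that "Hadoop" got its name from Doug Cutting's son's toy elephant. This also explains the "elephant" that is often depicted on materials relating to Apache Hadoop.> Now let's look at why Hadoop is so important.

In recent years, as astronomical data has also grown explosively, data processing pipelines have become increasingly massive and parallel, and data now arrives in both unstructured and structured formats. The underlying data processing systems are generally built on clusters. For massive astronomical data, the complexity of spatial computation and the sheer scale of the data mean that traditional ways of implementing parallel data processing pipelines, such as DBMSs and grid computing, struggle to meet the performance and scalability needs of astronomical applications.

MapReduce is a concise, abstract distributed computing model. It is architecturally simple, free and open source, highly scalable, highly available, and effective for data-intensive applications. It also handles load balancing, data distribution, fault tolerance, resource allocation, and network storage for parallel computation, so people can operate large cluster systems easily without having to consider hardware details, which improves productivity.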
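To make the MapReduce model concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It runs in plain Python and does not use the Hadoop API; the function names and the sample records are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["star galaxy star", "galaxy quasar"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many machines and the framework performs the shuffle; the programmer supplies only the two functions.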
Why Hadoop matters
A practical analytics approach for very large scale.
Designed to cope with the growth of unstructured data.
Opens up new ways to gain insight and discover business opportunities.
Over the next 5 years, enterprise data will grow to 650% of its current volume.
More than 80% of that growth will be unstructured data.

One reason Hadoop has emerged as an important technology is that it is an innovative Big Data analytics engine designed specifically for massively large data volumes. With it, organizations can greatly reduce the time required to derive valuable insight from an enterprise's dataset. By adopting Hadoop to store and analyze massive data volumes, enterprises are gaining an agile new platform to deliver new insights and identify new opportunities to accelerate their business. Hadoop has also been designed to tackle analytics for unstructured data. This is significant because this is the dominant area of data growth projected for the foreseeable future. Now let's look at how the adoption of Hadoop is evolving.
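The growth figures on this slide can be turned into a quick back-of-the-envelope calculation. The sketch reads the claim as "6.5 times today's volume after 5 years" and assumes steady compound growth; the 100 TB starting point is an arbitrary example value.

```python
def implied_annual_growth(multiple, years):
    """Yearly growth rate that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years) - 1.0


def unstructured_growth(current_tb, multiple, unstructured_share):
    """Portion of the added data volume that is unstructured."""
    growth = current_tb * (multiple - 1.0)
    return growth * unstructured_share


rate = implied_annual_growth(6.5, 5)
added_unstructured_tb = unstructured_growth(100.0, 6.5, 0.8)
print(f"implied growth per year: {rate:.1%}")
print(f"unstructured data added to a 100 TB estate: {added_unstructured_tb:.0f} TB")
```

Under these assumptions the data volume grows by roughly 45% per year.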
Hadoop's technical challenges
In this section, we're going to identify and describe the key technology challenges of Hadoop, especially when deployed using direct-attached storage (DAS).
Hadoop's technical challenges (Hadoop DAS environment)
1. Dedicated storage infrastructure.
2. Single point of failure: the NameNode.
3. Lack of enterprise data protection: no snapshots, replication, or backup.
4. Low storage efficiency: 3x mirroring.
5. Fixed scalability: a fixed compute-to-storage ratio.
6. Manual import/export: no protocol support.

One challenge associated with traditional deployments of Hadoop is that they have largely been built on a dedicated infrastructure that is not integrated with or connected to any other applications. In effect, this is a siloed environment, often outside the realm of the IT team. This poses a number of inefficiencies and risks. <Click to next slide>
Hadoop's technical challenges: single point of failure
A well-recognized issue with traditional Hadoop deployments is the "single-point-of-failure" problem with the Hadoop NameNode. In a Hadoop environment, a single NameNode manages the Hadoop file system. If it goes down, the Hadoop environment immediately goes offline. While the Apache Hadoop open source team is working on ways to address this issue, it can still take hours or days to recover from the loss of the NameNode (the larger the amount of data, the longer it takes). In the meantime, the system is unavailable. <Click to next slide>
Hadoop's technical challenges: lack of enterprise data protection and low storage efficiency
[Diagram labels: 1x, 2x, 3x copies]
Another issue with traditional Hadoop environments is the lack of enterprise-level data protection. Typical Hadoop deployments do not have rigorous backup and recovery capabilities such as snapshots, or data replication for disaster recovery (DR) purposes. Traditional Hadoop deployments on direct-attached storage (DAS) are also extremely inefficient. It's not unusual for a DAS environment to operate with a 30-35% storage utilization rate (or less). Compounding this inefficiency is the fact that data is often mirrored (the default is 3 times). In addition to storage inefficiency, this type of infrastructure is very management-intensive. <Click to advance to next slide>
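A quick calculation makes the compounding effect concrete. The sketch assumes that the utilization rate and the number of copies simply multiply, and the 100 TB raw capacity is an arbitrary example value; the 80% comparison figure is the Isilon utilization rate quoted elsewhere in this deck.

```python
def unique_data_tb(raw_tb, utilization, copies=1):
    """Unique data that fits, given a utilization rate and a copy count."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb * utilization / copies


# DAS example from the notes: 35% utilization with the default 3 copies.
das_tb = unique_data_tb(100.0, 0.35, copies=3)
# Comparison point: 80% utilization, with protection already counted in that rate.
scale_out_tb = unique_data_tb(100.0, 0.80)
print(f"DAS: {das_tb:.1f} TB of unique data per 100 TB raw")
print(f"Scale-out: {scale_out_tb:.1f} TB of unique data per 100 TB raw")
```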
Hadoop's technical challenges: fixed scalability and manual import/export
Another issue with Hadoop running on direct-attached storage is that server and storage resources must be increased together in lock-step. For example, if more storage is required, a new server must be deployed (and vice versa). This rigidity adds further inefficiency. Another issue is the manual import and export of data that is required in a traditional Hadoop environment. In addition to consuming time and resources (bandwidth), the Hadoop data in typical environments cannot be accessed by or shared with other enterprise applications because of the lack of industry-standard protocol support. To address these challenges and to enable enterprises to begin realizing the benefits of Hadoop quickly and easily, EMC has recently introduced an exciting new Hadoop solution.
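The cost of lock-step scaling can be sketched with a small sizing function. The per-server figures (24 TB and 16 cores) and the workload (500 TB, 64 cores) are hypothetical values chosen only to show the effect.

```python
import math


def das_servers_needed(storage_tb, cores, tb_per_server, cores_per_server):
    """Servers required when storage and compute only come bundled together."""
    by_storage = math.ceil(storage_tb / tb_per_server)
    by_compute = math.ceil(cores / cores_per_server)
    # The larger requirement dictates the purchase; the other resource sits idle.
    return max(by_storage, by_compute)


servers = das_servers_needed(storage_tb=500, cores=64, tb_per_server=24, cores_per_server=16)
idle_cores = servers * 16 - 64
print(f"{servers} servers bought, {idle_cores} cores beyond what the job needs")
```

In this example the storage requirement drives the server count, so most of the purchased compute is surplus; with independent scaling, 4 compute servers would cover the 64 cores and storage could be added separately.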
EMC Isilon advantages for Hadoop
1. Scale-out storage platform: multiple applications and workflows.
2. No single point of failure: distributed NameNode.
3. End-to-end data protection: SnapshotIQ, SyncIQ, and NDMP backup.
4. Industry-leading storage efficiency: more than 80% storage utilization.
5. Independent scalability: add compute and storage separately.
6. Multi-protocol: industry-standard protocols (NFS, CIFS, FTP, HTTP, HDFS).

EMC Isilon has recently introduced a new scale-out NAS solution for Hadoop that is designed to readily support business analytics as well as other enterprise applications and workflows. (This eliminates the siloed infrastructure approach used in many initial Hadoop deployments.) The new EMC solution also eliminates the "single-point-of-failure" issue. We do this by enabling all nodes in an EMC Isilon storage cluster to become, in effect, NameNodes. This greatly improves the resiliency of your Hadoop environment. The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshots for backup and recovery and data replication (with SyncIQ) for disaster recovery. Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems. With our solutions, customers can achieve 80% or more storage utilization. EMC Hadoop solutions can also scale easily and independently. This means that if you need to add more storage capacity, you don't need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the scale increases. EMC also recently announced that we are the first vendor to integrate HDFS (Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads while eliminating the need to manually move data around as you would with direct-attached storage.
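EMC addresses the Hadoop challenges
1. Dedicated storage infrastructure. With EMC Isilon: a scale-out storage platform for multiple applications and workflows.
2. Single point of failure (NameNode). With EMC Isilon: no single point of failure, with a distributed NameNode.
3. Lack of enterprise data protection (no snapshots, replication, or backup). With EMC Isilon: end-to-end data protection with SnapshotIQ, SyncIQ, and NDMP backup.
4. Low storage efficiency (3x mirroring). With EMC Isilon: industry-leading storage efficiency, with more than 80% storage utilization.
5. Fixed scalability (fixed compute-to-storage ratio). With EMC Isilon: independent scalability, adding compute and storage separately.
6. Manual import/export (no protocol support). With EMC Isilon: multi-protocol access over industry-standard protocols (NFS, CIFS, FTP, HTTP, HDFS).

The EMC Isilon scale-out storage solution for business analytics is designed to address all of these Hadoop challenges. <click>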
EMC's enterprise Hadoop solution
EMC Greenplum HD and EMC Isilon scale-out storage.
Compute: Apache Hadoop, certified by Greenplum; simple platform management and control; parallel analytic access using Greenplum Database.
Storage: HDFS.

EMC's enterprise Hadoop solution combines the power of EMC Greenplum HD, EMC's Apache Hadoop distribution, with EMC Isilon scale-out NAS storage. [Note to speaker: The EMC Isilon scale-out solution supports any industry-standard Apache Hadoop distribution. This means that if a customer prefers another Hadoop distribution instead of our Greenplum HD, we can support it.] The Greenplum HD software, depicted at the top of the diagram, provides the "compute" function, while the Isilon storage, depicted at the bottom of the diagram, provides the "storage" function in the EMC Hadoop solution. Note that the Hadoop Distributed File System (HDFS) is integrated into the OneFS operating system used by the EMC Isilon storage systems. Together, they provide a comprehensive Hadoop solution that is easy to implement and manage. It is also highly efficient, reliable, and scalable. Our Hadoop solution can also be easily augmented with additional EMC Greenplum technologies to expand your data analytics capabilities (these will be discussed later in the presentation). Now let's look at how the EMC Hadoop solution is packaged.
Summary
Isilon serves big data applications.
An enterprise-class scale-out storage platform in which Isilon integrates natively with Hadoop.
EMC provides a broad range of specialized analytics tools, services, and expertise.

In summary, with EMC Isilon scale-out storage for Hadoop and EMC Greenplum analytics, we provide: the industry's first and only scale-out storage platform that natively integrates with Hadoop; a single-vendor solution designed to accelerate the benefits of Hadoop for your business; and an extensive array of big data analytics tools, services, and expertise that your organization can leverage. As a suggested next step, let's discuss how our Big Data + Big Analytics solutions can benefit your business!
Thank you!