Building More Reliable Storage Systems. IBM STG, July 2007
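Agenda: the relationship between storage high availability and business continuity; storage high-availability solutions; coordination and transition between storage HA systems and disaster recovery systems; success stories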
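Key drivers of IT strategy
Question: Which three factors are the most critical drivers of your strategy for selecting and using IT infrastructure and IT management software?
[Bar chart: percentage of respondents naming each driver. Drivers shown: disruption of critical IT services; business continuity / security requirements; availability / performance of business applications; IT service and support costs too high; improving the productivity of internal IT staff; high IT investment requirements; integrating new IT resources from M&A; pressure to outsource to third parties; other. Other labels on the chart: IT management strategy; IT infrastructure strategy; Strategic IT "Triggers".]
Source: IT Infrastructure and IT Management, North American Adoption, Drivers and Triggers Survey, Oct 1, 2004. Percentages based on actual respondent count.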
Cost of system downtime by industry (Source: Meta Group, 2001)
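Local high availability is the most basic link in ensuring business continuity. HA: the ability of a system to hide unplanned failures from end users and provide relatively seamless service. It mainly comprises hardware fault tolerance, component redundancy, automatic error diagnosis, repair, rerouting, and reconfiguration, together with capabilities such as predictive analysis, testing, fault management, and change management.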
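The relationship between local high availability and disaster recovery
HA (keyword: "local"): targets failures inside the production center; can usually meet very fast recovery times; strict RTO / RPO requirements; real-time protection; relatively simple to implement; simple switchover process.
DR (keyword: "remote"): targets failure of the production center's machine room or large-scale equipment failure; recovery time is usually longer; can generally tolerate some data loss; relatively complex to implement; switchover is complex and involves multiple internal and external steps and departments.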
Main technologies for implementing high availability
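Layers at which high availability is implemented
[Diagram: application servers, database servers, and edge devices. Techniques shown: application scalability; redundant network; server clusters; parallel databases; redundant SAN network; dual data copies; RAID 5 or RAID 10.]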
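Why build a highly available storage system
"High reliability" and "high availability" are two different concepts.
The overall availability of a system is often determined by its weakest link. For example: compared with hosts, storage systems are more likely to suffer damage (mechanical parts, microcode changes, data loss, and so on); compared with hosts, storage systems play a more critical role.
Goal: address the "hardware fatigue" and "software anomaly" problems of storage.
Standalone availability of the server tier: 99.999%. Standalone availability of the storage tier: 98%. Standalone availability of the application tier: 98%. The overall availability is then: 99.999% * 98% * 98% = 96.039%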
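The arithmetic on this slide can be checked with a short sketch. It assumes a simple series model in which the server, storage, and application tiers must all be up and fail independently, and it takes 98% for the application tier, as in the slide's product. The function names and the 8760-hour year are illustrative choices, not part of the original deck.

```python
from math import prod


def series_availability(components):
    """Overall availability when every component must be up (series model)."""
    return prod(components.values())


def downtime_hours_per_year(availability, hours_per_year=8760):
    """Expected unavailable hours per year for a given availability."""
    return (1.0 - availability) * hours_per_year


# Standalone availabilities taken from the slide.
components = {"server": 0.99999, "storage": 0.98, "application": 0.98}

overall = series_availability(components)
print(f"overall availability: {overall:.3%}")  # 96.039%
print(f"expected downtime: {downtime_hours_per_year(overall):.0f} hours/year")
```

Under this model the five-nines server contributes almost nothing to the total loss, so raising storage availability is where the overall figure moves.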
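Agenda: the relationship between storage high availability and business continuity; storage high-availability solutions; coordination and transition between storage HA systems and disaster recovery systems; success stories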
Overview of availability implementation options
HA tiers: 1 No Backup; 2 Tape Backup; 3 FlashCopy; 4 Metro Mirror; 4+ HACMP/XD; 5p LVM Mirror; 5z GDPS HyperSwap
Rows, with values in the order they appear on the slide:
Configuration
RTO (1): ?; Days to Hours; Hours to Days; Minutes to Hours; Seconds to Hours; Seconds
Relative RTO: 8640; 360; 60; 1
Forward Recovery: N/A; Yes; No
Backup Window: Hours
Data loss: No (2)
Application: Fail; Continue
Relative Storage Hardware Cost (3): DS4000: 1.5, 2.4, 3.5, 3.4; DS6000: 1.3, 1.6, 2.8, 2.6; DS8000p: 1.1, 1.7, 2.3; DS8000z: 1.4, 2.9
Notes: (1) Recovery Time Objective. (2) Assumes all cache data on the server is hardened. (3) Storage hardware cost comparison at 10 TB effective user capacity.
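Local high-availability solutions
Business value: 100% resilience of local data access; no application downtime caused by disk device failures, or downtime kept to a minimum; complements the remote disaster recovery system; simple, practical data protection and failure recovery procedures.
IBM's comprehensive support: coordinated host and storage technologies; solutions for multiple architectures; system assessment and planning; a local services team.
Architectures shown:
Today's typical architecture (hosts highly available, storage not): Active Server and Backup Server under HACMP, attached through the SAN to a DS8000 / 4000.
Option 1, data mirroring between disk systems: Active Server and Backup Server under HACMP; LVM mirrors Copy 1 and Copy 2 across two DS8000 / 4000 systems over the SAN.
Option 2, data replication between disk systems: Active Server and Backup Server under HACMP; PPRC (Metro Mirror) replicates the Primary Copy to the Target Copy across two DS8000 / 4000 systems over the SAN.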
Solution 1: data mirroring between disk systems (AIX LVM Mirror)
Implemented with the LVM volume management software. [Diagram: one host mirroring to Disk 1 and Disk 2]
LVM advantages: IBM is the only vendor that supports both LVM mirroring and disk replication. LVM data mirroring combines host and storage technologies, and IBM provides state-of-the-art solutions with these capabilities. The solution requires less investment and is easy to manage, with no failover operation. LVM is built into AIX, so implementation effort and cost are much lower.
When Disk 1 fails, Disk 2 does not need to be remounted on the host and the application is not interrupted, which gives a truly seamless takeover.
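The takeover behaviour described on this slide can be illustrated with a toy model. This is a conceptual sketch only, not AIX LVM: the `Disk` and `MirroredVolume` classes are hypothetical names, and the model assumes every write goes to all healthy copies while reads come from any surviving copy.

```python
class Disk:
    """A toy disk: block number -> data, plus a failure flag."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False


class MirroredVolume:
    """Toy host-side mirror: every write goes to all healthy copies."""

    def __init__(self, *disks):
        self.disks = list(disks)

    def _healthy(self):
        healthy = [d for d in self.disks if not d.failed]
        if not healthy:
            raise IOError("all mirror copies have failed")
        return healthy

    def write(self, block, data):
        for disk in self._healthy():
            disk.blocks[block] = data

    def read(self, block):
        # Served by the first healthy copy; there is no remount or switchover step.
        return self._healthy()[0].blocks[block]


disk1, disk2 = Disk("disk1"), Disk("disk2")
volume = MirroredVolume(disk1, disk2)

volume.write(0, b"order-1001")
disk1.failed = True              # simulate loss of the first disk system
survivor_data = volume.read(0)   # still readable, now from disk2
volume.write(1, b"order-1002")   # writes continue on the surviving copy
```

The application-facing `read` and `write` calls do not change when a copy fails, which is the point of the slide. A real volume manager also has to track which copy is stale and resynchronize it once the failed disk returns; the sketch leaves that out.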
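Characteristics of Solution 1: 100% continuous availability; no switchover operation, so it is simple to implement and manage; dual disk writes, with only a slight performance impact; low investment, since LVM is a standard feature of AIX on pSeries hosts, with no additional software to purchase and only a small implementation cost; IBM is the only vendor that can offer both LVM mirroring and disk replication; mature technology.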
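Solution 2: data replication between disk systems (PPRC / RVM)
Implemented with the data replication function of the disk hardware. [Diagram: pSeries Active Server and Backup Server under HACMP on a redundant network, attached through the SAN; the Active Disk is replicated by PPRC to the Backup Disk; RAID 5 or RAID 10.]
DS8000 / 6800 / ESS800: PPRC (Metro Mirror). DS4000: RVM.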
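Characteristics of Solution 2: zero data loss and fast switchover; independent of the host platform and transparent to servers; mature technology with many successful deployments; simple to implement and easy to manage; easy to evolve later into a three-site business continuity solution.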
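Solution for heterogeneous storage environments (SVC)
Implemented with SVC (SAN Volume Controller); supports the industry's mainstream disk storage devices. [Diagram: primary host and standby host; IBM and other vendors' storage]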
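Characteristics of the heterogeneous storage solution: independent of the host platform; independent of the storage platform; can be retrofitted smoothly onto an existing SAN. Additional business value: seamless online data migration; balanced utilization across multiple disk systems; lower cost of purchasing feature software.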
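Agenda: the relationship between storage high availability and business continuity; storage high-availability solutions; coordination and transition between storage HA systems and disaster recovery systems; success stories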
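Positioning of the disaster recovery system
Threats: hardware failure (server, network device, disk); accidental deletion (during system configuration or while reorganizing storage volumes); software / application errors (data corruption); malicious actions (viruses, hackers); environmental disasters (floods, fires, large-scale prolonged power outages, and so on).
Responses: high availability and fast system recovery; fast data recovery; disaster recovery.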
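Comparison of coverage (Local HA vs. DR)
Range of failures and disasters that can be handled: component failure; hardware failure of a single device; power failure of a single device; localized fire or other damage within the machine room; disaster affecting the whole production center (fire, flood, etc.); regional disaster (power, communications, flood, etc.); large-area disasters such as earthquakes, typhoons, and tsunamis; war and similar disasters.
Aspects involved in building the project: disaster recovery center machine room; top-executive sponsorship; cooperation of business departments; staffing; operations framework; technology; operating costs such as network links.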
Definitions and concepts for different tiers of business continuity
[Chart: cost against recovery time objective. Recovery time objective axis: 15 min.; 1-4 hr.; 4-8 hr.; 8-12 hr.; 12-16 hr.; 24 hr.; days. Recovery sources: recover from the corresponding disk mirror; recover from tape copies.]
Tier 7: disk mirroring with automated recovery
Tier 6: disk mirroring without automation
Tier 5: software-based replication
Tier 4: point-in-time disk copy
Tier 3: electronic data transfer
Tier 2: tape backup with a backup site
Tier 1: tape backup
Speaker notes: When we discussed the hardware infrastructure, we noted that the challenge in most enterprises is that data has a wide variety of value points, while the underlying storage infrastructure generally offers only a few cost points. That makes it difficult to map the value of information to the appropriate cost of storage. A similar conversation applies to Advanced Copy Services. The same enterprise data that has different value points also has different recovery requirements.
Recovery requirements are generally measured using two metrics. The first is the Recovery Point Objective (RPO). The RPO can be thought of as the degree of difference between the active online data and the disaster recovery copy of that data. An RPO of zero means that the primary copy and the disaster recovery copy are in exact synchronization, so a failure results in zero loss of data. Intuitively, this is what every IT manager would like to have, but it is generally quite expensive to implement. Some data, maybe all of it depending on your business, can stand a longer RPO, meaning that a failure would result in some transactional data being lost.
The other metric is the Recovery Time Objective (RTO). The RTO is the amount of time after a failure that you are willing to spend before a given application or group of data is back up and available. An RTO of zero means that failures should cause zero disruption. Again, this is what most IT managers would love to have if cost were not a factor.
What we want to accomplish with Advanced Copy Services is to implement multiple levels of recoverability, with multiple levels of associated cost, so that IT managers can more effectively map the value and recovery needs of their data to the most appropriate recovery capability. By design, IBM offers purpose-built advanced copy services all along this recovery hierarchy.
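The two metrics defined in the notes above can be made concrete with a small sketch. It assumes that the achieved recovery point is the gap between the last replicated update and the failure, and that the achieved recovery time is the gap between the failure and restoration of service. The `Objectives` class, the function names, and all timestamps are hypothetical illustrations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum tolerated gap between live data and the DR copy
    rto: timedelta  # maximum tolerated time until the service is available again


def achieved_rpo(last_replicated_at, failed_at):
    """Data written after the last replicated update is lost in the failure."""
    return failed_at - last_replicated_at


def achieved_rto(failed_at, restored_at):
    """Time the application or data was unavailable."""
    return restored_at - failed_at


def meets(objectives, last_replicated_at, failed_at, restored_at):
    return (achieved_rpo(last_replicated_at, failed_at) <= objectives.rpo
            and achieved_rto(failed_at, restored_at) <= objectives.rto)


# Hypothetical incident: the DR copy lags by 4 seconds, service returns in 10 minutes.
failure = datetime(2007, 7, 1, 12, 0, 0)
last_copy = failure - timedelta(seconds=4)
restored = failure + timedelta(minutes=10)

tight = Objectives(rpo=timedelta(0), rto=timedelta(minutes=15))
relaxed = Objectives(rpo=timedelta(minutes=1), rto=timedelta(hours=4))

tight_ok = meets(tight, last_copy, failure, restored)
relaxed_ok = meets(relaxed, last_copy, failure, restored)
```

The same incident fails the zero-RPO objective and passes the relaxed one, which is why different data can be matched to different, and differently priced, recovery tiers.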
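Storage high-availability design lays the technical foundation for building a disaster recovery system
[Diagram: two identical stacks, one per site: Application, Middleware, DB Interface, Database, Dev Interface, OS, SAN, Disk.]
Replication techniques and products shown: Application Replicate; Two-Way Commit; Log Replay; Log Shipping; GeoRM; MMIX; VVR; LVM Mirror; Symphony; IPStor/Maxxan; PPRC; PPRC-XD; SRDF; TrueCopy.
Comparison scales: implementation difficulty (easy to hard); management difficulty (easy to hard); expansion difficulty (easy to hard); development investment (small to large); services investment (small to large); hardware investment (large to small).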
Transition example: three-site business continuity solution
For customers with an absolute requirement for constant access to data, IBM Metro/Global Mirror for the DS8000 offers three-site disaster recovery/backup support as a generally available function.
It is a three-site, cascading remote copy solution using the IBM System Storage DS8000. It uses the separately priced IBM Metro Mirror, IBM Global Mirror, and IBM FlashCopy® functions.
Benefits:
Designed to maintain a consistent and restartable copy of the data at the remote site, with minimal impact on applications at the local site.
Designed to allow data consistency to be managed across multiple machines configured in a single session, with data currency at the remote site lagging behind the local site by as little as 3 to 5 seconds.
Designed to support failover / failback modes for efficient resynchronization, with incremental resync of changes if any one of the sites is lost (includes site A to site C resync for the DS8000).
Can help reduce the load on site A in comparison with multi-target (non-cascaded) three-site mirroring solutions.
Requires less network bandwidth, and therefore lower cost, than offerings from competing vendors.
[Diagram: synchronous copy using Metro Mirror, up to 300 km (more than 300 km with RPQ); asynchronous copy using Global Mirror, virtually unlimited distance.]
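The cascading idea (a synchronous first hop followed by an asynchronous second hop) can be sketched as a toy model. This is an illustration under simplifying assumptions, not the DS8000 implementation: the `CascadedMirror` class is a hypothetical name, lag is counted in records rather than seconds, and consistency groups, FlashCopy, and failover / failback are not modelled.

```python
from collections import deque


class CascadedMirror:
    """Toy model: site A -> site B synchronously, site B -> site C asynchronously."""

    def __init__(self):
        self.site_a, self.site_b, self.site_c = [], [], []
        self._pending_for_c = deque()

    def write(self, record):
        # Synchronous hop: the write completes only once A and B both hold it.
        self.site_a.append(record)
        self.site_b.append(record)
        # Asynchronous hop: C receives the record later, in order.
        self._pending_for_c.append(record)

    def drain_to_c(self, max_records=None):
        """Ship queued records to C (models one asynchronous transfer cycle)."""
        count = len(self._pending_for_c) if max_records is None else max_records
        for _ in range(min(count, len(self._pending_for_c))):
            self.site_c.append(self._pending_for_c.popleft())

    def lag_at_c(self):
        """Number of records that B holds but C has not yet received."""
        return len(self._pending_for_c)


mirror = CascadedMirror()
for record in ["txn-1", "txn-2", "txn-3"]:
    mirror.write(record)
mirror.drain_to_c(max_records=2)  # C catches up partially
mirror.write("txn-4")
```

At this point site B holds every record that site A does, while site C trails by two records. Losing site A alone therefore loses nothing, and only a loss of both A and B exposes the asynchronous lag.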
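Agenda: the relationship between storage high availability and business continuity; storage high-availability solutions; coordination and transition between storage HA systems and disaster recovery systems; success stories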
Selected success stories in China
Case 1: a customer in the railway industry. [Diagram components: IBM HACMP, p570, M85, DS6800, LVM, SAN, LAN]
Case 2: a customer in the petroleum industry
Summary: why choose IBM storage high-availability solutions
PROTECT mission-critical applications with continuous data availability.
COMPREHENSIVE offerings: advanced data mirroring technologies.
SIMPLIFIED deployment: local distance and transparency to applications.
EXPERTISE leveraged: complete sets of HA consulting and services.
Thank you