Introduction to HBase: Data Format and Operating Architecture. Hubert 范姜, 亦思科技.


1 Introduction to HBase: Data Format and Operating Architecture. Hubert 范姜, 亦思科技

2 Agenda
Story of HBase
Powered by HBase
Features of HBase
Infrastructure (Responsibility of Nodes)
Architecture
Take a Look!

3 Story of HBase
2003 “The Google File System”
2004 “MapReduce: Simplified Data Processing on Large Clusters”
2006 “Bigtable: A Distributed Storage System for Structured Data”

4 Features of HBase
Distributed: data is stored across multiple nodes.
Versioned: each cell can hold multiple versions of its data.
Key/Value Database
Column-Oriented?
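The "versioned key/value" idea can be pictured with a small in-memory model. This is only a sketch: the class name `VersionedTable`, the `max_versions` parameter, and the sample row and column names are illustrative assumptions, not the HBase client API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy key/value table where every cell keeps several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, versions=1):
        """Return up to `versions` entries for the cell, newest first."""
        return self._cells[(row, column)][:versions]


table = VersionedTable(max_versions=2)
table.put("row1", "cf:status", "queued", timestamp=1)
table.put("row1", "cf:status", "running", timestamp=2)
table.put("row1", "cf:status", "done", timestamp=3)

latest = table.get("row1", "cf:status")               # newest version only
history = table.get("row1", "cf:status", versions=5)  # all retained versions
```

With `max_versions=2`, the oldest of the three writes is discarded, so `history` holds only the two newest versions.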

5 Features of HBase
Non-Relational: there are no primary keys or foreign keys.
Based on Hadoop: HBase works best when deployed on top of the Hadoop file system.
"NoSQL" Database: data is not accessed with SQL, and the access model differs from that of SQL databases.
Strict Consistency

6 Members and Contributors

7 Powered by HBase
Adobe: We are using HBase in several areas, from social services to structured data and processing for internal use. We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or external systems.
Yahoo: We use HBase to store document fingerprints for detecting near-duplicates, and to query duplicated documents with real-time traffic. => Hortonworks (thousands of nodes)
Facebook: Messages

8 HBase at Twitter
Data in Twitter:
HDFS
Cassandra (created by Facebook)
HBase
FlockDB (created by Twitter): a fault-tolerant graph database

9 HBase at Facebook
Data in Facebook:
HDFS
Cassandra (created by Facebook)
HBase

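10 Choosing a NoSQL Database
CAP theorem: CA? AP?
Consistency: all nodes hold the same data at the same time.
Availability: every request is guaranteed a response, whether it succeeds or fails.
Partition tolerance: the system keeps operating even when arbitrary messages within it are lost or fail.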

11 Responsibility of Nodes

12 Responsibility of Nodes
Client: the end user of HBase, which can connect to the HBase cluster through the HBase Shell or the HBase Client API.

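13 Responsibility of Nodes
Master
Assigns the range of Regions that each Region Server must manage.
Handles load balancing across Region Servers.
Detects failed Region Servers and reassigns their Regions to other Region Servers.
Reclaims garbage files on HDFS.
Updates table schemas.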
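The Master's assignment and failover duties listed above can be sketched as a toy model. The class `ToyMaster`, the least-loaded placement rule, and the server and region names are assumptions made for illustration; the real balancer is more elaborate.

```python
class ToyMaster:
    """Toy model of a Master that assigns regions and handles server failure."""

    def __init__(self, servers):
        # region server name -> list of regions it currently manages
        self.assignment = {server: [] for server in servers}

    def assign(self, region):
        # Pick the least-loaded server; ties are broken by name for determinism.
        target = min(self.assignment,
                     key=lambda s: (len(self.assignment[s]), s))
        self.assignment[target].append(region)
        return target

    def handle_failure(self, dead_server):
        # Remove the failed server and hand its regions to the survivors.
        orphaned = self.assignment.pop(dead_server)
        for region in orphaned:
            self.assign(region)
        return orphaned


master = ToyMaster(["rs1", "rs2", "rs3"])
for i in range(6):
    master.assign(f"region-{i}")

moved = master.handle_failure("rs2")
```

After the failure of `rs2`, its two regions are spread over the remaining servers, so each survivor ends up managing three regions.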

14 Responsibility of Nodes
Region Server
Maintains the Regions assigned to it by the Master and handles I/O requests for those Regions.
Splits any Region whose storage size exceeds the threshold at runtime.

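15 Responsibility of Nodes
ZooKeeper: open-source software modeled on Google's Chubby; a coordination tool for distributed systems.
Elects the Master.
Stores the Region mapping data.
Monitors the status of Region Servers and promptly notifies the Master when a Region Server starts or disconnects.
Stores the HBase schema, including which tables exist and which column families each table has.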

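16 Responsibility of Nodes
[Diagram: ZooKeeper nodes, Masters, and Region Servers, with the labels "n nodes, n >= 1" and "an odd number"]
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Besides storing the address of the -ROOT- table and the address of the HMaster, the ZooKeeper quorum also holds an ephemeral registration for each HRegionServer, so the HMaster can detect the health of every HRegionServer at any time. ZooKeeper also removes the HMaster as a single point of failure, as described below. (Watcher)
The HMaster is not a single point of failure: HBase can start multiple HMasters, and ZooKeeper's Master Election mechanism ensures that one Master is always running. Functionally, the HMaster is mainly responsible for managing tables and Regions:
1. Handles users' create, delete, modify, and query operations on tables.
2. Manages load balancing across HRegionServers and adjusts the distribution of Regions.
3. Assigns the new Regions after a Region split.
4. Migrates the Regions of a failed HRegionServer after it goes down.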
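The ephemeral registration and Master election described above can be sketched with an in-memory stand-in. `ToyZooKeeper` and its methods are hypothetical and are not the ZooKeeper API; the "first registered candidate wins" rule is a simplifying assumption about how a leader is chosen.

```python
class ToyZooKeeper:
    """In-memory stand-in for ephemeral registrations; not the real ZooKeeper API."""

    def __init__(self):
        self.masters = []            # candidate masters, in registration order
        self.region_servers = set()  # live region servers

    def register_master(self, name):
        self.masters.append(name)

    def register_region_server(self, name):
        self.region_servers.add(name)

    def session_lost(self, name):
        # Ephemeral entries disappear as soon as the owner's session ends.
        if name in self.masters:
            self.masters.remove(name)
        self.region_servers.discard(name)

    def active_master(self):
        # Assumption: the earliest surviving candidate acts as the Master.
        return self.masters[0] if self.masters else None


zk = ToyZooKeeper()
zk.register_master("master-a")
zk.register_master("master-b")
zk.register_region_server("rs1")
zk.register_region_server("rs2")

before = zk.active_master()
zk.session_lost("master-a")   # the active Master crashes
after = zk.active_master()    # the standby takes over
zk.session_lost("rs1")        # a Region Server disconnects
```

Because registrations vanish with the session, both the standby Master takeover and the Region Server health check fall out of the same mechanism.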

17 Architecture - Data Structure

18 Data Format

19 RDB Data Format
Columns: Lot_ID, Date, Facility, Operator
Values: A000001.00, 2012/06/15, BSET, Andy, A, DSET, Mike, A, Hubert

20 HBase Data format

21 Region
Table (HBase Table)
Region (Regions for the Table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
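The containment hierarchy above maps naturally onto nested data classes. The field names and the sample table `lots` with column families `info` and `history` are assumptions chosen for illustration; this is a structural sketch, not how HBase lays out its internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoreFile:
    blocks: List[bytes] = field(default_factory=list)  # Blocks within a StoreFile


@dataclass
class Store:
    column_family: str
    memstore: Dict[str, str] = field(default_factory=dict)       # one MemStore per Store
    store_files: List[StoreFile] = field(default_factory=list)   # StoreFiles per Store


@dataclass
class Region:
    start_key: str
    end_key: str
    stores: Dict[str, Store] = field(default_factory=dict)  # one Store per ColumnFamily


@dataclass
class Table:
    name: str
    column_families: List[str]
    regions: List[Region] = field(default_factory=list)

    def add_region(self, start_key, end_key):
        # Every new Region gets one Store for each column family of the table.
        stores = {cf: Store(cf) for cf in self.column_families}
        region = Region(start_key, end_key, stores)
        self.regions.append(region)
        return region


table = Table("lots", ["info", "history"])
region = table.add_region("", "m")
```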

22 Region

23 MemStore Flush
Flushing the MemStore to disk creates an HFile.

24 [Diagram: an HTable made up of Regions; each Region contains Stores; each Store has a MemStore and StoreFiles (HFiles) made up of Blocks; labeled "Split/Compaction"]
One Store per CF.
Each flush produces one HFile.
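The rule that each flush produces one HFile can be sketched as follows. The class `ToyStore` is hypothetical, and the threshold counts entries purely to keep the example small, which is an assumption; a real MemStore threshold is expressed as a size in bytes.

```python
class ToyStore:
    """Toy Store: writes go to a MemStore, and each flush emits one sorted HFile."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumption: counted in entries
        self.memstore = {}
        self.hfiles = []  # each HFile is a sorted list of (key, value) pairs

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            # One flush writes the whole MemStore out as a single HFile.
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}


store = ToyStore(flush_threshold=3)
for i in range(7):
    store.put(f"row{i}", f"v{i}")
```

Seven writes with a threshold of three trigger two flushes, leaving two HFiles on "disk" and one entry still in the MemStore.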

25 HFile
The default maximum HFile size in HBase (hbase.hregion.max.filesize) is 256 MB.

26 Compaction
Merges multiple HFiles into one HFile.
Two types:
Minor Compaction (merges some of the files)
Major Compaction (merges all of the files): removes expired and deleted data; the Store ends up with only one StoreFile.
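The difference between the two compaction types can be sketched over toy HFiles. Here an HFile is a list of `(key, timestamp, value)` tuples and a value of `None` stands for a delete marker; these representations, the function names, and the `ttl` parameter are assumptions, and keeping only the newest entry per key is a simplification of real version handling.

```python
TOMBSTONE = None  # assumption: a value of None marks a deleted key


def merge(hfiles):
    """Merge HFiles, keeping only the newest entry for each key."""
    newest = {}
    for hfile in hfiles:
        for key, ts, value in hfile:
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    return sorted((key, ts, value) for key, (ts, value) in newest.items())


def minor_compaction(hfiles, count=2):
    """Merge only the first `count` files; delete markers are kept."""
    return [merge(hfiles[:count])] + hfiles[count:]


def major_compaction(hfiles, now, ttl):
    """Merge all files into one and drop deleted and expired entries."""
    live = [(key, ts, value) for key, ts, value in merge(hfiles)
            if value is not TOMBSTONE and now - ts <= ttl]
    return [live]


hfiles = [
    [("a", 1, "x"), ("b", 1, "y")],
    [("a", 2, TOMBSTONE), ("c", 2, "z")],
    [("d", 3, "w")],
]

minor = minor_compaction(hfiles)
major = major_compaction(hfiles, now=10, ttl=8)
```

The minor compaction still carries the delete marker for `a`, because files outside the merge might hold older data it has to mask; only the major compaction, which sees every file, can safely discard it.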

27 Benefits of Compaction
Reduces the number of HFiles.
Improves performance.
Removes expired and deleted data.

28 Performance Notes
hbase.hregion.max.filesize = ?
When the file size is small: splits happen easily (a split takes the Region offline).
When the file size is large: splits are less likely, and compactions are more likely (higher I/O cost).
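The split side of this trade-off can be made concrete with a rough count. The function `count_splits` is hypothetical and assumes a deliberately crude model: all data starts in one region, and any region above the threshold splits into two equal halves.

```python
def count_splits(total_mb, max_filesize_mb):
    """Return (number of splits, final region count) under a halving model."""
    if max_filesize_mb <= 0:
        raise ValueError("max_filesize_mb must be positive")
    regions = [total_mb]
    splits = 0
    while any(size > max_filesize_mb for size in regions):
        next_regions = []
        for size in regions:
            if size > max_filesize_mb:
                # A region over the threshold splits into two halves.
                next_regions += [size / 2, size / 2]
                splits += 1
            else:
                next_regions.append(size)
        regions = next_regions
    return splits, len(regions)


small_threshold = count_splits(4096, 256)    # many small regions
large_threshold = count_splits(4096, 1024)   # fewer, larger regions
```

Under this model, 4 GB of data needs 15 splits at a 256 MB threshold but only 3 at 1024 MB, which is the "small file size, frequent splits" effect in the slide; the compaction cost of larger files is not modeled here.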

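29 Performance Notes
The difference between a CF and a Qualifier in a table, from the read perspective:
All rows => CF?
All rows => Qualifier (one CF)?
Advantage of a CF => data in the same CF is stored in the same HFile; a single scan retrieves all the data of a CF under the same row key (the CF can be specified).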
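The read-side advantage of a column family can be sketched by keeping one store per CF and recording which stores a scan touches. `ToyRegion`, its `stores_read` log, and the sample data are illustrative assumptions rather than the HBase scan API.

```python
class ToyRegion:
    """Toy Region with one store per column family."""

    def __init__(self, column_families):
        # column family -> {row key -> {qualifier -> value}}
        self.stores = {cf: {} for cf in column_families}
        self.stores_read = []  # which stores each scan had to open

    def put(self, row, cf, qualifier, value):
        self.stores[cf].setdefault(row, {})[qualifier] = value

    def scan(self, cf, qualifier=None):
        """Scan one column family, optionally narrowed to a single qualifier."""
        self.stores_read.append(cf)
        result = {}
        for row, cells in sorted(self.stores[cf].items()):
            if qualifier is None:
                result[row] = dict(cells)
            elif qualifier in cells:
                result[row] = {qualifier: cells[qualifier]}
        return result


region = ToyRegion(["info", "history"])
region.put("lot1", "info", "facility", "BSET")
region.put("lot1", "info", "operator", "Andy")
region.put("lot1", "history", "step1", "done")

whole_cf = region.scan("info")
one_qualifier = region.scan("info", qualifier="operator")
```

Both scans open only the `info` store; the `history` store is never read. Narrowing to one qualifier still goes through the same store, which is why data that is read together is a natural fit for a single CF.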

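30 Performance Notes
The difference between a CF and a Qualifier in a table, from the write perspective:
Do not use too many CFs => this easily triggers collective flushes and compactions (compaction storms).
Reference: flush and compaction operate per Region.
Too many CFs => the different CFs (Store instances) all sit under the same Region, and each…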
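The effect of Region-wide flushing with many CFs can be sketched as follows. `FlushingRegion` is hypothetical, the threshold counts cells for brevity (an assumption), and the model follows the slide's statement that a flush covers the whole Region.

```python
class FlushingRegion:
    """Toy Region where one full MemStore forces every Store to flush."""

    def __init__(self, column_families, flush_threshold):
        self.flush_threshold = flush_threshold  # assumption: counted in cells
        self.memstores = {cf: [] for cf in column_families}
        self.hfiles = {cf: [] for cf in column_families}

    def put(self, cf, cell):
        self.memstores[cf].append(cell)
        if len(self.memstores[cf]) >= self.flush_threshold:
            self.flush_region()

    def flush_region(self):
        # The flush is Region-wide: every non-empty MemStore is written out.
        for cf, cells in self.memstores.items():
            if cells:
                self.hfiles[cf].append(list(cells))
                self.memstores[cf] = []


region = FlushingRegion(["hot", "cold1", "cold2"], flush_threshold=4)
for i in range(8):
    region.put("hot", i)
    if i % 4 == 0:  # the cold families receive only occasional writes
        region.put("cold1", i)
        region.put("cold2", i)

tiny_files = sum(len(f) == 1 for cf in ("cold1", "cold2") for f in region.hfiles[cf])
```

Only the `hot` family ever fills its MemStore, yet each of its flushes also writes a one-cell file for every cold family, and those small files are what later need compacting.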

31 Performance of Keys

32 Take a look! HBase Client

