Haduzilla (黑肚龍): Building a Hadoop Cluster with Debian Preseed. Unattended, automated installation of Hadoop clusters. Jazz Wang (Yao-Tsung Wang)
2 Who am I? (Who is this guy? Jazz?) Speaker introduction:
– Jazz Yao-Tsung Wang (王耀聰), Associate Researcher at NCHC; NCTU ECE Master, class of 89
– All the slides, references, and step-by-step instructions are available online.
– FOSS end user: Debian/Ubuntu, Access Grid, Motion/VLC, Red5, Debian Router, DRBL/Clonezilla, Hadoop
– FOSS promoter: DRBL/Clonezilla, Partclone/Tuxboot, Hadoop Ecosystem
– FOSS developer (one who is short on follow-through): TRTC WSU / Hadoop4Win / Haduzilla / Ezilla
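3 Data Explosion!! The era of the "data explosion" began in 2007. Source: The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, March 2007, An IDC White Paper, sponsored by EMC. IDC forecast that by 2010 the volume of data would grow sixfold compared with 2006. [Chart: data volume in EB, with the later value marked as a forecast.]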
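4 Source: Extracting Value from Chaos, June 2011, An IDC White Paper, sponsored by EMC. Tracking IDC's figures over the years: [Chart: yearly data volume in EB, actual and forecast values, with labeled points at 0.8 ZB, 1.2 ZB, and 1.8 ZB.] The Digital Universe expanded about 1.6x each year!! Did growth slow because of the weak economy, or was it held back by new technology?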
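A quick way to sanity-check a "times N per year" claim is to compute the constant multiplier implied by two points of the series. The sketch below does this for the three ZB figures that are legible on the slide, assuming they are consecutive yearly values (an assumption, since the year labels are not preserved here). The function name `growth_factor_per_step` is made up for this example.

```python
def growth_factor_per_step(start: float, end: float, steps: int) -> float:
    """Constant per-step multiplier that turns `start` into `end` after `steps` steps."""
    if start <= 0 or end <= 0 or steps <= 0:
        raise ValueError("start, end and steps must all be positive")
    return (end / start) ** (1.0 / steps)


# The three ZB figures legible on the slide, treated as consecutive yearly
# values (an assumption: the year labels are not available).
zb_series = [0.8, 1.2, 1.8]
factor = growth_factor_per_step(zb_series[0], zb_series[-1], len(zb_series) - 1)
print(f"implied growth per year over these three points: {factor:.2f}x")
```

These three points alone work out to about 1.5x per step. The slide's 1.6x is stated for the series as a whole, whose earlier values are not legible here.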
5 Now we all need to store and process BIG DATA!!
7 Features of Hadoop...
Vast amounts of data – the capability to STORE and PROCESS vast amounts of data.
Cost efficiency – runs on large clusters built of commodity hardware (ordinary PCs).
Parallel performance – with the help of the distributed file system (HDFS), Hadoop responds faster.
Robustness – when a node fails, backup data is fetched and computing resources are redeployed automatically; computing and storage resources can be added or removed without shutting down the entire system.
8 Which companies are powered by Hadoop?? Yahoo is currently the key contributor. IBM and Google teach Hadoop in universities … The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth) – from
Facebook, Twitter
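The New York Times figures can be turned into per-unit numbers with a few lines of arithmetic. This is a rough sketch that assumes all 100 instances ran for the full 24 hours (the slide only says "in the space of 24 hours"). The function names are made up for this example.

```python
def cost_per_instance_hour(total_cost_usd: float, instances: int, hours: float) -> float:
    """Average price paid for one instance running for one hour."""
    return total_cost_usd / (instances * hours)


def items_per_instance_hour(items: int, instances: int, hours: float) -> float:
    """Average number of output items produced by one instance per hour."""
    return items / (instances * hours)


# Figures quoted on the slide: about $240, 100 EC2 instances, 24 hours,
# 11 million PDFs. Full 24-hour utilisation of every instance is an assumption.
rate = cost_per_instance_hour(240, 100, 24)
pdfs = items_per_instance_hour(11_000_000, 100, 24)
print(f"${rate:.2f} per instance-hour, about {pdfs:.0f} PDFs per instance-hour")
```

Under that assumption the job cost roughly $0.10 per instance-hour, which is why a one-off batch job of this size is cheap to run on rented machines.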
9 Hadoop in production.... February 19, 2008: Yahoo! Launches World's Largest Hadoop Production Application
10 You can store and process BIG DATA via Large Cluster!!
Common method to deploy a cluster in labs
1. Set up one template machine
2. Clone it to multiple machines
3. Configure settings
4. Install a job scheduler
5. Run benchmarks
Challenges of the common method in labs
– How to upgrade software?
– How to add a new user account?
– How to keep configuration synchronized?
– How to share user data?
How to deploy Nodes ?!
Source: Deploying hadoop with smartfrog
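If you need to deploy in the cloud, try Puppet. To deploy Hadoop and similar software on Amazon EC2, consider Puppet: the operating system is already installed from the virtual machine template, so only the diskful approach is possible.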
17 Can I install ONE server to deploy hadoop cluster ?
Yes, use DRBL to deploy Hadoop. New Debian packages need to be built:
– drbl-hadoop: mounts the local disk for HDFS and MapReduce (svn co)
– hadoop-register: for multiuser registration and an SSH client (svn co)
About hadoop.nchc.org.tw
– DRBL Server x 1 node (hadoop)
– DRBL Client x 20 nodes (hadoop101~hadoop120)
– Powered by Debian Squeeze 6.0.4
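A naming scheme like hadoop101~hadoop120 is easy to generate rather than type by hand. The sketch below pairs each client hostname with a consecutive IP address and prints /etc/hosts-style lines. The hostname range comes from the slide; the starting address 192.168.1.101 and the helper `client_hosts` are made up for illustration and do not describe the real cluster's network.

```python
import ipaddress


def client_hosts(prefix: str, first: int, last: int, first_ip: str) -> list[tuple[str, str]]:
    """Pair hostnames prefix<first>..prefix<last> with consecutive IP addresses."""
    base = ipaddress.IPv4Address(first_ip)
    return [(str(base + i), f"{prefix}{n}") for i, n in enumerate(range(first, last + 1))]


# hadoop101~hadoop120 comes from the slide; 192.168.1.101 is a made-up
# starting address used only for this example.
hosts = client_hosts("hadoop", 101, 120, "192.168.1.101")
for ip, name in hosts:
    print(f"{ip}\t{name}")  # one /etc/hosts-style line per DRBL client
```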
User registration page: Hadoop-Register, powered by Zterm
System status monitoring: Ganglia. The free software Ganglia is used to collect the load status of the cluster.
DRBL + Hadoop = Haduzilla: system architecture of Haduzilla (黑肚龍)
23 Can you help me to deploy my own multiuser hadoop cluster like hadoop.nchc.org.tw ?
In 2009, I released the DRBL-Hadoop Live CD. Old video: Download:
25 But I want it installed to disks for production …. What should I do ?
On 11 Feb 2011, 4$ gave a talk about preseed! Source: Thanks to 4$ for sharing "Automated installation of Debian 6.0".
1st, we install the base system of Debian GNU/Linux with the Debian Installer and preseed. According to i/squeeze/preseed.cfg, it will (1) install the base packages of Debian, (2) install DRBL, JVM, Hadoop, etc., and (3) run the late_command script. [Diagram: the Debian Netinst CD installs the Linux kernel, kernel modules, GNU libc, and boot loader.]
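To make the roles of the preseed file more concrete, here is a small sketch that renders the two preseed lines responsible for steps (2) and (3): `pkgsel/include` adds packages on top of the base system, and `preseed/late_command` runs a command just before the installer finishes. Both keys are standard Debian Installer preseed keys, but the package list, the copied rc.local path, and the helper `render_preseed` are hypothetical and are not taken from the actual Haduzilla preseed.cfg.

```python
def render_preseed(packages: list[str], late_command: str) -> str:
    """Render the two preseed lines that add packages and run a post-install command."""
    lines = [
        # pkgsel/include lists extra packages to install on top of the base system.
        f"d-i pkgsel/include string {' '.join(packages)}",
        # preseed/late_command runs once, just before the installer finishes.
        f"d-i preseed/late_command string {late_command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only: the real package names and the
# real late_command live in the project's preseed.cfg.
fragment = render_preseed(
    packages=["openssh-server", "drbl", "hadoop"],
    late_command="cp /cdrom/example/rc.local /target/etc/rc.local",
)
print(fragment, end="")
```

Generating the fragment from a short list keeps the package set in one place, which helps when the same installer has to be rebuilt for a new release.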
2nd, after the reboot the DRBL package is already installed, and the rc.local script configures the machine as a DRBL server. Many services are needed: SSHD, DHCPD, TFTPD, NFS server, NIS server, YP server... The DRBL server builds on existing open-source software, so keep hacking! [Diagram: DRBL Server (SSHD, DHCPD, TFTPD, NFS, YP/NIS for account management, Bash, Perl, network booting) and Hadoop Server (JVM, Hadoop, Apache, Ganglia) on top of the Linux kernel, kernel modules, GNU libc, and boot loader.]
The rc.local script runs "drblsrv" and "drblpush". Afterwards, pxelinux, vmlinuz-pxe, and initrd-pxe are in TFTPROOT, and a separate set of configuration files (e.g. hostname) for each DRBL client is in NFSROOT.
3rd, we enable the PXE function in the BIOS configuration.
While booting, PXE requests an IP address from DHCPD. [Diagram: DHCPD assigns IP 1, IP 2, IP 3, and IP 4 to four clients.]
After PXE gets its IP address, it downloads the boot files from TFTPD. [Diagram: each of the four clients receives pxelinux, vmlinuz, and initrd.]
After the boot files are downloaded, scripts in initrd-pxe configure NFSROOT for each compute node. [Diagram: each client mounts its own configuration (Config. 1 to Config. 4) over NFS.]
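The boot order described on these slides (DHCPD, then TFTPD, then a per-client NFS root) can be summarised as a toy model. This is not network code and does not reflect DRBL's real directory layout: the `Client` class, the `pxe_boot` function, the 10.0.0.x addresses, and the `/nfsroot/<ip>` path are all assumptions made for illustration. Only the three boot file names come from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional

BOOT_FILES = ("pxelinux", "vmlinuz-pxe", "initrd-pxe")  # names from the slides


@dataclass
class Client:
    name: str
    ip: Optional[str] = None
    files: list = field(default_factory=list)
    nfsroot: Optional[str] = None


def pxe_boot(clients, ip_pool, nfs_base="/nfsroot"):
    """Walk each client through the three stages shown on the slides."""
    if len(ip_pool) < len(clients):
        raise ValueError("not enough IP addresses for all clients")
    for client, ip in zip(clients, ip_pool):
        client.ip = ip                       # 1. DHCPD hands out an address
        client.files = list(BOOT_FILES)      # 2. TFTPD serves the boot files
        client.nfsroot = f"{nfs_base}/{ip}"  # 3. initrd scripts pick a per-client root
    return clients


# Four clients, as in the diagrams; the addresses are hypothetical.
nodes = pxe_boot([Client(f"node{i}") for i in range(1, 5)],
                 [f"10.0.0.{i}" for i in range(1, 5)])
for node in nodes:
    print(node.name, node.ip, node.files, node.nfsroot)
```

The point of the model is the ordering: a client cannot fetch boot files before it has an address, and it cannot mount its own configuration before the initrd is running.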
Applications and services are also deployed to each compute node via NFS.... [Diagram: the DRBL server and every compute node run JVM, Hadoop, and SSHD.]
With the help of NIS and YP, you can log in to each compute node with the same ID / password stored on the DRBL server! [Diagram: an SSH client connects to SSHD on the server and on each compute node.]
Demo
[Diagram: demo network setup with WAN, Debian netinst CD, tap0, eth0:1, eth0, and iptables.]
These slides may be distributed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Taiwan license.
Questions? Slides