Cloud Computing and its Technologies and Applications

Slides:



Advertisements
Similar presentations
MMN Lab 未來教室與雲端化學習 Yueh-Min Huang Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan
Advertisements

Web Role 的每台虚机运行有 IIS ,用于处理 Web 请求 Worker Role 用于运行后台进程 Cloud Service 是什么? 支持多层架构的应用容器 由多个 Windows 虚拟机集群构成 集群有两种类型: Web 和 Worker Cloud Service 做什么 进行应用的自动化部署.
13-1 人工智慧 13-2 雲端運算 13-3 感測網路與物聯網 13-4 生物資訊 13-5 計算機萬能嗎?
云计算辅助教学风云录 黎加厚 上海师范大学教育技术系 2010年8月9日.
云计算及安全 ——Cloud Computing & Cloud Security
DATE: 14/10/2009 陳威宇 格網技術組 雲端運算相關應用 (Based on Hadoop)
教育雲端科技的現況與未來發展 臺北市政府教育局聘任督學 韓長澤.
第6章 資料庫管理系統 6-1 關聯式資料庫管理系統 6-2 SQL Server資料庫管理系統
台灣雲端運算應用實驗中心研發計畫 計 畫 期 間:自98年7月1日至99年6月30日止 執行單位名稱 :財團法人資訊工業策進會 國立中山大學.
資料庫設計 Database Design.
第8章 系統架構.
HADOOP的高能物理分析平台 孙功星 高能物理研究所/计算中心
-Artificial Neural Network- Hopfield Neural Network(HNN) 朝陽科技大學 資訊管理系 李麗華 教授.
寻找适合您的工业4.0 Dell/曾峰.
大数据在医疗行业的应用.
Introduction to MapReduce
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
An Adaptive Cross-Layer Multi-Path Routing Protocol for Urban VANET
YARN & MapReduce 2.0 Boyu Diao
Dave 云的未来: PaaS软件 Dave
高级软件工程 云计算 主讲:李祥 QQ: 年12月.
雲端運算虛擬主機技術的發展.
Cloud Computing XuZhicheng
異質計算教學課程內容 「異質計算」種子教師研習營 洪士灝 國立台灣大學資訊工程學系
王耀聰 陳威宇 國家高速網路與計算中心(NCHC)
作業系統 補充: 雲端運算.
和諧社區資訊服務推廣計畫 -軟體雲端社區 資訊研習營
Knowledge Engineering & Artificial Intelligence Lab (知識工程與人工智慧)
圖形溝通大師 Microsoft Visio 2003
中国散裂中子源小角谱仪 的实验数据格式与处理算法 报告人:张晟恺 中国科学院高能物理研究所 SCE 年8月18日
Popular Uses of ABC/M - the 1st half
Cloud Computing(雲端運算) 技術的現況與應用
斯巴達帶大家上雲端.
崑山科技大學 曾 龍 資訊工程系系主任 數位生活研究所所長 雲端運算與資通安全研發中心主任
Decision Support System (靜宜資管楊子青)
Android 课程讲义 智能手机开发
创建型设计模式.
巨量資料分析與應用 (1) 楊立偉教授 台大工管系暨商研所 2014 Fall.
國立屏東高級工業職業學校 雲端網路及 雲端開系統介紹
常见问题解答 II. App上重置并清空数据库之后,手机app找不到圣诞灯怎么办? I. 打开APP,发现并连接不了圣诞灯怎么办?
新世代電子商務(二): 裝置服務化與行動商務
微软新一代云计算 面向企业的 Office 365 客户培训大纲
第4章(1) 空间数据库 —数据库理论基础 北京建筑工程学院 王文宇.
Decision Support System (靜宜資管楊子青)
引例 问题1 从甲、乙、丙3名同学中选出2名参加某天的一项活动,其中1名同学参加上午的活动,1名同学参加下午的活动,有多少种不同的方法?
10.2 排列 2006年4月6日.
练习: 由三个不同的英文字母和三个不同的阿拉伯数字组成一个六位号码(每位不能重复),并且3个英文字母必须合成一组出现,3个阿拉伯数字必须合成一组出现,一共有多少种方法?
Connecting Education and Career through Learning
資料庫 靜宜大學資管系 楊子青.
Sensor Networks: Applications and Services
高性能计算与天文技术联合实验室 智能与计算学部 天津大学
Maintaining Frequent Itemsets over High-Speed Data Streams
Real-Time System Software Group Lab 408 Wireless Networking and Embedded Systems Laboratory Virtualization, Parallelization, Service 實驗室主要是以系統軟體設計為主,
橫跨電腦、手機與軟體的全方位端點管控解決方案
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
虚 拟 仪 器 virtual instrument
Cisco Troubleshooting and Maintaining Cisco IP Networks (TSHOOT)
中美图书馆之间合作的过去、现在和未来 Sino-U. S
Google Local Search API Research and Implementation
A Data Mining Algorithm for Generalized Web Prefetching
資料庫系統實驗室 指導教授:張玉盈.
百万亿次超级计算机诞生记 姓名 Xiangyu Ye 职务 微软中国技术中心资深HPC顾问 公司 微软中国
11 Overview Cloud Computing 2012 NTHU. CS Che-Rung Lee
Operating System Software School of SCU
Cloud Computing XuZhicheng
MGT 213 System Management Server的昨天,今天和明天
ADX series Configuration
Experimental Analysis of Distributed Graph Systems
ppt宝藏提供 中国银行业信息化系统建设研讨会
Section 1 Basic concepts of web page
Presentation transcript:

Cloud Computing and its Technologies and Applications Ming-Syan Chen 陳銘憲 Director Research Center for Information Technology Innovation Academia Sinica March 19, 2010

Received talk request… What to talk? In an hour Be relevant

Outline Introduction to Cloud Computing Taiwan’s opportunity Cloud applications Technology issues Work in NetDB Conclusion Good morning and welcome you all to IIS! M.-S. Chen 3

雲端運算定義 (一)雲端運算是一種經由網路取得遠端電腦運算服務的商業模式與技術組成 1.資料(data)與軟體服務(service)移往網際網路上大型(通常為1萬台以上主機)、可延展(scalable)、共用(shared)的資料中心 2.雲端資料中心提供無限延展的運算、儲存與應用程式,使用者利用具備網際網路連線能力之電腦終端裝置(device),即等同擁有一部虛擬超級電腦 (二)未來電腦運算就像是水、電一樣,只要連上網路就可以使用,不必各自投資發展 M.-S. Chen

Cloud Operation System 未來雲端資料中心的樣貌與效益 Thin Client Cloud Operation System 資料來源:ISI INTERNATIONAL STRATEGY & INVESTMENT Source: www.spectrum.ieee.org 02/2009 M.-S. Chen

Why Cloud Computing Better resource sharing Smaller overall IT cost Better management of computing resources Easy resource access for users Better SW/AP dissemination Facilitating ubiquitous computing/LBS Prosperous in view of technology trend HW cost, network availability, value of SW, etc M.-S. Chen

Outline Introduction to Cloud Computing Taiwan’s opportunity Cloud applications Technology issues Work in NetDB Conclusion Good morning and welcome you all to IIS! M.-S. Chen 7

雲端運算創造 台灣資訊業高值化、服務化轉型升級契機 雲端運算科技,徹底改變資訊產業供應鏈樣貌與資訊科技應用方式,對資訊產業產生重大影響: (一)一般使用者的電腦終端,不需要安裝軟體,使用雲端軟體服務(SaaS, Software-as-a-Service),上網就可以使用軟體應用程式服務 (二)程式設計師不用安裝軟體應用程式開發軟體,使用雲端應用平台服務(PaaS, Platform-as-a-Service),上網就可以完成軟體程式系統開發 (三)企業不用自行建置機房,使用雲端資料中心服務 (IaaS, Infrastructure-as-a-Service),上網就可以購買主機運算與儲存空間,發展自己需要的資訊系統 台灣資通訊產業擅長製造,面對這一波雲端運算以軟體服務為主的競爭時代,台灣資訊產業必須轉型升級高附加價值的系統製造與應用服務發展。 M.-S. Chen

雲端運算為台灣硬體產業帶來機會: 雲端貨櫃型電腦 1980年代 – “ 每個人的桌上都有一台 個人電腦” 2010年代 – “ 每個人在行動中擁有一台 超級電腦” 3000 貨櫃型電腦 (雲端運算) Internet Billion (US$) 行動裝置 (行動運算) 提供行動服務 1000 個人電腦 (個人運算) 提供個人&辦公室應用 300 Mainframe (集中運算) 傳統企業ERP/MRP, 自動化等應用 100 1939 1980 1995 2005 2015 Year 「雲端貨櫃型電腦資料中心」讓未來電腦運算就像是水、電一樣,只要連上網路就可以無限使用。 M.-S. Chen

雲端運算為台灣軟體產業帶來機會: 智慧生活創新應用服務 半導體產業 Semiconductor Industry 資訊軟體產業 Software Service Industry 類比 資 訊 設 備 / 裝 置 雲 端 設 備 / 裝 置 Foundries (TSMC, UMC) Cloud Computing IT device Cloud device Enable Enable Fabless Chip Design(e.g. nVidia…) 無須自建生產線的晶片設計 Datacenter-less SaaS provider 無須自建資料中心的軟體服務 Source: “Above the Clouds: A Berkeley View of Cloud Computing” Feb. 4, 2009 & Revision Opportunities in Smart Living 未來台灣資訊軟體公司可以不需要煩惱電 腦主機建置與維護,可以基於雲端資料中心 設施來直接開發軟體,供全世界用戶使用, 創造無限商機 例如: Zynga.com基於Facebook公司的雲端運算平 台,發展FarmVille(開心農場),於短短2年內創造1 億美金營收,打破Google紀錄 Smart food systems Intelligent oil field technologies Smart traffic systems Smart healthcare Smart energy grids Smart retail M.-S. Chen Smart supply chains Smart countrie s Smart weather Smart regions Smart cities Smart water management Source: IBM

台灣雲端運算利基市場與商機 (一)貨櫃式電腦系統: 傳統資料中心要建設上萬台電腦組成之雲端資料中心,涉及複雜的軟、硬體系統與水、電、冷卻設施整合,採用貨櫃式電腦系統,可加速雲端資料中心建設,以台灣伺服器產業出貨量世界第一的實力,有發展利基 (二)開放、安全雲端作業系統: 10萬台以上電腦的伺服器、儲存器、網路設備需要作業系統支援虛擬化、叢集化之整合運作,目前產業沒有技術標準,業者需要開放架構以避免被牽制,以台灣資訊安全軟體國際實力與法人軟體工程人才投入,可基於開放源碼,開發開放、安全雲端作業系統,提供雲端資料中心建置必備軟體 (三)雲加端創新應用服務: 台灣基於多元、優勢服務業知識基礎,加上政府推動智慧台灣與六大新興產業,可藉此利基發展各式提昇民眾生活水準之智慧生活創新應用服務,並藉雲端服務加上台灣終端裝置出貨量世界第一優勢,快速銷往國際 M.-S. Chen 11

Outline Introduction to Cloud Computing Taiwan’s opportunity Cloud applications Technology issues Work in NetDB Conclusion Good morning and welcome you all to IIS! M.-S. Chen 12

Software as a Service Google 所有的應用服務 網路Video分享平台 網路照片分享 Yahoo的應用服務. Other M.-S. Chen

Software as a Service (cont’d) Social Network 微網誌 Twitter, Plurk 網路布落格 Blog CRM, ERP 等企業應用服務 M.-S. Chen

Platform as a Service 主要提供API給用戶開發雲端版本的應用,以提供服務, 例如: Google AppEngine Facebook f8 platform facebook系統提出API,主要讓 第三方廠商使用facebook的SNS平台 ,節省開發成本集中在創意的應用 程式開發 M.-S. Chen

Platform as a Service (cont’d) Microsoft Azure 新一代的微軟雲端運算 平台,其包含Azure, SQL Azure和AppFabric IBM Pangoo Platform (盤古雲端服務平臺) 架在WAS、DB2與Tivoli等IBM產品上的平台即服務(PaaS)雲端架構 Amazon Web Services AWS is PaaS, EC2 is IaaS M.-S. Chen

Infrastructure as a Service 以Amazon 的EC2 (Elastic Compute Cloud) 最為著名 Pay as you go, 按執行的時間與數量有其計價模式 將資訊基礎設施透過虛擬化的平台整合的服務,用戶不需要採購伺服器、租用實際空間或網路,透過委外租用IaaS的方式取得所需要的資源 M.-S. Chen

Amazon Business Model EC2 On-Demand Instances 不同的CPU計算能力與記憶體需求有不同的收費 Default is 1.7 GB of memory, 1 EC2 Compute Unit 160 GB of local instance storage, 32-bit platform M.-S. Chen

Outline Introduction to Cloud Computing Taiwan’s opportunity Cloud applications Technology issues Work in NetDB Conclusion Good morning and welcome you all to IIS! M.-S. Chen 19

虛擬運算技術(支援運算資源分享能力) 虛擬化/虛擬運算技術(Virtualization) 是藉由一種對應方式 (virtual machine monitor, hypervisor, or virtualization layer),將一群硬體,例如:伺服器、儲存器,轉成虛擬裝置(devices),使不同種作業系統(operating system) 能共同使用這一群硬體。 Source: Mendel Rosenblum Stanford U., 1998 M.-S. Chen

叢集運算技術(支援高效能高延展運算能力) 將許多實體電腦(通常是相同規格),以網路連結,實現高延展與高效能的分散式運算(例如:Google Search) M.-S. Chen

軟體即服務供應平台技術 提供租用軟體之線上訂閱、認證、 依據用戶之付費方案,提供租用軟體之運算資源配置,與服務品質管理 、授權、使用紀錄/清算/計費等功能 依據用戶之付費方案,提供租用軟體之運算資源配置,與服務品質管理 Enterprise A Enterprise B Enterprise C Enterprise D On-demand Application On-demand Application On-demand Application On-demand Application Multi-tenancy Service Delivery Platform Runtime Access Control Order Management Management Agent Metering Provisioning Security Log Management Log Usage Tracking Identity Management Billing CRM Performance Availability Security SLA Monitoring Management Alerts Call Center Support System M.-S. Chen 圖來源:Microsoft 圖來源:Forrester Research

虛擬桌機與網路桌面技術 Virtual Desktop Web Desktop (Webtop) 終端軟體與資料可全部移往雲端資料中心實體/虛擬主機執行,透過串流(streaming)技術與豐富型網頁(RIA,Rich Internet Application)技術,以平價簡易終端,就可提供使用者目前相同於或超越於個人電腦桌面的使用體驗 Virtual Desktop Web Desktop (Webtop) http://g.ho.st M.-S. Chen

Outline Introduction to Cloud Computing Taiwan’s opportunity Cloud applications Technology issues Work in NetDB Conclusion Good morning and welcome you all to IIS! M.-S. Chen 24

Cloud Computing Related Work in NetDB Content search 2D bar code, watch (with CR Kung and H. Chi) Social network enhanced search/queries Mining for SaaS activities 團購 (with YH Hung) Sequential pattern mining on Cloud (with JW Huang) M.-S. Chen

S01 S02 S03 S04 S05 S06 A B C AD B C BD AD B A A BC B C A C D D C A BC SID t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 … time Db1,5 Db2,6 Db3,7 Db4,8 Db5,9 Db6,10 We can see there is a frequent sequential pattern A,B in the first POI, i.e., DB1,5. However, after t6, AB is no longer a frequent sequential pattern in any POI. If we do not delete obsolete items from the static or incremental database. AB will be regarded as an interesting pattern until t8, when S06 joins the database. In addition, S02 has no more item after t5. S02 should be deleted from the database accordingly. POI=5, min_supp=0.5 M.-S. Chen 26 26

Motivation for Mining on Cloud With the increasing amount of data, single processors struggle to scale up. Mining progressive sequential patterns intrinsically suffers from the scalability problem. When the number of sequences grows and the POI becomes larger, the time and space used to conduct progressive sequential patterns will increase dramatically. M.-S. Chen 27

Distributed Algorithms in Hadoop Platform Hadoop is an open source project aiming at building a cloud infrastructure running on large clusters for a huge amount of data. Hadoop platform implements Map/Reduce paradigm. Easy to scale up (adding more machines into the clusters) in Hadoop platform. Distributed algorithms devised lead to performance improvement on sequential pattern mining (PAKDD 2010) M.-S. Chen 28

Device OS (Android, WinCE, BIOS. .) Java/.NET Application Platform 資策會成立雲端服務技術中心 研發雲加端服務平台研發與從事「高附加價值」雲端應用服務的委託設計(Cloud Device-and-Service ODM),協助台灣資訊業者技術升級,進軍雲端應用服務國際市場。 雲(加)端服務平台 雲端資安 Phone E-Book NetBook Display 虛擬桌機與網路桌面 Device OS (Android, WinCE, BIOS. .) 文化創意 應用 醫療照護 應用 中小企業 應用 軟體即服務供應平台 Java/.NET Application Platform 分工說明 研發實驗用途之雲端資料中心 (高雄軟體園區) 業者 資策會 開放式雲端作業系統 外商 M.-S. Chen 一般商用伺服務器、儲存與網路設備 29

工研院成立雲端運算中心 研發「貨櫃式電腦(Container Computer)」及「雲端作業系統(Cloud OS)」,扶植台灣業者推出「整廠輸出」形式的雲端資料中心產品,進軍全球雲端運算市場。 Server 分工說明 Cloud Data Center Cloud OS Container Computer 業者 Storage 工研院 Cloud Data Center Network 外商 M.-S. Chen 30

Conclusion Elastic IT 時代就要來臨 Cloud computing will become prevalent in light of the following Drop of hardware cost Improvement of network/hardware speed Increase of storage (disk/memory) size Demand for ubiquitous applications Explosion of information and data (data mining) M.-S. Chen

Thank you! M.-S. Chen

Amazon Business Model EC2 Reserved Instances 若是要保留Instance, 則有不同的計價方法 M.-S. Chen

Amazon Business Model EC2 Data Transfer 資料傳輸有額外的計價, M.-S. Chen

Simple Storage Service (Amazon S3) S3是Amazon提供的線上儲存服務,主要針對有網路上空間使用需求的企業或者使用者所設,將儲存檔案用的線上空間租借給使用者 M.-S. Chen

Amazon EC2(Elastic Compute Cloud) Create an Amazon Machine Image (AMI) containing your applications, libraries, data and associated configuration settings. Or use our pre-configured, templated images to get up and running immediately. Upload the AMI into Amazon S3. Amazon EC2 provides tools that make storing the AMI simple. Amazon S3 provides a safe, reliable and fast repository to store your images. Use Amazon EC2 web service to configure security and network access. Choose the type(s) of instance you want to run. Start, terminate, and monitor as many instances of your AMI as needed, using the web service APIs. Pay for the instance-hours and bandwidth that you actually consume. M.-S. Chen 36

Amazon EC2(Elastic Compute Cloud) Instances (per instance-hour ) $0.10 - Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform $0.40 - Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform $0.80 - Extra Large Instance 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform Data Transfer $0.10 per GB - all data transfer in $0.18 per GB - first 10 TB / month data transfer out $0.16 per GB - next 40 TB / month data transfer out $0.13 per GB - data transfer out / month over 50 TB M.-S. Chen

Cloud Computing Related Work in NetDB (cont’d) Infrastructure as a Service (IaaS) 現階段透過Eucalyptus架設使用者可執行並控制 virtual machine instances的環境, 可直接相容於EC2/S3的API工具. 暫時架設100左右的機器, 未來可能持續增加 M.-S. Chen

Mining on Cloud: Sequential Pattern Mining Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval, continuously advancing as the time goes by. The sequences having elements whose timestamps fall into this period, POI, contribute to the |Db| for current sequential patterns. M.-S. Chen 39

S01 S02 S03 S04 S05 S06 A B C AD B C BD AD B A A BC B C A C D D C A BC SID t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 … time Db1,5 Db2,6 Db3,7 Db4,8 Db5,9 Db6,10 We can see there is a frequent sequential pattern A,B in the first POI, i.e., DB1,5. However, after t6, AB is no longer a frequent sequential pattern in any POI. If we do not delete obsolete items from the static or incremental database. AB will be regarded as an interesting pattern until t8, when S06 joins the database. In addition, S02 has no more item after t5. S02 should be deleted from the database accordingly. POI=5, min_supp=0.5 M.-S. Chen 40 40

Motivation for Mining on Cloud With the increasing amount of data, single processors struggle to scale up. Mining progressive sequential patterns intrinsically suffers from the scalability problem. When the number of sequences grows and the POI becomes larger, the time and space used to conduct progressive sequential patterns will increase dramatically. M.-S. Chen 41

Distributed Mining Algorithm on Cloud We design a distributed mining algorithm on the cloud to address the scalability problem. Distributed Progressive Sequential Pattern mining algorithm is implemented on top of Hadoop platform, which realizes the cloud computing environment. In general, each Map/Reduce job in Hadoop model contains a mapper class and a reducer class. Users can assign any number of nodes in the cloud to be mappers and reducers. The control node splits input data into several portions and each portion is sent to a mapper in the cloud. After the execution of the map function, each mapper generates many <key, value> pairs as the output. Pairs with the same key are sent to the same reducer. Then, the reducer aggregates values related to the specific key and outputs different <key, value> pairs according to the reduce function. M.-S. Chen 42 42

Hadoop Platform Hadoop is an open source project aiming at building a cloud infrastructure running on large clusters to deal with a huge amount of data. Hadoop platform implements Google’s Map/Reduce paradigm. It is extremely easy to scale up (adding more machines into the clusters) in Hadoop platform. M.-S. Chen 43

Map/Reduce Paradigm By means of the map function, the application can be divided into several fractions. Each fraction is assigned to a single node in large clusters and executed by the node. After the execution, the reduce function merges these partial results to form the final output. M.-S. Chen 44

Cloud Computing Environment The cloud computing environment allows developers to focus on designing distributed algorithms, and offers great scalability Routine issues can be inherently handled by the cloud computing framework. E.g., data allocation, job scheduling, load balancing, failure recovery Lead to performance improvement on sequential pattern mining (PAKDD 2010) M.-S. Chen 45

Designs of DPSP We propose two Map/Reduce jobs in DPSP. CandidateComputingJob computes current candidate sequential patterns deletes obsolete itemsets update the summary of each sequence SupportAssemblingJob accumulates the occurrence frequencies of candidate sequential patterns report up-to-date frequent sequential patterns within each POI. M.-S. Chen 46

Algorithm DPSP Input: itemsets of all sequences arriving at the current timestamp 1. while (there is new data arriving at timestamp t ){ 2. CandidateComputingJob ; 3. SupportAssemblingJob ; 4. t = t + 1 ; 5. output frequent sequential patterns; 6. }end while output: frequent sequential patterns at each POI with their supports The candidate computing job reads input data, which arrives at timestamp t, of all sequences. Itemsets from different sequences are distributed to different nodes in the cloud computing environment. Each node in the cloud computes candidate sequential patterns of each sequence within the current POI. Meanwhile, the candidate computing job also updates the summary for each sequence. Obsolete data are deleted in the candidate computing job and the up-to-date candidate sequential patterns are output. Then, support assembling job reads all candidate sequential patterns as input data. Different candidate sequential patterns are distributed to different nodes. Each node accumulates the occurrence frequencies of candidate sequential patterns and reports frequent sequential patterns whose supports are no less than the minimum support threshold in the current POI to users. When time goes to the next timestamp, DPSP keeps executing these Map/Reduce jobs. M.-S. Chen 47 47

many <SeqNo, itemset> or Start DPSP input data at time t summaries at time t-1 CCMapper Candidate Calculating Job many <SeqNo, itemset> or <SeqNo, itemset + timestamp> pairs CCReducer In the candidate computing job, we design CCMapper as the mapper class and CCReducer as the reducer class. The input data at timestamp t and the candidate set summaries at timestamp t-1 are split and transferred to several CCMappers. According to the sequence number of input data, CCMapper generates many pairs of <sequence number, input itemset>. Then, pairs with the same sequence number are sent to the same CCReducer. CCReducer computes candidate sequential patterns of the given sequence and outputs pairs of <candidate sequential patterns, null>. In addition, CCReducer updates the summary of each sequence and deletes obsolete data at the same time. Candidate set summaries at the current timestamp are output for the computation at the next timestamp as well. <candidate itemset, null> summaries at time t M.-S. Chen 48 48

Example of CCJ … A B C AD time t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 BD S01 Db1,5 Db2,6 Db3,7 Db4,8 Db5,9 Db6,10 Let’s take S01 as an example. A B C AD time t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 … BD S01 M.-S. Chen 49 49

A B C AD time t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 BD S01 … Summary at t5 A5 B(AD)2 B2 CA4 C4 CD4 BC2 C(AD)4 D5 BCA2 (AD)5 BCD2 BA2 BC(AD)2 BD2 Summary at t6 A5 CA4 C4 CD4 D5 C(AD)4 (AD)5 Summary at t7 A5 CD4 DB5 C4 C(AD)4 (AD)B5 D5 B7 CAB4 (AD)5 AB5 CDB4 CA4 CB4 C(AD)B4 Result set at t6 BC2 CD4 BA2 C(AD)4 BD2 BCA2 B(AD)2 BCD2 CA4 BC(AD)2 Result set at t7 CA DB CD (AD)B C(AD) CAB AB CDB CB C(AD)B Assume POI is set as 5 timestamps and we are dealing with t6 and t7. The timestamps of candidate itemsets in the summary at t5 are all bigger than 2, which means these candidates are all valid at t6. At t6, there is no newly arriving itemset for S01. CCMappers read the summary at t5 and distributes candidate itemsets of S01 to a CCReducer. CCReducer outputs the candidate itemsets whose timestamps are bigger than 3 to the summary at. Meanwhile, CCReducer stores the candidate itemsets whose sizes are larger than 1 into the result set. When time moves to t7, S01 has a newly arriving itemset B. CCMappers distribute the summary at t6 and the arriving itemset B to a CCReducer. All candidate itemsets are stored in cand_set. The same as the process at t6, CCReducer outputs those itemsets with white background as the summary at t7. Meanwhile, CCReducer stores those candidate itemsets whose sizes are larger than 1 into the result set. The arriving itemset B is stored in the input data. B is appended to all candidate itemsets in the cand_set. The newly generated candidate itemsets are AB5, CB4, DB5, (AD)B5, CAB4, CDB4, and C(AD)B4. They are stored in the result set and output to the summary at t7. M.-S. Chen 50 50

<candidate itemset, null> summaries at time t SAMapper Support Assembling Job many <candidate, support> pairs SAReducer Next, the support assembling job takes the candidate sequential patterns output by the candidate computing job as its input data. SAMapper is the mapper class and SAReducer is the reduce class of the support assembling job. Each SAMapper reads input data and accumulates local occurrence frequencies for each candidate sequential patterns. After parsing all input data, SAMapper generates pairs of <candidate sequential pattern, local supports of the candidate>. Then, the pairs containing the same candidate sequential pattern are sent to the same SAReducer. SAReducer aggregates supports of the same candidate sequential pattern and outputs those patterns whose occurrence frequencies are not less than the minimum support threshold. The frequent sequential patterns in the current POI are generated. After the computation of the candidate computing job and the support assembling job at the timestamp t, DPSP moves to the next timestamp t+1. <frequent sequential patterns, support> t=t+1 M.-S. Chen 51 51

(2) (4) (5) Db1,2(4) AB 3 A(BC) 1 AC (AD)B DB Db1,4(5) AB 3 A(BC)BC 1 A(BC)C AC 2 (AD)A (AD)B (AD)BA DB BA A(BC)B BC ACB (BC)BC (BC)B (BC)C CB DA DC DBA ABC Db1,5(5) AB 3 ABC 2 DBA BCA 1 A(BC) A(BC)BC A(AD) BC(AD) AC A(BC)C AB(AD) BCD (AD)B (AD)A ABC(AD) BD DB (AD)BA ABCD CA A(BC)B BA ABD C(AD) ACB BC AC(AD) CD (BC)B (BC)BC ACD DCA CB (BC)C AD DC DA B(AD) AB(3) (3) Db1,3(5) AB 3 A(BC) 1 AC (AD)B DB A(BC)B ACB (BC)B CB DC AB(3) BA(3) DA(3) AB(3) Final results of SAJ at each timestamp (t). (2~5 in this page, 6~8 in the next slide) The frequent ones are output at the bottom of each table. For example, at t2, the output frequent sequential pattern is AB(3), which means AB has supports 3. M.-S. Chen AB(3) 52 52

(6) (7) (8) Db2,6(5) DB 1 BC(AD) (BC)B BCD CB BD DC CA 3 BA 4 C(AD) BC 2 CD (BC)BC DCA (BC)C (BC)A DA (BC)BA DBA (BC)BCA B(AD) (BC)CA BCA3 CBA Db3,7(5) DB 2 (AD)B 1 BA BAC BC CAB DA CA(BC) DBA C(AD)B BCA CB CA 3 C(BC) C(AD) CDB CD DAC AB DBAC A(BC) DBC AC DC Db4,8(6) DB 1 BAC BA CAB BC 2 C(AD)B CA CB C(AD) CDB CD ABC AB (AD)BC A(BC) (AD)C AC 4 DBC (AD)B DC AC(4) BA(4) CA(3) CA(3) M.-S. Chen 53

Conclusions We proposed a distributed algorithm DPSP to address the inevitable scalability problem of the progressive sequential pattern mining. We designed two Map/Reduce jobs in DPSP to efficiently compute candidate sequential patterns, update summaries of sequences, and assemble supports of candidate sequential patterns within each POI M.-S. Chen 54

Conclusions The experimental results show that DPSP possesses great scalability and thus increases practicability when the number of sequences become larger. By utilizing Hadoop platform, it is easy to increase the number of computing nodes in the cluster to acquire better performance. M.-S. Chen 55

Reference [1] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen, A General Model for Sequential Pattern Mining with a Progressive Database, IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1153-1167, 2008 (September) [2] http://hadoop.apache.org [3] J. Dean and S. Ghemawat. Mapreduce: Simplified dataprocessing on large clusters. Symp. on Operating System Design and Implementation, 2004. M.-S. Chen 56