Cloud Computing and Its Technologies and Applications. Ming-Syan Chen (陳銘憲), Director, Research Center for Information Technology Innovation, Academia Sinica. March 19, 2010
Received the talk request… What to talk about? In an hour. Be relevant.
Outline: Introduction to Cloud Computing; Taiwan's opportunity; Cloud applications; Technology issues; Work in NetDB; Conclusion. Good morning, and welcome, everyone, to IIS!
Definition of cloud computing. (1) Cloud computing is a business model and a set of technologies for obtaining remote computing services over the network: 1. data and software services move to large (typically more than 10,000 hosts), scalable, shared data centers on the Internet; 2. cloud data centers provide virtually unlimited computing, storage, and applications, so that with any Internet-connected terminal device a user effectively owns a virtual supercomputer. (2) In the future, computing will be consumed like water or electricity: connect to the network and use it, with no need for everyone to invest in building it themselves.
Cloud Operating System: what future cloud data centers will look like and the benefits they will bring, with thin clients connecting to a cloud operating system. (Sources: ISI International Strategy & Investment; www.spectrum.ieee.org, 02/2009)
Why cloud computing: better resource sharing, lower overall IT cost, better management of computing resources, easy resource access for users, better SW/application dissemination, and support for ubiquitous computing/LBS. It is promising in view of technology trends: hardware cost, network availability, the value of software, etc.
Outline: Introduction to Cloud Computing; Taiwan's opportunity; Cloud applications; Technology issues; Work in NetDB; Conclusion.
Cloud computing creates an opportunity for Taiwan's IT industry to upgrade toward higher value and service orientation. Cloud computing fundamentally changes the IT industry's supply chain and the way IT is used, with major impacts on the industry: (1) end users no longer need to install software on their terminals; with cloud software services (SaaS, Software-as-a-Service), applications are used over the Internet; (2) programmers no longer need to install development tools; with cloud platform services (PaaS, Platform-as-a-Service), software systems are developed over the Internet; (3) enterprises no longer need to build their own server rooms; with cloud data-center services (IaaS, Infrastructure-as-a-Service), they buy computing and storage over the Internet and build the information systems they need. Taiwan's ICT industry excels at manufacturing; facing this era of competition centered on software services, it must upgrade toward high-value-added system manufacturing and application services.
Cloud computing brings opportunities to Taiwan's hardware industry: cloud container computers. 1980s: "a personal computer on every desk"; 2010s: "a supercomputer for everyone on the move". [Chart: worldwide market size in billions of US dollars, from roughly 100 to 3,000, across the years 1939-2015 and four computing eras: mainframes (centralized computing; traditional enterprise ERP/MRP and automation), personal computers (personal computing; personal and office applications), Internet-connected mobile devices (mobile computing; mobile services), and container computers (cloud computing).] Cloud container-computer data centers will make future computing available like water and electricity: connect to the network and use as much as you need.
Cloud computing brings opportunities to Taiwan's software industry: innovative smart-living application services. Analogy with the semiconductor industry: just as foundries (TSMC, UMC) enabled fabless chip designers (e.g., nVidia) that need no production line of their own, cloud computing and cloud devices enable datacenter-less SaaS providers that need no data center of their own (Source: "Above the Clouds: A Berkeley View of Cloud Computing," Feb. 4, 2009, and its revision). In the future, Taiwanese software companies will not need to worry about building and maintaining servers; they can develop software directly on cloud data-center facilities for users worldwide, creating enormous business opportunities. For example, Zynga built FarmVille (開心農場) on Facebook's cloud computing platform and generated US$100 million in revenue within just two years, breaking Google's record. Opportunities in smart living (Source: IBM): smart food systems, intelligent oil-field technologies, smart traffic systems, smart healthcare, smart energy grids, smart retail, smart supply chains, smart countries, smart weather, smart regions, smart cities, and smart water management.
Taiwan's niche markets and business opportunities in cloud computing: (1) Container computer systems: building a cloud data center of more than ten thousand machines the traditional way requires complex integration of software and hardware with power, water, and cooling facilities; container computer systems speed up construction, and with Taiwan shipping the world's largest volume of servers there is a clear niche. (2) Open, secure cloud operating systems: running 100,000 or more servers, storage units, and network devices requires an operating system that supports virtualized, clustered operation; there is currently no industry standard, and vendors need an open architecture to avoid lock-in. Building on Taiwan's international strength in information-security software and the software-engineering talent of its research institutes, an open, secure cloud operating system can be developed on open source as the essential software for building cloud data centers. (3) Innovative cloud-plus-device application services: building on Taiwan's diverse and strong service-industry know-how, together with the government's Intelligent Taiwan initiative and the six emerging industries, Taiwan can develop smart-living application services that raise people's quality of life and, by combining cloud services with Taiwan's world-leading shipments of terminal devices, sell them quickly to international markets.
Outline: Introduction to Cloud Computing; Taiwan's opportunity; Cloud applications; Technology issues; Work in NetDB; Conclusion.
Software as a Service: all of Google's application services, online video-sharing platforms, online photo sharing, Yahoo's application services, and others.
Software as a Service (cont'd): social networks, microblogging (Twitter, Plurk), blogs, and enterprise application services such as CRM and ERP.
Platform as a Service: mainly provides APIs for users to develop cloud versions of their applications and offer them as services, for example Google AppEngine and the Facebook f8 platform. Facebook exposes APIs that let third-party vendors build on Facebook's social-networking platform, saving development cost so they can concentrate on creative application development.
Platform as a Service (cont'd): Microsoft Azure, Microsoft's new-generation cloud computing platform, which comprises Azure, SQL Azure, and AppFabric; the IBM Pangoo Platform (盤古雲端服務平臺), a PaaS cloud architecture built on IBM products such as WAS, DB2, and Tivoli; and Amazon Web Services (AWS is PaaS, while EC2 is IaaS).
Infrastructure as a Service: the best-known example is Amazon's EC2 (Elastic Compute Cloud). Pay as you go: pricing is based on how long and how many instances you run. IT infrastructure is consolidated and delivered through a virtualization platform, so users do not need to buy servers or lease physical space or networks; they obtain the resources they need by renting IaaS.
Amazon business model, EC2 On-Demand Instances: different CPU capacities and memory sizes are billed at different rates. The default (Small) instance has 1.7 GB of memory, 1 EC2 Compute Unit, 160 GB of local instance storage, and a 32-bit platform.
Outline: Introduction to Cloud Computing; Taiwan's opportunity; Cloud applications; Technology issues; Work in NetDB; Conclusion.
Virtualization technology (supporting the sharing of computing resources): virtualization uses a mapping layer (a virtual machine monitor, hypervisor, or virtualization layer) to turn a pool of hardware, such as servers and storage, into virtual devices, so that different operating systems can share that hardware. (Source: Mendel Rosenblum, Stanford U., 1998)
Cluster computing technology (supporting high-performance, highly scalable computing): many physical machines (usually of identical specification) are connected by a network to realize highly scalable, high-performance distributed computing (for example, Google Search).
SaaS delivery platform technology: provides online subscription, authentication, licensing, and usage logging/settlement/billing for rented software, and, according to each customer's payment plan, allocates the computing resources for the rented software and manages its service quality. [Diagram: enterprises A through D each use on-demand applications on a multi-tenancy service delivery platform, with components for runtime access control, order management, management agents, metering, provisioning, security, log management, usage tracking, identity management, billing, CRM, performance/availability/security/SLA monitoring, management alerts, and call-center support. Figure sources: Microsoft; Forrester Research]
Virtual desktop and web desktop (webtop) technology: all client software and data can move to physical or virtual hosts in a cloud data center; with streaming and Rich Internet Application (RIA) technologies, an inexpensive, simple terminal can give users a desktop experience equal to or better than today's PC desktop. (Example webtop: http://g.ho.st)
Outline: Introduction to Cloud Computing; Taiwan's opportunity; Cloud applications; Technology issues; Work in NetDB; Conclusion.
Cloud-computing-related work in NetDB: content search (2D bar code, watch) (with CR Kung and H. Chi); social-network-enhanced search/queries; mining for SaaS activities such as group buying (團購) (with YH Hung); sequential pattern mining on the cloud (with JW Huang).
Distributed algorithms on the Hadoop platform: Hadoop is an open-source project aiming at building a cloud infrastructure that runs on large clusters for huge amounts of data. The Hadoop platform implements the Map/Reduce paradigm and is easy to scale up by adding more machines to the cluster. The distributed algorithms we devised improve the performance of sequential pattern mining (PAKDD 2010).
The Institute for Information Industry (III, 資策會) has established a Cloud Service Technology Center to develop a cloud-plus-device service platform and to take on design-on-commission of high-value-added cloud application services (Cloud Device-and-Service ODM), helping Taiwanese IT vendors upgrade their technology and enter the international cloud application service market. [Diagram: a cloud-plus-device service platform with cloud security; devices (phone, e-book, netbook, display) running device OSes (Android, WinCE, BIOS, etc.) and Java/.NET application platforms; virtual desktop and web desktop; application domains including cultural and creative, healthcare, and SME applications; a SaaS delivery platform and an open cloud operating system on commodity servers, storage, and network equipment; an experimental cloud data center at the Kaohsiung Software Park; division of work among local vendors, III, and foreign vendors.]
The Industrial Technology Research Institute (ITRI, 工研院) has established a Cloud Computing Center to develop container computers and a cloud operating system (Cloud OS), nurturing Taiwanese vendors to offer turnkey cloud data-center products and enter the global cloud computing market. [Diagram: a cloud data center built from servers, storage, and network inside container computers, running a Cloud OS; division of work among local vendors, ITRI, and foreign vendors.]
Conclusion: the era of elastic IT is about to arrive. Cloud computing will become prevalent in light of the following: the drop in hardware cost, the improvement of network and hardware speed, the increase in storage (disk/memory) size, the demand for ubiquitous applications, and the explosion of information and data (data mining).
Thank you!
Amazon business model, EC2 Reserved Instances: reserving instances is priced under a different scheme.
Amazon business model, EC2 data transfer: data transfer is charged separately.
Simple Storage Service (Amazon S3): S3 is Amazon's online storage service, aimed at enterprises and individual users who need storage space on the network; it rents out online space for storing files.
Amazon EC2 (Elastic Compute Cloud): create an Amazon Machine Image (AMI) containing your applications, libraries, data, and associated configuration settings, or use Amazon's pre-configured template images to get up and running immediately. Upload the AMI into Amazon S3; Amazon EC2 provides tools that make storing the AMI simple, and S3 provides a safe, reliable, and fast repository for your images. Use the Amazon EC2 web service to configure security and network access. Choose the instance type(s) you want to run. Start, terminate, and monitor as many instances of your AMI as needed, using the web-service APIs. Pay for the instance-hours and bandwidth that you actually consume.
Amazon EC2 (Elastic Compute Cloud) pricing.
Instances (per instance-hour):
$0.10 - Small Instance (default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
$0.40 - Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
$0.80 - Extra Large Instance: 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform
Data transfer:
$0.10 per GB - all data transfer in
$0.18 per GB - first 10 TB per month, data transfer out
$0.16 per GB - next 40 TB per month, data transfer out
$0.13 per GB - data transfer out over 50 TB per month
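As a worked example under the rates listed above (a rough sketch only; Amazon's actual prices change over time): one Small instance running around the clock for a 30-day month uses 720 instance-hours, costing 720 × $0.10 = $72.00; transferring 50 GB in and 200 GB out adds 50 × $0.10 + 200 × $0.18 = $5.00 + $36.00 = $41.00; the total for that month is therefore about $113.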
Cloud-computing-related work in NetDB (cont'd), Infrastructure as a Service (IaaS): we currently use Eucalyptus to set up an environment in which users can run and control virtual machine instances, directly compatible with the EC2/S3 API tools. About 100 machines are set up for now, and more may be added in the future.
Mining on the cloud: sequential pattern mining. The period of interest (POI) is a sliding window whose length is a user-specified time interval and which continuously advances as time goes by. The sequences with elements whose timestamps fall into the current POI contribute to |Db|, the database size used for the current sequential patterns.
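Restating this in symbols (a reading of the definition above; the subscript convention follows the Db notation used on the next slides): if the POI spans p timestamps, the window ending at timestamp t_j is Db_{i,j} with i = j - p + 1 (e.g., Db_{1,5} and Db_{2,6} for p = 5). On the usual relative-support reading, a sequential pattern is frequent in that window when its number of occurrences is at least min_supp × |Db_{i,j}|, where |Db_{i,j}| is the number of sequences having at least one element with a timestamp in [t_i, t_j].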
[Example: six sequences S01 through S06 with itemsets (A, B, C, D, AD, BD, BC, ...) arriving at timestamps t1 through t10, and the sliding databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, and Db6,10; POI = 5, min_supp = 0.5.] We can see that there is a frequent sequential pattern (A, B) in the first POI, i.e., Db1,5. However, after t6, AB is no longer a frequent sequential pattern in any POI. If we do not delete obsolete items from the static or incremental database, AB will be regarded as an interesting pattern until t8, when S06 joins the database. In addition, S02 has no more items after t5, so S02 should be deleted from the database accordingly.
Motivation for mining on the cloud: with the increasing amount of data, single processors struggle to scale up, and mining progressive sequential patterns intrinsically suffers from this scalability problem. As the number of sequences grows and the POI becomes longer, the time and space needed to mine progressive sequential patterns increase dramatically.
Distributed mining algorithm on the cloud: we design a distributed mining algorithm on the cloud to address the scalability problem. The Distributed Progressive Sequential Pattern (DPSP) mining algorithm is implemented on top of the Hadoop platform, which realizes the cloud computing environment. In general, each Map/Reduce job in the Hadoop model contains a mapper class and a reducer class, and users can assign any number of nodes in the cloud to be mappers and reducers. The control node splits the input data into several portions and sends each portion to a mapper in the cloud. After executing the map function, each mapper generates many <key, value> pairs as output; pairs with the same key are sent to the same reducer. The reducer then aggregates the values related to a specific key and outputs different <key, value> pairs according to the reduce function.
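As a concrete illustration of the mapper/reducer plumbing just described, the following is a minimal Hadoop sketch (written against the org.apache.hadoop.mapreduce API) of a job that groups input records by a key, here a sequence number, so that everything belonging to one sequence reaches one reducer. The class names, the assumed "SeqNo<TAB>itemset" input layout, and the pass-through reducer body are illustrative assumptions, not code from the actual DPSP implementation.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SequenceGrouping {

      // Mapper: emits <sequence number, itemset record> pairs so that Hadoop
      // routes all itemsets of the same sequence to the same reducer.
      public static class SeqMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // assumed input layout: "<seqNo>\t<itemset>@<timestamp>"
          String[] parts = line.toString().split("\t", 2);
          if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
          }
        }
      }

      // Reducer: receives every record of one sequence; a real DPSP reducer would
      // build candidate sequential patterns and update the sequence's summary here
      // (see the slides that follow). This sketch only concatenates the records.
      public static class SeqReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text seqNo, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
          StringBuilder merged = new StringBuilder();
          for (Text r : records) {
            merged.append(r.toString()).append(' ');
          }
          context.write(seqNo, new Text(merged.toString().trim()));
        }
      }
    }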
Hadoop platform: Hadoop is an open-source project aiming at building a cloud infrastructure that runs on large clusters to deal with huge amounts of data. The Hadoop platform implements Google's Map/Reduce paradigm, and it is extremely easy to scale up on Hadoop by adding more machines to the cluster.
Map/Reduce paradigm: by means of the map function, the application is divided into several fragments, each of which is assigned to and executed by a single node in the large cluster. After the execution, the reduce function merges these partial results to form the final output.
Cloud computing environment: the cloud computing environment lets developers focus on designing distributed algorithms and offers great scalability, since routine issues such as data allocation, job scheduling, load balancing, and failure recovery are handled by the framework. This leads to a performance improvement on sequential pattern mining (PAKDD 2010).
Design of DPSP: we propose two Map/Reduce jobs in DPSP. The CandidateComputingJob computes the current candidate sequential patterns, deletes obsolete itemsets, and updates the summary of each sequence. The SupportAssemblingJob accumulates the occurrence frequencies of candidate sequential patterns and reports the up-to-date frequent sequential patterns within each POI.
Algorithm DPSP
Input: itemsets of all sequences arriving at the current timestamp
Output: frequent sequential patterns at each POI with their supports
1. while (there is new data arriving at timestamp t) {
2.   CandidateComputingJob;
3.   SupportAssemblingJob;
4.   t = t + 1;
5.   output frequent sequential patterns;
6. } end while
The candidate computing job reads the input data arriving at timestamp t for all sequences. Itemsets from different sequences are distributed to different nodes in the cloud computing environment, and each node computes the candidate sequential patterns of each sequence within the current POI. Meanwhile, the candidate computing job updates the summary of each sequence: obsolete data are deleted, and the up-to-date candidate sequential patterns are output. The support assembling job then reads all candidate sequential patterns as its input. Different candidate sequential patterns are distributed to different nodes; each node accumulates the occurrence frequencies of the candidate sequential patterns and reports to users the frequent sequential patterns whose supports are no less than the minimum support threshold in the current POI. When time moves to the next timestamp, DPSP keeps executing these two Map/Reduce jobs.
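The while loop above can be pictured as a small Hadoop driver that chains the two jobs once per timestamp. The sketch below is an assumption-laden illustration rather than the authors' code: the directory layout is invented, and Hadoop's identity Mapper and Reducer stand in for the candidate-computing and support-assembling classes so that the skeleton compiles on its own.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DpspDriver {

      // Configures one Map/Reduce job. The identity Mapper/Reducer are placeholders;
      // a real run would plug in the candidate-computing or support-assembling
      // mapper and reducer classes described on these slides.
      private static Job buildJob(Configuration conf, String name, String in, String out)
          throws Exception {
        Job job = new Job(conf, name);
        job.setJarByClass(DpspDriver.class);
        job.setMapperClass(Mapper.class);     // placeholder mapper (identity)
        job.setReducerClass(Reducer.class);   // placeholder reducer (identity)
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job;
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int maxT = Integer.parseInt(args[0]);   // assumed: last timestamp to process
        for (int t = 1; t <= maxT; t++) {
          // Job 1: in DPSP this is the candidate computing step for timestamp t.
          Job cc = buildJob(conf, "CandidateComputingJob-t" + t,
                            "dpsp/input/t" + t, "dpsp/candidates/t" + t);
          if (!cc.waitForCompletion(true)) break;

          // Job 2: in DPSP this is the support assembling step for timestamp t.
          Job sa = buildJob(conf, "SupportAssemblingJob-t" + t,
                            "dpsp/candidates/t" + t, "dpsp/frequent/t" + t);
          if (!sa.waitForCompletion(true)) break;
        }
      }
    }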
[Data flow of the candidate computing job: the input data at time t and the summaries at time t-1 flow into CCMappers, which emit many <SeqNo, itemset> or <SeqNo, itemset + timestamp> pairs; CCReducers output <candidate itemset, null> pairs and the summaries at time t.]
In the candidate computing job, we design CCMapper as the mapper class and CCReducer as the reducer class. The input data at timestamp t and the candidate-set summaries at timestamp t-1 are split and transferred to several CCMappers. According to the sequence number of the input data, each CCMapper generates many <sequence number, input itemset> pairs, and pairs with the same sequence number are sent to the same CCReducer. The CCReducer computes the candidate sequential patterns of the given sequence and outputs <candidate sequential pattern, null> pairs. In addition, the CCReducer updates the summary of each sequence and deletes obsolete data at the same time. The candidate-set summaries at the current timestamp are output as well, for the computation at the next timestamp.
Example of the candidate computing job: let us take S01 as an example. [Figure: S01's itemsets A, B, C, (AD), and (BD) arriving over timestamps t1 to t10, with the sliding databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, and Db6,10.]
[Tables: S01's candidate-set summaries at t5, t6, and t7 and its result sets at t6 and t7; each summary entry is a candidate itemset tagged with a timestamp, e.g., A5, CA4, C(AD)4.]
Assume the POI is set to 5 timestamps and we are dealing with t6 and t7. The timestamps of the candidate itemsets in the summary at t5 are all bigger than 2, which means these candidates are all still valid at t6. At t6 there is no newly arriving itemset for S01; the CCMappers read the summary at t5 and distribute S01's candidate itemsets to a CCReducer. The CCReducer outputs the candidate itemsets whose timestamps are bigger than 3 to the summary at t6, and meanwhile stores the candidate itemsets whose sizes are larger than 1 into the result set. When time moves to t7, S01 has a newly arriving itemset B. The CCMappers distribute the summary at t6 and the arriving itemset B to a CCReducer, and all candidate itemsets are stored in cand_set. As at t6, the CCReducer outputs the still-valid candidate itemsets to the summary at t7 and stores those whose sizes are larger than 1 into the result set. The arriving itemset B is kept as the input data and appended to all candidate itemsets in cand_set; the newly generated candidate itemsets are AB5, CB4, DB5, (AD)B5, CAB4, CDB4, and C(AD)B4, which are stored in the result set and output to the summary at t7.
[Data flow of the support assembling job: the <candidate itemset, null> pairs and the summaries at time t flow into SAMappers, which emit many <candidate, support> pairs; SAReducers output <frequent sequential pattern, support> pairs, and then t = t + 1.]
Next, the support assembling job takes the candidate sequential patterns output by the candidate computing job as its input data. SAMapper is the mapper class and SAReducer is the reducer class of the support assembling job. Each SAMapper reads its input data and accumulates local occurrence frequencies for each candidate sequential pattern; after parsing all of its input, it generates <candidate sequential pattern, local support> pairs. Pairs containing the same candidate sequential pattern are sent to the same SAReducer, which aggregates the supports of that candidate and outputs the patterns whose occurrence frequencies are not less than the minimum support threshold. The frequent sequential patterns in the current POI are thus generated. After the candidate computing job and the support assembling job finish at timestamp t, DPSP moves on to the next timestamp t+1.
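To make the support-assembling step concrete, here is a minimal counting sketch in the same Hadoop API, following the classic word-count shape. The per-mapper local pre-aggregation described above is omitted for brevity, and the class names and the "dpsp.min.support" configuration key are assumptions, not the actual SAMapper/SAReducer code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SupportAssembling {

      // Mapper: one output pair per candidate occurrence. (The real SAMapper also
      // pre-aggregates local counts before emitting; that optimization is omitted.)
      public static class CandidateMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text candidate, Context context)
            throws IOException, InterruptedException {
          // assumed input: one candidate sequential pattern per line
          context.write(new Text(candidate.toString().trim()), ONE);
        }
      }

      // Reducer: sums the counts of one candidate and keeps it only if its support
      // reaches the minimum support threshold for the current POI.
      public static class SupportReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private int minSupport;
        @Override
        protected void setup(Context context) {
          // illustrative configuration key; the real name is an assumption
          minSupport = context.getConfiguration().getInt("dpsp.min.support", 2);
        }
        @Override
        protected void reduce(Text candidate, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int support = 0;
          for (IntWritable c : counts) {
            support += c.get();
          }
          if (support >= minSupport) {
            context.write(candidate, new IntWritable(support));
          }
        }
      }
    }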
[Tables: final results of the support assembling job for Db1,2(4), Db1,3(5), Db1,4(5), and Db1,5(5), listing the candidate sequential patterns with their supports; the frequent patterns output at the bottom of the tables include AB(3), BA(3), and DA(3).]
Final results of the support assembling job at each timestamp t (t2 to t5 on this slide, t6 to t8 on the next). The frequent patterns are output at the bottom of each table; for example, at t2 the output frequent sequential pattern is AB(3), which means AB has support 3.
[Tables: final results of the support assembling job for Db2,6(5), Db3,7(5), and Db4,8(6) at t6 to t8, listing the candidate sequential patterns with their supports; the frequent patterns output at the bottom of the tables include CA(3), AC(4), and BA(4).]
Conclusions: we proposed a distributed algorithm, DPSP, to address the inevitable scalability problem of progressive sequential pattern mining. We designed two Map/Reduce jobs in DPSP to efficiently compute candidate sequential patterns, update the summaries of sequences, and assemble the supports of candidate sequential patterns within each POI.
Conclusions (cont'd): the experimental results show that DPSP possesses great scalability and thus remains practical as the number of sequences becomes larger. By utilizing the Hadoop platform, it is easy to increase the number of computing nodes in the cluster to obtain better performance.
References: [1] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, and Ming-Syan Chen, "A General Model for Sequential Pattern Mining with a Progressive Database," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1153-1167, September 2008. [2] http://hadoop.apache.org [3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Symposium on Operating System Design and Implementation (OSDI), 2004.