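The Growth of a Data Scientist: My Road in Data Science. Yuh-Jye Lee (李育杰), Data Science and Machine Intelligence Lab, Department of Applied Mathematics, National Chiao Tung University. Taiwan Data Science Conference, July 14-17, 2016
Big Data: The 3 Vs (Volume, Velocity, Variety)
Three musts for a talk at the Taiwan Data Science Conference: be interesting, be substantive, be useful
Agenda
  From Data Mining to Big Data
  Some of my experiences in data science
    Breast Cancer Diagnosis and Prognosis
    Malicious URLs Detection
    Fraudulent Item Detection on Ruten Auctions (露天拍賣)
    …
  Final Remarks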
Breast Cancer Diagnosis and Prognosis
Cell Nuclei of a Fine Needle Aspirate: tissue-fluid cells under an electron microscope. Nuclear feature extraction for breast tumor diagnosis, WN Street, WH Wolberg, OL Mangasarian, IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, 861-870
From Electron Microscope Images to Cell Features
Breast Cancer Diagnosis via SVM: 97% ten-fold cross-validation correctness on 780 patients (494 benign, 286 malignant)
Who Will Benefit from Chemotherapy?
Survival Curves for All Patients: With vs. Without Chemotherapy
Overall Clustering Process
  253 patients (113 NoChemo, 140 Chemo)
  Compute initial cluster centers: the medians, using 6 features, of
    Good1: Lymph = 0 AND Tumor < 2
    Poor1: Lymph >= 5 OR Tumor >= 4
  Cluster the 113 NoChemo patients with the k-median algorithm, initial centers = medians of Good1 & Poor1: 69 NoChemo Good, 44 NoChemo Poor
  Cluster the 140 Chemo patients with the k-median algorithm, initial centers = medians of Good1 & Poor1: 67 Chemo Good, 73 Chemo Poor
  Resulting groups: Good, Intermediate, Poor
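The seeded k-median step above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes L1 distance and coordinate-wise medians, uses only two hypothetical features (lymph nodes, tumor size) instead of the six on the slide, and all patient values are invented.

```python
from statistics import median


def l1_distance(a, b):
    """Sum of absolute coordinate differences (the distance k-median minimizes)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: l1_distance(p, centers[j]))
            for p in points]


def k_median(points, initial_centers, max_iter=100):
    """Alternate assignment and coordinate-wise median updates until stable."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            updated.append(tuple(median(col) for col in zip(*members)) if members else old)
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


# Toy patients as (lymph nodes, tumor size); values are invented for illustration.
patients = [(0, 1.0), (0, 1.5), (1, 1.0), (6, 5.0), (7, 4.0), (5, 6.0)]
good1_median, poor1_median = (0, 1.0), (6, 5.0)  # assumed seed medians
labels, centers = k_median(patients, [good1_median, poor1_median])
```

Seeding with the medians of clinically defined groups, rather than random points, ties cluster 0 to "good" and cluster 1 to "poor" from the start.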
Survival Curves for Good, Intermediate & Poor Groups
Survival Curves for Intermediate Group: Split by Chemo & NoChemo
The Lessons I Learned
  Privacy is an issue
  Working with domain experts is EXTREMELY important
  Y2K
  "I was of humble station when young, which is why I am able at many menial tasks" (Analects)
  Breast cancer survival and chemotherapy: a support vector machine analysis, YJ Lee, OL Mangasarian, WH Wolberg, Discrete Math. Problem with Medical Application, DIMACS Workshop
  Survival-time classification of breast cancer patients, YJ Lee, OL Mangasarian, WH Wolberg, Computational Optimization and Applications 25 (1-3), 151-166
Malicious URLs Detection: Can you filter out the benign URLs based ONLY on the URL stream? Threats: social engineering and spam; identity fraud; drive-by download; botnet (zombie network). 09/10/2013 Lab of Data Science & Machine Intelligence
Malicious Websites: malicious websites have become tools for spreading criminal activity on the Web (phishing, malware). Example: (a) paypal.com (b) paypal.com-us.cgi-bin-webscr...
Defences Against Malicious URLs
  Blacklist services: PhishTank; Spam and Open Relay Blocking System (SORBS); Real-time URI Blacklist (URIBL)
  Malicious URL detection: Google (Google Safe Browsing); Trend Micro (Web Reputation Service)
Why Do We Need a Filtering Mechanism?
  URL requests received from users all over the world: 3,000,000,000 ~ 7,000,000,000 per day
  200,000,000 ~ 800,000,000 need to be analyzed
  Only 0.01% are malicious URLs
  The volume of requested URLs is too large to use host-based information (WHOIS information) or content information
Requirements from Industry
  No page content is needed for prioritization (prioritization means returning the most suspicious URLs)
  No host-based information is allowed
  Effectiveness
    Filtering (Download) Rate = Filtered URLs / Total URLs < 25%
    Malicious Coverage = Filtered Malicious URLs / Total Malicious URLs > 75%
  Performance: filter more than 2,000 URLs per second on one dual-core VM with 4 GB memory
  Scalability: one hour of data should be consumed within one hour
Big Challenges
  Large-scale data stream: one million URLs are received per hour on average
  High-dimensional and sparse representation: lexical information makes the feature vector very sparse
  Extremely imbalanced data set: only about 0.01% of the URLs are malicious
  Malicious URLs usually have a very short lifetime, whereas normal URLs stay up longer for usability
  Note: these challenges illustrate the 3 Vs
Finding a Needle in a Haystack
Our Main Results
  Malicious URLs covering rate: 90% (requirement: 75%; uniformly random selection: 25%)
  Filtering rate ≈ false positive rate
Feature Extraction
  Limitations: cannot use host-based information or web page content information
  Two types of feature sets are proposed as different views for inspecting a received URL: lexical features and descriptive features
Lexical Features: Information of Words
  The words in a URL string are translated into a Boolean vector
  Each Boolean value represents the occurrence of a specific word
  Each URL component is split by specific delimiters, and the words are saved in a dictionary
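A minimal sketch of this word-to-Boolean-vector encoding, assuming a particular delimiter set (the slide does not list the delimiters) and representing the sparse Boolean vector as the set of its active indices. `WordDictionary` and `url_words` are hypothetical names.

```python
import re

# Assumed delimiter set; the slide only says "specific delimiters".
DELIMITERS = re.compile(r"[/.?=&:_\-]+")


def url_words(url):
    """Split a URL into lowercase words on the assumed delimiters."""
    return [w for w in DELIMITERS.split(url.lower()) if w]


class WordDictionary:
    """Maps each word to a column index of the Boolean vector."""

    def __init__(self):
        self.index = {}

    def encode(self, url, grow=True):
        """Return the set of active column indices (a sparse Boolean vector)."""
        active = set()
        for word in url_words(url):
            if word not in self.index:
                if not grow:
                    continue  # unseen word at prediction time: ignore it
                self.index[word] = len(self.index)
            active.add(self.index[word])
        return active
```

Storing only the active indices keeps memory proportional to the number of words in the URL rather than to the dictionary size.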
Lexical Features (cont.)
  Three-character sliding window on the domain name, to catch malicious websites that slightly modify their domain name
  To reduce the memory usage of the dictionary:
    Remove zero-weight words
    Remove words from argument values
    Replace the IP with its AS number (using a static mapping table)
    Replace the digits in a word with a regular expression, e.g. cool567 becomes cool[0-9]+
    Keep only the words generated in the last 24 hours
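Two of the steps above, the three-character sliding window and the digit generalization, can be sketched as below. The function names are hypothetical.

```python
import re


def domain_trigrams(domain):
    """All three-character windows over the domain name, left to right."""
    return [domain[i:i + 3] for i in range(len(domain) - 2)]


def generalize_digits(word):
    """Replace each run of digits with the literal pattern text [0-9]+."""
    return re.sub(r"[0-9]+", "[0-9]+", word)


print(domain_trigrams("paypa1"))     # ['pay', 'ayp', 'ypa', 'pa1']
print(generalize_digits("cool567"))  # cool[0-9]+
```

A look-alike domain such as paypa1 still shares most of its trigrams with paypal, which is what lets the lexical filter recognize slight modifications.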
Descriptive Features: Static Characteristics of the URL String
  Descriptive features observed from malicious websites
  For detecting phishing websites:
    A Digit between Two Letters (LDL), e.g. Examp1e
    A Letter between Two Digits (DLD), e.g. award2o12
  For detecting malware websites: executable file or not
  Descriptive features are not easily changed by modifying the URL
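A hedged sketch of these three features using regular-expression lookarounds. The list of executable suffixes is an assumption; the slide does not say which file types count as executable.

```python
import re
from urllib.parse import urlparse

LDL = re.compile(r"(?<=[a-z])[0-9](?=[a-z])")  # digit between two letters
DLD = re.compile(r"(?<=[0-9])[a-z](?=[0-9])")  # letter between two digits
EXECUTABLE_SUFFIXES = (".exe", ".scr", ".bat")  # assumed list


def count_ldl(text):
    """Number of digits that sit directly between two letters."""
    return len(LDL.findall(text.lower()))


def count_dld(text):
    """Number of letters that sit directly between two digits."""
    return len(DLD.findall(text.lower()))


def is_executable(url):
    """True if the URL path ends with one of the assumed executable suffixes."""
    return urlparse(url).path.lower().endswith(EXECUTABLE_SUFFIXES)
```

Note that award2o12 triggers both patterns: the "o" sits between two digits (DLD) and the first "2" sits between two letters (LDL).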
Descriptive Features (cont.): Fraction of Domain Name
  Categorize characters into letters, digits and symbols
  Split the domain name wherever the character category changes
  Sum the longest token length of each category and divide by the domain name length
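One possible reading of this feature, sketched under the assumption that a "token" is a maximal run of same-category characters. The function names are hypothetical.

```python
from itertools import groupby


def category(ch):
    """Classify a character as letter, digit or symbol."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "symbol"


def domain_fraction(domain):
    """Sum of the longest run per category, divided by the domain length."""
    if not domain:
        return 0.0
    longest = {}
    for cat, run in groupby(domain, key=category):
        longest[cat] = max(longest.get(cat, 0), len(list(run)))
    return sum(longest.values()) / len(domain)


print(domain_fraction("paypal.com"))  # 0.7: letters 6 + symbols 1, over 10
print(domain_fraction("a1b2c3.com"))  # 0.5: letters 3 + digits 1 + symbols 1, over 10
```

Under this reading, a domain that keeps switching between letters and digits scores lower than one made of a few long tokens.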
Descriptive Features (cont.)
  For detecting randomly generated strings: alphabet entropy, number rate
  For detecting abnormal phenomena in the URL string: length, length ratio, letter/digit/symbol count
  For detecting common URL practices: using an IP as the domain name, default port number
  For covering the sparse features: delimiter count, length of the longest word
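The two randomness features can be sketched as follows, assuming "alphabet entropy" means the Shannon entropy of the letter distribution and "number rate" means the fraction of characters that are digits. Both definitions are assumptions; the slide only names the features.

```python
import math
from collections import Counter


def alphabet_entropy(text):
    """Shannon entropy (bits) of the letter distribution; 0.0 if no letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in Counter(letters).values())


def number_rate(text):
    """Fraction of characters that are digits; 0.0 for an empty string."""
    return sum(c.isdigit() for c in text) / len(text) if text else 0.0
```

Randomly generated strings tend to spread their letters evenly, so their entropy is close to the maximum for their length, while dictionary words repeat letters and score lower.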
Collaborative Filtering Models
  We choose two online learning algorithms to update the model, in order to
    save processing time and memory usage
    adjust the model to concept drift in the data stream
  Two views of the features build two filters
    For descriptive features: passive-aggressive algorithm
    For lexical features: confidence-weighted algorithm
  Over-sampling technique for the extremely imbalanced data set
  Online learning is memory-efficient: unlike batch learning, it does not need to keep old instances in memory for training, so it is commonly used for large-scale problems
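To make the online-learning idea concrete, here is a minimal sketch of one passive-aggressive update (the PA-I variant) on a sparse feature dictionary. It is an illustration, not the authors' implementation: the aggressiveness parameter `C`, the feature names and the choice of PA-I are assumptions, and the confidence-weighted learner used for the lexical filter is not shown.

```python
def pa_update(weights, features, label, C=1.0):
    """One PA-I step. weights/features: dict name -> float; label: +1 or -1.

    Returns the hinge loss suffered before the update.
    """
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    loss = max(0.0, 1.0 - label * score)
    sq_norm = sum(v * v for v in features.values())
    if loss > 0.0 and sq_norm > 0.0:
        # Passive when the margin is satisfied, aggressive otherwise.
        tau = min(C, loss / sq_norm)
        for f, v in features.items():
            weights[f] = weights.get(f, 0.0) + tau * label * v
    return loss


# Hypothetical descriptive features of one malicious URL (label +1).
w = {}
x = {"ldl_count": 1.0, "number_rate": 1.0}
first_loss = pa_update(w, x, +1)
second_loss = pa_update(w, x, +1)
```

Each instance is seen once and then discarded, so memory depends only on the number of weights. Over-sampling can be approximated by applying the update several times to each malicious instance.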
Training Process
Prediction Process
Evaluation
  Data set
  Measures
    Download Rate (DR) = (TP + FP) / # of instances
    Missing Malicious Rate (MMR) = FN / (TP + FN)
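The two measures follow directly from the confusion-matrix counts. The sketch below uses invented counts; note that the "malicious coverage" in the industry requirements equals 1 - MMR.

```python
def download_rate(tp, fp, total):
    """DR: share of all instances that the filter flags for download."""
    return (tp + fp) / total


def missing_malicious_rate(tp, fn):
    """MMR: share of malicious instances that the filter fails to flag."""
    return fn / (tp + fn)


# Invented counts for one hour of traffic.
tp, fp, fn, total = 90, 2410, 10, 10000
dr = download_rate(tp, fp, total)     # 0.25
mmr = missing_malicious_rate(tp, fn)  # 0.1, i.e. 90% malicious coverage
```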
Evaluation: Efficiency
  Environment: CPU: dual-core (3.00 GHz); Memory: 4 GB; OS: CentOS 64-bit
  Results
    The security company uses an Intel(R) Xeon(TM) CPU 3.00GHz with 8 cores and consumes 200~400 MB of memory (so do we; it depends on the dictionary size)
    In that environment they handle on average 1.6 million samples per hour, about 10% of total traffic, which takes 21.5 min on average
    Their results are all around 30% (both DR & MMR)
Evaluation: Performance
  Settings
    Use one hour of data for training/updating
    Predict the next hour and record the results
    Compute the daily average of DR & MMR
  Figure panel: Apr.
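The train-on-one-hour, test-on-the-next protocol can be sketched as a loop over hourly batches. The `update` and `predict` callables stand in for whatever online model is used; they, and the toy memorizing model below, are hypothetical.

```python
def evaluate_stream(hours, update, predict):
    """hours: list of hourly batches, each a list of (url, is_malicious).

    Updates on hour h, then records (DR, MMR) on hour h + 1.
    """
    results = []
    for train, test in zip(hours, hours[1:]):
        for x, y in train:
            update(x, y)
        if not test:
            continue  # nothing to score in an empty hour
        tp = fp = fn = 0
        for x, y in test:
            flagged = predict(x)
            if flagged and y:
                tp += 1
            elif flagged:
                fp += 1
            elif y:
                fn += 1
        dr = (tp + fp) / len(test)
        mmr = fn / (tp + fn) if tp + fn else 0.0
        results.append((dr, mmr))
    return results


# Toy model for illustration: flag any URL already seen as malicious.
seen = set()
hours = [
    [("a", True), ("b", False)],
    [("a", True), ("c", True), ("b", False), ("d", False)],
    [("c", True), ("e", False)],
]
results = evaluate_stream(
    hours,
    update=lambda x, y: seen.add(x) if y else None,
    predict=lambda x: x in seen,
)
```

Averaging the 24 hourly pairs of a day would give the daily DR and MMR mentioned in the settings.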
Evaluation: Performance (figure panels: Sep., Nov.-Dec.)
Why Does Collaborative Filtering Work?
  Both filters have a certain accuracy; however, their results are different
  Set the download rate of each filter to around 10%
  Figure (Apr.): average MMR of the descriptive filter; average MMR of the lexical filter
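One way two filters with different results can complement each other is sketched below. It assumes that each filter flags its own top-scoring fraction of URLs and that the flagged sets are combined by union; the slide does not state the combination rule, so this is an assumption, as are the function names.

```python
def top_fraction(scores, fraction):
    """URLs whose score ranks in the top `fraction` (scores: dict url -> float)."""
    k = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def combined_filter(descriptive_scores, lexical_scores, fraction=0.10):
    """Union of what each filter would download at the given rate."""
    return (top_fraction(descriptive_scores, fraction)
            | top_fraction(lexical_scores, fraction))
```

If the two filters miss different malicious URLs, the union misses only those that both filters overlook, at the cost of a download rate of at most twice the per-filter rate.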
The Lessons I Learned
  How to convert a URL stream into an n-dimensional vector space
  How to deal with extremely unbalanced data
  Brainstorming to define and extract features
  Choosing the right learning algorithm
  How to deal with industry
  Malicious URL filtering: A big data application, MS Lin, CY Chiu, YJ Lee, HK Pao, Big Data, 2013 IEEE International Conference on, 589-596
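Fraudulent Item Detection on Ruten Auctions (露天拍賣)
Ruten Case Study: Using Machine Learning to Help Reviewers Detect Fraudulent Items
Ruten (露天) Is No. 1 in the Shopping Category
People fear fame as pigs fear getting fat: being this successful, Ruten was bound to attract trouble
Fraud cases keep emerging. Sure enough, Ruten became a target of fraud rings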
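National Police Agency 165 Anti-Fraud Hotline: High-Risk Marketplace Ranking
Top 10 high-risk marketplaces, January to December 2015 (ROC year 104). Source: National Police Agency 165 statistics
  Rank  Marketplace                                  Total cases
  1     Ruten Auctions (露天拍賣)                     1418
  2     86 小舖                                       824
  3     SHOPPING99                                   804
  4     Yahoo Kimo Auctions (奇摩拍賣)                674
  5     HITO本舖                                      592
  6     小三美日                                      462
  7     Kingstone Online Bookstore (金石堂網路書店)   368
  8     衣芙日系                                      349
  9     Yahoo Kimo Super Mall (奇摩超級商城)          317
  10    Rakuten (樂天)                                217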
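Every E-Commerce Site Has to Fight Counterfeits and Prevent Fraud
What Mechanisms Does Ruten Have to Prevent Fraud? It does have some.
When an item is listed on Ruten, it goes through a screening mechanism, and the items it selects are examined by reviewers. Once an item reaches its product page, anyone who notices a suspected fraudulent item can report it on the product page or directly on the customer service page, and customer service takes the suspicious item down. If a fraudulent item unfortunately slips through to the product pages and deceives a buyer, the buyer can also report it for customer service to handle. The most important link in this chain is therefore the screening mechanism.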
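Challenges Facing the Expert Reviewers: items are listed by the millions. Even working 24 hours a day all year round, with each person checking one item per second, it would take about ten people to get through all the items listed each day.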
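The staffing estimate is simple arithmetic: one reviewer checking one item per second around the clock covers 86,400 items a day. A small sketch, with the daily volumes chosen as illustrative assumptions:

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def reviewers_needed(items_per_day, seconds_per_item=1.0):
    """Reviewers required if each works non-stop at the given pace."""
    return math.ceil(items_per_day * seconds_per_item / SECONDS_PER_DAY)


print(reviewers_needed(864_000))    # 10
print(reviewers_needed(1_000_000))  # 12
```

At a more realistic pace, for example ten seconds per item and eight-hour shifts, the head count grows by a further factor of thirty, which is why manual review of every item is not feasible.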
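How Can We Smartly Select the Items to Inspect? Machine learning: we need an enhanced screening mechanism.
Major Challenges (things are not as simple as one might imagine)
  Extremely imbalanced data
    Good users are the vast majority and bad actors are very few
    But bad actors upload large numbers of fraudulent items
  Fraud rings change their strategies
    Fraudulent items are diverse
    Fraudulent items change with the seasons and trends
  Collecting and preprocessing data is hard
    Clarifying what data is needed
    The needed data is scattered across different data sources
    Ruten's traditional data storage system cannot support the required operations
  Huge data volume
    Millions of items are uploaded every day
    Real-time processing is required
Get Your Hands Dirty
  Interview the reviewers and confirm their SOP
  Identify the required data sources
  Confirm the data flow
  Extract useful data from the different data sources
  Build models from the labelled data accumulated by the reviewers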
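Reviewer Expertise: Accounts
  Taking user accounts as an example, how fraud rings pose as legitimate sellers
  Example account names: re9n6LG0, 4JEMxCRFwW, jiJo4tpK, 7FCXfrgO, 2Ntla, 5YsJSafSg8, jcshgx, 12pm1c
  When the barrier to registering a Ruten account is low, fraud rings register zombie accounts in bulk; the account names usually look like random strings generated by a program
  Fraud rings also hijack legitimate accounts by various means; hijacked accounts have usually been unused for a long time, so they can be identified by comparing the last login date with the listing date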
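Both heuristics can be turned into simple features. The sketch below is an illustration only: counting character-class switches as a proxy for "looks random" and the 180-day dormancy threshold are assumptions, not the reviewers' actual rules.

```python
from datetime import datetime, timedelta


def char_class(ch):
    """Classify a character as digit, upper, lower or other."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return "other"


def class_switches(account):
    """Number of adjacent character pairs whose classes differ."""
    return sum(char_class(a) != char_class(b) for a, b in zip(account, account[1:]))


def dormant_before_listing(last_login, listed_at, threshold=timedelta(days=180)):
    """True if the account was idle for at least `threshold` before listing."""
    return listed_at - last_login >= threshold


print(class_switches("re9n6LG0"))  # 5
print(class_switches("jcshgx"))    # 0
```

The second example shows the limit of the first heuristic: jcshgx also appears in the slide's list but has no class switches, so a feature like letter entropy would be needed alongside it.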
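Reviewer Expertise: Payment Method and Location. Taking item location as an example: fraud rings avoid face-to-face handover, so they often set the item location to a remote area. They also dislike cash on delivery as a payment method. How should the IP be used?
Confirming the Data Sources and Data Flow
System Architecture (diagram components): Ruten main database; log files; mirror of the original database; user profiles; user features and item features; additional applications; prediction model; labeling; expert reviewers
How Do We Look at an Item?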
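Data Set
  Training set: (251827, 20); testing set: (109122, 20)
  Fields, with one sample record of the actual data:
    Listing date (上架日期): 1467043200
    Item category (商品類別): 生活、居家 (Home & Living)
    Item quantity (商品數目): 999
    Payment method (付款方式): 信用卡 (credit card)
    Item price (商品價格): 520
    Item location (商品地點): 台南市 (Tainan City)
    Shipping method (商品運送方式): 7-11取貨 (7-11 pickup)
    Listing method (上架方式): 單品上架 (single-item listing)
    User account (使用者帳號): xxxxx
    Verified (有無認證): yes
    ip: xxx.xx.xxx.xxx
    Previous listing date (上次上架日期): 2016-06-27 23:59:59
    First listing (是否第一次上架): N
  Further sample records (partially shown):
    Category 手機、通訊 (Mobile & Communications), payment 支付連, price 1461, location 台北市 (Taipei City), account xxxxxxx, not verified
    Category 休閒旅遊 (Leisure & Travel), quantity 50, payment 貨到付款 (cash on delivery), price 19, account xxx, ip xx.xxx.xxx.xxx, previous listing date 2016-06-27 23:59:23
  We ultimately chose logistic regression (LR) because it performed best in our experiments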
Leaderboard: AUC
Competition Results
  Models: Logistic Regression, SVM, Gradient Boosting Decision Tree, Naïve Bayes
  Training time: 5~10 min
  Testing time: 1~2 min
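For readers unfamiliar with the leaderboard metric: AUC is the probability that a randomly chosen fraudulent item receives a higher score than a randomly chosen legitimate one, with ties counting as one half. A minimal pairwise sketch (quadratic time, for illustration only; the labels and scores are invented):

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Because AUC depends only on how the scores are ordered, it stays meaningful on extremely imbalanced data where plain accuracy would be close to 100% for a model that flags nothing.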
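From Manual Intelligence to Artificial Intelligence
  Reviewers can check only a relatively small share of the data, even when working day and night
  The sampling scheme lets some fraudulent items slip through the net
  The labelled data accumulated over several months should be put to use!
  We used these labelled data as the "ground truth"
  Our model achieves 0.983 AUC
  1,000 items can be examined per second
  There is hope of examining EVERY item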
Keep the Experts in the Loop
  The reviewers are still important to us!
  We use our model to rank the "suspicious level" of each item
  We let the reviewers examine the "gray" items
  We get newly labelled items from the reviewers
  We re-train our model with the new labelled data
  Tango with the bad guys
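The ranking-and-gray-zone step can be sketched as below. The thresholds 0.2 and 0.9 and the function name are hypothetical; the slide does not say how the gray zone is defined.

```python
def gray_items(scores, low=0.2, high=0.9):
    """Items whose suspicion score falls in [low, high), most suspicious first.

    scores: dict item_id -> model score in [0, 1]; thresholds are assumed.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, score in ranked if low <= score < high]


scores = {"a": 0.95, "b": 0.50, "c": 0.05, "d": 0.70}
print(gray_items(scores))  # ['d', 'b']
```

The labels the reviewers assign to these gray items are then added to the training data for the next re-training round, which is how the model keeps up as fraud strategies change.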
Data Science Team at Ruten: We are hiring!
Strongly Recommended Books
Final Remarks
  "If you torture the data long enough, it will confess to anything" - Ronald H. Coase
  Your data analytics results cannot go beyond the laws of nature
  Working with domain experts is very important
  Know the algorithms you use
  Be eager for data
Questions?
Thank You!
Social Network Services: Part of the Data Sources from Cyberspace
Your Smart Home: Part of the Data Sources from the Physical World. I am watching you! Image: http://couturedigital.com/Wordpress/wp-content/uploads/2011/03/23-Zone-Smart-Home-Hertfordshire.jpg