The Growth of a Data Scientist: My Path in Data Science

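The Growth of a Data Scientist: My Path in Data Science. Yuh-Jye Lee (李育杰), Data Science and Machine Intelligence Lab, Department of Applied Mathematics, National Chiao Tung University. Taiwan Data Science Conference, July 14-17, 2016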

Big Data: The 3 Vs

Three musts for a Taiwan Data Science Conference talk: be interesting, be substantive, be useful

Agenda: From Data Mining to Big Data; Some of My Experiences in Data Science (Breast Cancer Diagnosis and Prognosis; Malicious URL Detection; Fraudulent Item Detection on Ruten Auction (露天拍賣); …); Final Remarks

Breast Cancer Diagnosis and Prognosis

Cell Nuclei of a Fine Needle Aspirate: tissue-fluid cells under an electron microscope. Reference: WN Street, WH Wolberg, OL Mangasarian, Nuclear feature extraction for breast tumor diagnosis, IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, 861-870

From Electron Microscope Images to Cell Features

Breast Cancer Diagnosis via SVM: 97% ten-fold cross-validation correctness on 780 patients (494 benign, 286 malignant)

Who Will Benefit from Chemotherapy?

Survival Curves for All Patients: With and Without Chemotherapy

Overall Clustering Process: 253 patients (113 NoChemo, 140 Chemo). Compute initial cluster centers as the medians, over 6 features, of two seed groups: Good1 (Lymph=0 AND Tumor<2) and Poor1 (Lymph>=5 OR Tumor>=4). Cluster the 113 NoChemo patients with the k-median algorithm, using the medians of Good1 and Poor1 as initial centers: 69 NoChemo Good, 44 NoChemo Poor. Cluster the 140 Chemo patients the same way: 67 Chemo Good, 73 Chemo Poor. The resulting clusters are then grouped into Good, Intermediate, and Poor.
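
The clustering step on this slide can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes L1 (city-block) distance and coordinate-wise medians, which is the usual reading of "k-median", and the names `k_median` and `l1` are hypothetical. The initial centers would be the medians of the Good1 and Poor1 seed groups.

```python
from statistics import median


def l1(a, b):
    """City-block distance between two feature vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_median(points, initial_centers, max_iter=100):
    """Cluster points around coordinate-wise medians, starting from given centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: l1(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the per-feature median of its cluster.
        new_centers = [
            [median(column) for column in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(len(centers)), key=lambda i: l1(p, centers[i])) for p in points]
    return centers, labels


# Toy data with two features; the real study used 6 features per patient.
toy_points = [(0, 1), (1, 0), (0, 0), (6, 5), (5, 6), (6, 6)]
toy_centers, toy_labels = k_median(toy_points, [(0, 0), (5, 5)])
```

Seeding with medians of clinically defined groups, rather than random points, ties each resulting cluster to a "good" or "poor" prognosis from the start.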

Survival Curves for Good, Intermediate & Poor Groups

Survival Curves for Intermediate Group: Split by Chemo & NoChemo

The Lessons I Learned: Privacy is an issue. Working with domain experts is EXTREMELY important. Y2K. "I was of humble station when young, which is why I am skilled in many menial things" (少也賤,故能多鄙事, Analects of Confucius). References: YJ Lee, OL Mangasarian, WH Wolberg, Breast cancer survival and chemotherapy: a support vector machine analysis, Discrete Mathematical Problems with Medical Applications, DIMACS Workshop; YJ Lee, OL Mangasarian, WH Wolberg, Survival-time classification of breast cancer patients, Computational Optimization and Applications 25 (1-3), 151-166

Malicious URL Detection: Can you filter out the benign URLs based ONLY on the URL stream? Threats: social engineering, spam, identity fraud, drive-by download, botnet (zombie network). 09/10/2013, Lab of Data Science & Machine Intelligence

Malicious Websites: Malicious websites have become tools for spreading criminal activity on the Web, such as phishing and malware. Example: (a) paypal.com versus (b) paypal.com-us.cgi-bin-webscr...

Defenses against Malicious URLs. Blacklist services: PhishTank; Spam and Open Relay Blocking System (SORBS); Real-time URI Blacklist (URIBL). Malicious URL detection: Google Safe Browsing; Trend Micro Web Reputation Service

Why Do We Need a Filtering Mechanism? URL requests are received from users all over the world: 3,000,000,000 to 7,000,000,000 per day, of which 200,000,000 to 800,000,000 need to be analyzed. Only 0.01% are malicious URLs. The volume of requested URLs is too large to use host-based information (whois information) or content information, hence the need for a filtering mechanism.

Requirements from Industry. No page content may be used for prioritization (prioritization means returning the most suspicious URLs). No host-based information is allowed. Effectiveness: Filtering (Download) Rate = Filtered URLs / Total URLs < 25%; Malicious Coverage = Filtered Malicious URLs / Total Malicious URLs > 75%. Performance: filter more than 2,000 URLs per second on one dual-core VM with 4 GB of memory. Scalability: one hour of data should be consumed within one hour.

Big Challenges. Large-scale data streaming: one million URLs are received per hour on average. High-dimensional and sparse representation: lexical information makes the feature vectors very sparse. Extremely imbalanced data set: only about 0.01% of the URLs are malicious. Short lifetime: malicious URLs usually live only a very short time, whereas normal URLs stay up longer for usability. Speaker note: this slide should emphasize the 3 Vs.

Finding a Needle in a Haystack

Our Main Results (figure): malicious URL covering rate plotted against filtering rate (≈ false positive rate). Marked covering-rate levels: 90%; 75% (the requirement); 25% (if filtering uniformly at random).

Feature Extraction. Limitations: we cannot use host-based information or web page content. Two types of feature sets are proposed as different views for inspecting a received URL: lexical features and descriptive features.

Lexical Features: Information of Words. The words in a URL string are translated into a Boolean vector; each Boolean value represents the occurrence of a specific word. Each URL component is split by specific delimiters, and the words are saved in a dictionary.
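
A minimal sketch of this word-to-Boolean-vector step, under stated assumptions: the slide does not list the delimiters, so the `DELIMITERS` set is a guess, and tagging each word with its URL component (`host:`, `path:`, `query:`) is an illustrative choice. `url_words` and `WordDictionary` are hypothetical names. The vector is kept sparse by returning only the indices whose value is True.

```python
import re
from urllib.parse import urlsplit

# Assumed delimiter set; the slide only says "specific delimiters".
DELIMITERS = re.compile(r"[/.?=&_\-:;,+%]+")


def url_words(url):
    """Split each URL component into lowercase words tagged by component."""
    parts = urlsplit(url if "//" in url else "//" + url)
    words = []
    for name, text in (("host", parts.netloc), ("path", parts.path), ("query", parts.query)):
        words.extend(f"{name}:{w.lower()}" for w in DELIMITERS.split(text) if w)
    return words


class WordDictionary:
    """Maps each word seen so far to a column of the Boolean feature vector."""

    def __init__(self):
        self.index = {}

    def encode(self, url):
        """Return the sorted column indices that are True for this URL."""
        active = set()
        for word in url_words(url):
            # New words get the next free column; known words reuse theirs.
            active.add(self.index.setdefault(word, len(self.index)))
        return sorted(active)


dictionary = WordDictionary()
first = dictionary.encode("http://paypal.com-us.example/cgi-bin/webscr?cmd=login")
second = dictionary.encode("http://paypal.com/login")
```

Because the dictionary grows with the stream, the vector dimension is unbounded, which is why the next slide is concerned with keeping the dictionary small.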

Lexical Features (cont.). A three-character sliding window is applied to the domain name, to catch malicious websites that slightly modify a domain name. To reduce the memory usage of the dictionary: remove zero-weight words; remove words from argument values; replace IP addresses with AS numbers (using a static mapping table); replace the digits in a word with a regular expression (for example, cool567 becomes cool[0-9]+); keep only the words generated in the last 24 hours.
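
Two of these steps are easy to illustrate: the three-character sliding window and the digit replacement. The function names below are hypothetical, and the sketch leaves out the AS-number mapping and the 24-hour expiry.

```python
import re


def char_trigrams(domain):
    """Three-character sliding window over a domain name."""
    return [domain[i:i + 3] for i in range(len(domain) - 2)]


def normalize_digits(word):
    """Collapse every run of digits into the literal pattern [0-9]+."""
    return re.sub(r"[0-9]+", "[0-9]+", word)


genuine = set(char_trigrams("paypal"))
lookalike = set(char_trigrams("paypa1"))
shared = genuine & lookalike          # windows that survive the one-character change
pattern_word = normalize_digits("cool567")
```

A look-alike domain shares most of its windows with the genuine one, so the two map to overlapping features even though the full words differ. Collapsing digits means cool567 and cool8 occupy a single dictionary entry.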

Descriptive Features: Static Characteristics of the URL String. Descriptive features are observed from malicious websites and are not easily changed by modifying the URL. For detecting phishing websites: a digit between two letters (LDL), e.g. Examp1e; a letter between two digits (DLD), e.g. award2o12. For detecting malware websites: whether the URL points to an executable file.
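
These three features could be computed along the following lines. This is a sketch: the regular expressions are one reading of "a digit between two letters" and "a letter between two digits", and the list of executable suffixes is an assumption, since the slide does not give one. `descriptive_flags` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Lookarounds let adjacent matches share their neighbouring characters.
LDL = re.compile(r"(?<=[A-Za-z])[0-9](?=[A-Za-z])")  # digit between two letters
DLD = re.compile(r"(?<=[0-9])[A-Za-z](?=[0-9])")     # letter between two digits
EXECUTABLE_SUFFIXES = (".exe", ".scr", ".bat", ".msi")  # assumed list


def descriptive_flags(url):
    """Count LDL/DLD patterns and check whether the path names an executable."""
    path = urlsplit(url).path.lower()
    return {
        "ldl": len(LDL.findall(url)),
        "dld": len(DLD.findall(url)),
        "executable": path.endswith(EXECUTABLE_SUFFIXES),
    }


phishing_like = descriptive_flags("http://examp1e.test/index.html")
malware_like = descriptive_flags("http://award2o12.test/setup.exe")
```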

Descriptive Features (cont.): Fraction of Domain Name. Categorize characters as letters, digits, or symbols; split the domain name wherever the character category changes; sum the longest token length of each category and divide by the domain name length.
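
One way to read this recipe is shown below. It is an interpretation of the slide, not the authors' definition, and `category` and `domain_fraction` are hypothetical names.

```python
from itertools import groupby


def category(ch):
    """Classify a character as letter, digit, or symbol."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "symbol"


def domain_fraction(domain):
    """Sum of the longest run per character category, divided by the domain length."""
    if not domain:
        return 0.0
    longest = {}
    # groupby yields one run each time the character category changes.
    for cat, run in groupby(domain, key=category):
        length = sum(1 for _ in run)
        longest[cat] = max(longest.get(cat, 0), length)
    return sum(longest.values()) / len(domain)


plain = domain_fraction("example.com")      # runs: example | . | com
mixed = domain_fraction("award2o12.com")    # runs: award | 2 | o | 12 | . | com
```

Under this reading, a domain built from a few long runs scores close to 1, while one that keeps switching between letters and digits scores lower.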

Descriptive Features (cont.). For detecting randomly generated strings: alphabet entropy; number rate. For detecting abnormal phenomena in the URL string: length; length ratio; letter, digit, and symbol counts. For detecting common URL tricks: using an IP address as the domain name; default port number. For covering the sparse features: delimiter count; length of the longest word.
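
The slide only names the two randomness features. A plausible sketch, assuming "alphabet entropy" means the Shannon entropy of the letter distribution and "number rate" means the share of characters that are digits:

```python
import math
from collections import Counter


def alphabet_entropy(text):
    """Shannon entropy (in bits) of the letters in text, ignoring case."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    total = len(letters)
    counts = Counter(letters)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def number_rate(text):
    """Fraction of characters that are digits."""
    if not text:
        return 0.0
    return sum(c.isdigit() for c in text) / len(text)


repeated = alphabet_entropy("aaaa")   # one letter only: no uncertainty
spread = alphabet_entropy("abcd")     # four equally likely letters: 2 bits
digits = number_rate("a1b2")
```

Randomly generated strings tend to spread their characters evenly and mix in digits, which pushes both values up relative to dictionary words.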

Collaborative Filtering Models. We choose two online learning algorithms to update the model, in order to save processing time and memory and to adapt the model to concept drift in the data stream. The two views of features are used to build two filters: the passive-aggressive algorithm for descriptive features and the confidence-weighted algorithm for lexical features. An over-sampling technique handles the extremely imbalanced data set. Online learning is memory-efficient: unlike batch learning, it does not need to keep old instances in memory for training, so it is commonly used for large-scale problems.
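
To make the online-learning idea concrete, here is a minimal sketch of a single passive-aggressive update. The slides do not say which variant was used; this shows the common PA-I form with hinge loss, and the aggressiveness parameter `C` and the function name `pa_update` are assumptions.

```python
def pa_update(weights, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {+1, -1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss on this one example
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(weights)               # passive: the margin is already satisfied
    tau = min(C, loss / norm_sq)           # aggressive: step size capped at C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 2.0], +1)   # moves just far enough to reach margin 1
w2 = pa_update(w1, [1.0, 2.0], +1)   # same example again: no further change
```

Each update touches only the current example, so no past instances need to be stored, which is the memory advantage the slide describes.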

Training Process

Prediction Process

Evaluation. Data Set. Measures: Download Rate (DR) = (TP + FP) / number of instances; Missing Malicious Rate (MMR) = FN / (TP + FN)
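
The two measures translate directly into code. The counts in the example are made up for illustration (one million URLs, 0.01% of them malicious) and are not results from the talk.

```python
def download_rate(tp, fp, fn, tn):
    """DR: share of all instances that the filter passes on for analysis."""
    return (tp + fp) / (tp + fp + fn + tn)


def missing_malicious_rate(tp, fn):
    """MMR: share of malicious instances that the filter fails to catch."""
    return fn / (tp + fn)


# Hypothetical hour of traffic: 1,000,000 URLs, 100 of them malicious.
tp, fn = 90, 10
fp, tn = 99_910, 899_990
dr = download_rate(tp, fp, fn, tn)
mmr = missing_malicious_rate(tp, fn)
coverage = 1.0 - mmr   # malicious coverage is the complement of MMR
```

In these terms, the industry requirement of a filtering rate below 25% and a malicious coverage above 75% corresponds to DR < 0.25 and MMR < 0.25.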

Evaluation: Efficiency. Environment: CPU: dual-core (3.00 GHz); memory: 4 GB; OS: CentOS 64-bit. Results: the security company uses an Intel(R) Xeon(TM) CPU at 3.00 GHz with 8 cores and consumes 200 to 400 MB of memory (as do we, depending on the dictionary size). In that environment they handle an average of 1.6 million samples per hour, about 10% of total traffic, which takes 21.5 min on average, and their results are all around 30% (both DR and MMR).

Evaluation: Performance. Settings: use one hour of data for training/updating; predict the next hour and record the results; compute the daily average of DR and MMR. (Figure: results for Apr.)

Evaluation: Performance (figures: results for Sep. and Nov.-Dec.)

Why Does Collaborative Filtering Work? Both filters achieve a certain accuracy, but their results differ. The download rate of each filter is set to around 10%. (Figure, Apr.: average MMR of the descriptive filter and average MMR of the lexical filter.)

The Lessons I Learned: how to convert a URL stream into an n-dimensional vector space; how to deal with extremely unbalanced data; brainstorming to define and extract features; choosing the right learning algorithm; how to deal with industry. Reference: MS Lin, CY Chiu, YJ Lee, HK Pao, Malicious URL filtering: a big data application, 2013 IEEE International Conference on Big Data, 589-596

Fraudulent Item Detection on Ruten Auction (露天拍賣)

Ruten Case Study: Using Machine Learning to Help Reviewers Detect Fraudulent Items. 2018/11/10, Lab of Data Science & Machine Intelligence

Ruten (露天) Is Ranked No. 1 in the Shopping Category

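As the proverb goes, people fear fame as pigs fear getting fat: with Ruten doing this well, trouble is bound to come knocking.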

Fraud cases keep emerging. Sure enough, Ruten has become a target for fraud rings.

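High-Risk Marketplace Ranking from the 165 Anti-Fraud Hotline of the National Police Agency (警政署): top 10 high-risk marketplaces, January to December of ROC year 104 (2015). Source: National Police Agency 165 statistics.
Rank | Marketplace | Total cases
1 | 露天拍賣 (Ruten Auction) | 1418
2 | 86 小舖 | 824
3 | SHOPPING99 | 804
4 | 奇摩拍賣 (Yahoo! Kimo Auction) | 674
5 | HITO本舖 | 592
6 | 小三美日 | 462
7 | 金石堂網路書店 (Kingstone Online Bookstore) | 368
8 | 衣芙日系 | 349
9 | 奇摩超級商城 (Yahoo! Kimo Super Mall) | 317
10 | 樂天 (Rakuten) | 217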

Every online retailer has to fight counterfeits and prevent fraud

What mechanisms does Ruten have to prevent fraud? It does have some.

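When items are listed on Ruten, every item passes through a screening mechanism, and the items it selects are checked by human reviewers. Once an item reaches its product page, anyone who notices a suspected fraudulent item can report it on the product page or directly on the customer service page, and customer service takes the suspicious item down. If a fraudulent item unfortunately slips through to a product page and deceives a consumer, the consumer can also report it for customer service to handle. The most important link in this chain is therefore the screening mechanism.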

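Challenges Facing the Inspection Experts: Listed items number in the millions. Even working 24 hours a day all year round, with one person checking one item per second (86,400 items per person per day), it would take about ten people to get through all the items listed each day.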

How can we smartly select the items that need checking? Machine Learning: we need an enhanced screening mechanism.

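Major Challenges (things are not as simple as they seem). Extremely imbalanced data: honest sellers are the vast majority and bad actors are very few, but the bad actors upload large numbers of fraudulent items. Fraud rings change their strategies: fraudulent items are diverse and change with seasons and trends. Collecting and preprocessing data is difficult: we must clarify what data is needed; the needed data is scattered across different data sources; Ruten's traditional data storage system cannot support the required operations. Huge data volume: millions of items are uploaded every day and must be processed in real time.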

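Get Your Hands Dirty: interview the reviewers and confirm the SOP; identify the required data sources; confirm the data flow; extract useful data from the different data sources; build models from the labelled data accumulated by the reviewers.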

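Reviewer Expertise: Accounts. Taking user accounts as an example, these are the ways fraud rings pose as legitimate sellers. Example account names: re9n6LG0, 4JEMxCRFwW, jiJo4tpK, 7FCXfrgO, 2Ntla, 5YsJSafSg8, jcshgx, 12pm1c. When the bar for opening a Ruten account is low, fraud rings register zombie accounts in bulk; the account names usually look random because they are generated by a program. They also hijack legitimate users' accounts by various means; hijacked accounts have usually been dormant for a long time, so they can be identified by comparing the last login date with the listing date.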
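
The dormant-account heuristic can be turned into a feature along these lines. This is a sketch under assumptions: the 180-day cutoff, the function names, and the argument names are hypothetical and are not taken from the Ruten system.

```python
from datetime import datetime

DORMANCY_THRESHOLD_DAYS = 180  # assumed cutoff; the slide gives no number


def dormancy_days(previous_login, listed_at):
    """Whole days between the account's previous login and the new listing."""
    return max(0, (listed_at - previous_login).days)


def looks_dormant(previous_login, listed_at, threshold=DORMANCY_THRESHOLD_DAYS):
    """Flag listings posted from an account that had been idle for a long time."""
    return dormancy_days(previous_login, listed_at) >= threshold


gap = dormancy_days(datetime(2015, 1, 1), datetime(2016, 6, 27))
flagged = looks_dormant(datetime(2015, 1, 1), datetime(2016, 6, 27))
recent = looks_dormant(datetime(2016, 6, 20), datetime(2016, 6, 27))
```

In practice the raw gap in days would probably be fed to the model as a numeric feature, leaving the cutoff to be learned instead of fixed by hand.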

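Reviewer Expertise: Payment Method and Location. Taking item location as an example, fraud rings avoid in-person handovers, so they often set the item location to a relatively remote area. They also dislike cash on delivery as a payment method. How should the IP address be used?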

Confirming the Data Sources and Data Flow

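System Architecture (diagram components): Ruten main database; log files; mirror of the original database; user profiles; user features and item features; additional applications; prediction model; labeling; inspection experts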

How Do We Look at an Item?

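Data Set. Training set: (251827, 20); testing set: (109122, 20).
Columns shown: listing date | item category | item quantity | payment method | item price | item location | shipping method | listing method | user account | verified | ip | previous listing date | first listing
Sample row: 1467043200 | Home & Living (生活、居家) | 999 | credit card | 520 | Tainan City | 7-11 pickup | single-item listing | xxxxx | yes | xxx.xx.xxx.xxx | 2016-06-27 23:59:59 | N
Two further sample rows survive only in part: (Mobile & Communications (手機、通訊), 支付連, 1461, Taipei City, xxxxxxx, not verified) and (Leisure & Travel (休閒旅遊), 50, cash on delivery, 19, xxx, xx.xxx.xxx.xxx, 2016-06-27 23:59:23).
Speaker note: the actual data is shown above. We finally chose logistic regression (LR) because it performed best in our experiments.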

Leaderboard: AUC

Competition Results. Models: Logistic Regression, SVM, Gradient Boost Decision Tree, Naïve Bayes. Training time: 5~10 min. Testing time: 1~2 min.

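From Human Labor to Artificial Intelligence: Reviewers can check only a relatively small share of the data, even when working day and night. The sampling scheme lets some fish slip through the net. The labelled data accumulated over several months should be put to work! We used these labelled data as the "ground truth". Our model achieves 0.983 AUC. 1,000 items can be examined per second. There is hope of examining EVERY item.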
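
For readers unfamiliar with the metric, AUC is the probability that a randomly chosen fraudulent item receives a higher score than a randomly chosen legitimate one. The sketch below computes it straight from that definition. It is quadratic in the number of items and so is meant only as an illustration, and the labels and scores are invented.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = fraudulent, 0 = legitimate.
example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Because AUC depends only on how the items are ordered, it suits the extremely imbalanced setting described earlier, where plain accuracy would be dominated by the legitimate majority.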

Keep the Experts in the Loop: The reviewers are still important to us! We use our model to rank the "suspicious level" of each item. We can let the reviewers examine the "gray" items, get newly labelled items from them, and re-train our model with the new labelled data. Tango with the bad guys.
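
The ranking-and-review loop might be organised as below. This is a sketch only: the two thresholds and the name `triage` are assumptions, and the slide does not say how "gray" is defined.

```python
def triage(scored_items, low=0.2, high=0.8):
    """Rank items by suspicion score and split them into three bands.

    scored_items: iterable of (item_id, score) pairs with score in [0, 1].
    Returns (suspicious, gray, clear), each ordered from most to least suspicious.
    """
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    suspicious = [item for item, score in ranked if score >= high]
    gray = [item for item, score in ranked if low < score < high]
    clear = [item for item, score in ranked if score <= low]
    return suspicious, gray, clear


# Invented scores for four listings.
suspicious, gray, clear = triage([("a", 0.95), ("b", 0.5), ("c", 0.05), ("d", 0.7)])
```

The gray band is what would go to the reviewers. Their decisions become new labelled data for the next round of re-training, which keeps the model current as fraud strategies change.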

Data Science Team at Ruten: We are hiring!

Strongly Recommended Books

Final Remarks: "If you torture the data long enough, it will confess to anything" (Ronald H. Coase). Your data analytics results cannot go beyond the laws of nature. Working with domain experts is very important. Know the algorithms you use. Be eager for data.

Questions?

Thank You!

Social Network Services: Part of the Data Sources from Cyberspace

Your Smart Home: Part of the Data Sources from the Physical World. I am watching you! Image: http://couturedigital.com/Wordpress/wp-content/uploads/2011/03/23-Zone-Smart-Home-Hertfordshire.jpg