The Growth of a Data Scientist: My Road to Data Science (我的資料科學之路)

Presentation transcript:

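The Growth of a Data Scientist (My Road to Data Science). 李育杰, Data Science and Machine Intelligence Lab, Department of Applied Mathematics, National Chiao Tung University. Taiwan Data Science Conference, July 14-17, 2016.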

Big Data: The 3 Vs

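Three musts for a talk at the Taiwan Data Science Conference: it must be interesting, it must have substance, and it must be useful.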

Agenda: From Data Mining to Big Data. Some of my experiences in data science: Breast Cancer Diagnosis and Prognosis; Malicious URL Detection; Fraudulent Item Detection on Ruten Auction (露天拍賣); … Final Remarks.

Breast Cancer Diagnosis and Prognosis

Cell Nuclei of a Fine Needle Aspirate: cells from tissue fluid viewed under an electron microscope. Source: Nuclear feature extraction for breast tumor diagnosis, WN Street, WH Wolberg, OL Mangasarian, IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, 861-870.

From Electron Microscope Images to Cell Features

Breast Cancer Diagnosis via SVM: 97% ten-fold cross-validation correctness on 780 patients (494 benign, 286 malignant).

Who Will Benefit from Chemotherapy?

Survival Curves for All Patients: With and Without Chemotherapy

Overall Clustering Process: 253 patients (113 NoChemo, 140 Chemo). Compute initial cluster centers: take Good1 (Lymph=0 AND Tumor<2) and Poor1 (Lymph>=5 OR Tumor>=4) and compute the median of each group using 6 features. Cluster the 113 NoChemo patients with the k-Median algorithm, using the medians of Good1 and Poor1 as initial centers: 69 NoChemo Good, 44 NoChemo Poor. Cluster the 140 Chemo patients with the k-Median algorithm, using the same initial centers: 67 Chemo Good, 73 Chemo Poor. The resulting groups are labelled Good, Intermediate, and Poor.
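The k-Median step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1-norm distance, the function name `k_median`, and the toy 2-D points are assumptions (the study itself used 6 features and the Good1/Poor1 medians as initial centers).

```python
from statistics import median


def k_median(points, centers, max_iter=100):
    """Cluster points around coordinate-wise medians, starting from given centers."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest center (1-norm distance, an assumption).
        new_assignment = [
            min(range(len(centers)),
                key=lambda j: sum(abs(a - b) for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers will not move either
        assignment = new_assignment
        # Move each center to the coordinate-wise median of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(median(col) for col in zip(*members))
    return centers, assignment


# Toy data: two obvious groups, seeded with one center near each.
toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
toy_centers, toy_labels = k_median(toy_points, [(0, 0), (10, 10)])
```

Seeding with medians of clinically defined groups, as the slide describes, gives each cluster a meaning ("good" or "poor") before the algorithm runs.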

Survival Curves for Good, Intermediate & Poor Groups

Survival Curves for Intermediate Group: Split by Chemo & NoChemo

The Lessons I Learned: Privacy is an issue. Working with domain experts is EXTREMELY important. Y2K. "I was of humble station when young, which is why I can do many menial things" (少也賤，故能多鄙事). References: Breast cancer survival and chemotherapy: a support vector machine analysis, YJ Lee, OL Mangasarian, WH Wolberg, Discrete Math. Problem with Medical Application, DIMACS Workshop. Survival-time classification of breast cancer patients, YJ Lee, OL Mangasarian, WH Wolberg, Computational Optimization and Applications 25 (1-3), 151-166.

Malicious URL Detection: Can you filter out the benign URLs based ONLY on the URL stream? Threats: social engineering, spam, identity fraud, drive-by download, botnet (zombie network). 09/10/2013 Lab of Data Science & Machine Intelligence

Malicious Websites: Malicious websites have become tools for spreading criminal activity on the Web, through phishing and malware. Example: (a) paypal.com versus (b) paypal.com-us.cgi-bin-webscr...

Defences against Malicious URLs. Blacklist services: PhishTank, Spam and Open Relay Blocking System (SORBS), Real-time URI Blacklist (URIBL). Malicious URL detection: Google (Google Safe Browsing), Trend Micro (Web Reputation Service).

Why Do We Need a Filtering Mechanism? URL requests arrive from users all over the world: 3,000,000,000 to 7,000,000,000 per day, of which 200,000,000 to 800,000,000 need to be analyzed. Only 0.01% are malicious URLs. Speaker note: emphasize that the volume of requested URLs is too large to use host-based information (whois information) or content information.

Requirements from Industry. No page content may be used for prioritization, where prioritization means returning the most suspicious URLs. No host-based information is allowed. Effectiveness: Filtering (Download) Rate = Filtered URLs / Total URLs < 25%; Malicious Coverage = Filtered Malicious URLs / Total Malicious URLs > 75%. Performance: filtering more than 2000 URLs per second on 1 dual-core VM with 4GB memory. Scalability: one hour of data should be consumed within one hour.

Big Challenges. Large-scale data streaming: one million URLs are received per hour on average. High-dimensional and sparse representation: lexical information makes the feature vector very sparse. Extremely imbalanced data set: only about 0.01% of the URLs are malicious. Malicious URLs usually have a very short lifetime, whereas normal URLs stay up longer for usability. Speaker note: this page should emphasize the 3 Vs.

Finding a Needle in a Haystack

Our Main Results: the malicious URL covering rate reaches 90%, against a requirement of 75% and the 25% expected if URLs were filtered uniformly at random. Filtering Rate ≈ False Positive Rate.

Feature Extraction. Limitations: we cannot use host-based information or web page content information. Two types of feature sets are proposed, as different views for inspecting a received URL: lexical features and descriptive features.

Lexical Features: Information of Words. The words in a URL string are translated into a Boolean vector, in which each Boolean value represents the occurrence of a specific word. Each URL component is split by specific delimiters, and the words are saved in a dictionary.
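A minimal sketch of this word-occurrence encoding, under stated assumptions: the delimiter set in `DELIMITERS`, the `Dictionary` class, and the dense list output are illustrative choices, not details taken from the talk. A production system would almost certainly store the vector sparsely.

```python
import re

# Hypothetical delimiter set; the talk does not list the delimiters it used.
DELIMITERS = r"[/?.=&_\-:]+"


def url_words(url):
    """Split a URL into lower-cased words at the delimiters."""
    return [w for w in re.split(DELIMITERS, url.lower()) if w]


class Dictionary:
    """Maps each word seen so far to a fixed position in the Boolean vector."""

    def __init__(self):
        self.index = {}

    def add(self, words):
        for w in words:
            self.index.setdefault(w, len(self.index))

    def encode(self, words):
        # One Boolean per dictionary word, in the order the words were first seen.
        present = set(words)
        return [w in present for w in self.index]


vocab = Dictionary()
vocab.add(url_words("http://paypal.com/cgi-bin/webscr?cmd=login"))
vector = vocab.encode(url_words("http://example.com/login"))
```

Words that are not yet in the dictionary are simply ignored by `encode`, which is one simple way to cope with a vocabulary that keeps growing as the stream arrives.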

Lexical Features (cont.). A three-character sliding window is applied to the domain name, to catch malicious websites that slightly modify their domain names. To reduce the memory usage of the dictionary: remove zero-weight words; remove words from argument values; replace IPs with AS numbers (using a static mapping table); replace the digits in a word with a regular expression (example: replace cool567 with cool[0-9]+); keep only the words generated in the last 24 hours.
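Two of these steps, the three-character sliding window and the digit replacement, are easy to illustrate. The function names below are hypothetical, and the sketch assumes the window slides one character at a time.

```python
import re


def domain_trigrams(domain):
    """All three-character windows over the domain name, one step at a time."""
    return [domain[i:i + 3] for i in range(len(domain) - 2)]


def normalize_digits(word):
    """Replace each run of digits with the literal pattern text [0-9]+."""
    return re.sub(r"[0-9]+", "[0-9]+", word)


print(domain_trigrams("paypal"))    # ['pay', 'ayp', 'ypa', 'pal']
print(normalize_digits("cool567"))  # cool[0-9]+
```

Collapsing digit runs means that cool567 and cool981 share a single dictionary entry, which is how this step saves memory.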

Descriptive Features: Static Characteristics of the URL String. Descriptive features are observed from malicious websites. For detecting phishing websites: a digit between two letters (LDL), e.g. Examp1e; a letter between two digits (DLD), e.g. award2o12. For detecting malware websites: executable file or not. Descriptive features are not easily changed by modifying the URL.
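The LDL and DLD patterns can be counted with regular-expression lookarounds. This is a sketch assuming ASCII letters and digits only; the function names are illustrative.

```python
import re

# A digit with a letter on both sides (LDL), and a letter with a digit on both sides (DLD).
LDL = re.compile(r"(?<=[A-Za-z])[0-9](?=[A-Za-z])")
DLD = re.compile(r"(?<=[0-9])[A-Za-z](?=[0-9])")


def count_ldl(text):
    return len(LDL.findall(text))


def count_dld(text):
    return len(DLD.findall(text))


print(count_ldl("Examp1e"))    # 1
print(count_dld("award2o12"))  # 1
```

Lookarounds do not consume the neighbouring characters, so overlapping occurrences such as a1b2c are each counted.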

Descriptive Features (cont.). Fraction of domain name: categorize characters into letters, digits and symbols; split the domain name wherever the category changes; sum the longest token length of each category and divide by the domain name length.
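One possible reading of this feature, as a sketch: the treatment of every non-alphanumeric character as a "symbol" and the function names are assumptions.

```python
from itertools import groupby


def char_category(ch):
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "symbol"


def domain_fraction(domain):
    """Sum of the longest run per character category, divided by the domain length."""
    if not domain:
        return 0.0
    longest = {}
    # groupby yields maximal runs of consecutive characters in the same category.
    for category, run in groupby(domain, key=char_category):
        longest[category] = max(longest.get(category, 0), len(list(run)))
    return sum(longest.values()) / len(domain)


print(domain_fraction("paypal.com"))  # 0.7: longest letter run 6 + longest symbol run 1, over 10
```

Under this reading, a domain that keeps switching between letters and digits (for example a1b2c3) scores low, while an ordinary word-like domain scores close to 1.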

Descriptive Features (cont.). For detecting randomly generated strings: alphabet entropy, number rate. For detecting abnormal phenomena in the URL string: length, length ratio, and letter, digit and symbol counts. For detecting common URL tricks: using an IP as the domain name, default port number. For covering the sparse features: delimiter count, the length of the longest word.
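The slide does not define alphabet entropy or number rate. The sketch below assumes the usual readings: Shannon entropy (in bits) of the letter distribution, and the fraction of characters that are digits.

```python
import math
from collections import Counter


def alphabet_entropy(text):
    """Shannon entropy, in bits, of the distribution of letters in the string."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in Counter(letters).values())


def number_rate(text):
    """Fraction of characters that are digits."""
    return sum(c.isdigit() for c in text) / len(text) if text else 0.0


print(alphabet_entropy("abcd"))  # 2.0: four equally likely letters
print(number_rate("ab12"))       # 0.5
```

Randomly generated strings tend to spread their characters evenly, which pushes the entropy up relative to natural-language words of the same length.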

Collaborative Filtering Models. We choose two online learning algorithms to update the model, in order to save processing time and memory and to adapt the model to concept drift in the data stream. The two views of the features are used to build two filters: the passive-aggressive algorithm for descriptive features, and the confidence-weighted algorithm for lexical features. An over-sampling technique handles the extremely imbalanced data set. Online learning is memory-efficient: unlike batch learning, an online learning algorithm does not need to keep old instances in memory for training, so it is commonly used for large-scale problems.
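To show why online learning needs no stored history, here is a minimal sketch of one passive-aggressive update for binary classification. It uses the PA-I variant with hinge loss; which variant and which aggressiveness parameter `C` the talk used is not stated, so both are assumptions, as is the sparse dict representation.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step. w: weight dict, x: sparse feature dict, y: label in {-1, +1}."""
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - margin)          # hinge loss
    norm_sq = sum(v * v for v in x.values())
    if loss > 0.0 and norm_sq > 0.0:
        tau = min(C, loss / norm_sq)       # step size, capped by C
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v
    return w                               # unchanged when the margin is already >= 1


weights = {}
example = {"ldl_count": 1.0, "number_rate": 1.0}   # hypothetical feature names
pa_update(weights, example, +1)   # weights become 0.5 for each feature
pa_update(weights, example, +1)   # margin is now 1.0, so nothing changes
```

Each instance is used once and then discarded: the model is "passive" when the example is already classified with enough margin and "aggressive" otherwise.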

Training Process

Prediction Process

Evaluation. Data set. Measures: Download Rate (DR) = (TP + FP) / # of instances; Missing Malicious Rate (MMR) = FN / (TP + FN).
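The two measures follow directly from the confusion counts. The sketch below also checks them against the industry thresholds quoted earlier in the talk (filtering rate below 25%, malicious coverage above 75%), using the fact that coverage is 1 - MMR. The function names and the example counts are made up for illustration.

```python
def download_rate(tp, fp, total):
    """DR = (TP + FP) / number of instances: the share of URLs passed on for analysis."""
    return (tp + fp) / total


def missing_malicious_rate(tp, fn):
    """MMR = FN / (TP + FN): the share of malicious URLs the filter lets through."""
    return fn / (tp + fn)


def meets_requirements(tp, fp, fn, total, max_dr=0.25, min_coverage=0.75):
    coverage = 1.0 - missing_malicious_rate(tp, fn)
    return download_rate(tp, fp, total) < max_dr and coverage > min_coverage


# Hypothetical counts: 100 malicious URLs among 10,000, 90 of them caught.
print(download_rate(90, 910, 10_000))           # 0.1
print(missing_malicious_rate(90, 10))           # 0.1
print(meets_requirements(90, 910, 10, 10_000))  # True
```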

Evaluation: Efficiency. Environment: CPU: dual-core (3.00GHz); memory: 4GB; OS: CentOS 64-bit. Results: the security company uses an Intel(R) Xeon(TM) CPU 3.00GHz with 8 cores and consumes 200~400 MB of memory (so do we, depending on the dictionary size). With this environment, they can handle an average of 1.6 million samples per hour, about 10% of total traffic, which takes 21.5 min on average. Their results are all around 30% (both DR and MMR).

Evaluation: Performance. Settings: use one hour of data for training/updating; predict the next hour and record the results; compute the daily average of DR and MMR. Results shown for Apr.

Evaluation: Performance (cont.). Results shown for Sep. and Nov.-Dec.

Why Does Collaborative Filtering Work? Both filters reach a certain accuracy; however, their results differ. The download rate of each filter is set to around 10%. Plots (Apr.): average MMR of the descriptive filter and average MMR of the lexical filter.

The Lessons I Learned: How to convert a URL stream into an n-dimensional vector space. How to deal with extremely unbalanced data. Brainstorming to define and extract features. Choosing the right learning algorithm. How to deal with industry. Reference: Malicious URL filtering - A big data application, MS Lin, CY Chiu, YJ Lee, HK Pao, Big Data, 2013 IEEE International Conference on, 589-596.

Fraudulent Item Detection on Ruten Auction (露天拍賣)

Ruten Case Study: Using Machine Learning to Help Reviewers Detect Fraudulent Items. 2018/11/10 Lab of Data Science & Machine Intelligence

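Ruten (露天) Is No. 1 in the Shopping Category

As the saying goes, fame brings trouble (人怕出名豬怕肥): with Ruten doing this well, trouble is bound to come knocking.

Fraud cases keep emerging: sure enough, Ruten has become a target for fraud rings.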

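National Police Agency 165 (anti-fraud hotline) ranking of high-risk marketplaces: the top 10 high-risk marketplaces, January to December of ROC year 104 (2015). Source: National Police Agency 165 statistics.
Rank | Marketplace | Total cases
1 | 露天拍賣 (Ruten Auction) | 1418
2 | 86 小舖 | 824
3 | SHOPPING99 | 804
4 | 奇摩拍賣 | 674
5 | HITO本舖 | 592
6 | 小三美日 | 462
7 | 金石堂網路書店 | 368
8 | 衣芙日系 | 349
9 | 奇摩超級商城 | 317
10 | 樂天 | 217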

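Every online retailer has to fight counterfeits and prevent fraud.

What Mechanisms Does Ruten Have to Prevent Fraud? So does Ruten have any mechanism to stop fraud from happening? It does.

When items are listed on Ruten, all of them pass through a screening mechanism, and the selected items are checked by reviewers. Once an item reaches its product page, anyone who spots a suspected fraudulent item can report it on the product page or directly on the customer service page, and customer service takes the suspicious item down. If a fraudulent item does slip through to a product page and deceives a consumer, the consumer can also report it for customer service to handle. The most important link in this chain is therefore the screening mechanism.

Challenges facing the reviewing experts: the number of listed items is on the order of millions. Even working 24 hours a day all year round, with one person checking one item per second, it would take ten people to get through all the items listed each day.

How Can We Smartly Select the Items That Need Checking? Machine learning: we need an enhanced screening mechanism.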

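Major Challenges (things are not as simple as one might imagine). Extremely imbalanced data: honest sellers are the vast majority and bad actors are very few, but the bad actors upload large numbers of fraudulent items. Fraud rings change their strategies: fraudulent items are diverse and shift with seasons and trends. Collecting and preprocessing the data is difficult: we must work out which data are needed, the needed data are scattered across different sources, and Ruten's traditional data storage system cannot support the required operations. The data volume is huge: millions of items are uploaded every day and must be processed in real time.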

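Get Your Hands Dirty: interview the reviewers and confirm the SOP; confirm the required data sources; confirm the data flow; extract useful data from the different sources; build models from the labelled data accumulated by the reviewers.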

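Reviewer Expertise: Accounts. Taking user accounts as an example, this is how fraud rings pose as legitimate sellers. Example account names: re9n6LG0, 4JEMxCRFwW, jiJo4tpK, 7FCXfrgO, 2Ntla, 5YsJSafSg8, jcshgx, 12pm1c. When the bar for registering a Ruten account is low, fraud rings register zombie accounts in bulk; the account names usually look random, as if generated by a program. They also hijack normal users' accounts by various means; the hijacked accounts have usually been unused for a long time, so they can be identified by comparing the last login date with the listing date.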

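Reviewer Expertise: Payment Method and Location. Taking item location as an example, fraud rings avoid face-to-face handover, so they often set the item location to a remote area; they also dislike cash on delivery as a payment method. How should the IP be used?

Confirming the Data Sources and Data Flow

System architecture (diagram labels): Ruten main database; log files; raw database mirror; user profiles; user features and item features; additional applications; prediction model; labeling; reviewing experts.

How Do We Look at an Item?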

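Data Set. Training set: (251827, 20). Testing set: (109122, 20).
Columns shown: listing date, item category, item quantity, payment method, item price, item location, shipping method, listing method, user account, verified or not, ip, last listing date, first listing or not.
Sample row: 1467043200 | Home & Living (生活、居家) | 999 | credit card | 520 | Tainan City | 7-11 pickup | single-item listing | xxxxx | yes | xxx.xx.xxx.xxx | 2016-06-27 23:59:59 | N
Further row fragments as shown on the slide: Mobile & Communications (手機、通訊), 支付連, 1461, Taipei City, xxxxxxx, not verified; Leisure & Travel (休閒旅遊), 50, cash on delivery, 19, xxx, xx.xxx.xxx.xxx, 2016-06-27 23:59:23.
Speaker note: the actual data look like the above. We finally chose LR (logistic regression) because our experiments showed that LR performed best.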

Leaderboard: AUC

Competition Results. Models: Logistic Regression, SVM, Gradient Boosting Decision Tree, Naïve Bayes. Training time: 5~10 min. Testing time: 1~2 min.

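From Manual Intelligence to Artificial Intelligence (從工人智慧到人工智慧). Reviewers can check only a relatively small share of the data, even when working around the clock. The sampling scheme lets some fraud slip through the net. After several months, the accumulated labelled data should be put to use! We used these labelled data as the "ground truth". Our model achieves 0.983 AUC. 1,000 items can be examined per second. There is hope of examining EVERY item.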
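For readers unfamiliar with the AUC figure quoted here: AUC is the probability that a randomly chosen fraudulent item receives a higher score than a randomly chosen legitimate one. A minimal sketch computing it straight from that definition follows; the labels and scores are made up, and the pairwise loop is quadratic, so it is for illustration rather than for millions of items.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

Because AUC depends only on the ranking of scores, it remains informative on extremely imbalanced data, where plain accuracy would be dominated by the legitimate majority.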

Keep the Experts in the Loop. The reviewers are still important to us! We use our model to rank the "suspicious level" of each item. We can let the reviewers examine the "gray" items, get newly labelled items from them, and re-train our model with the new labelled data. Tango with the bad guys.
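The ranking and "gray item" selection could look roughly like this. The thresholds `low` and `high`, the function name `triage`, and the item ids are hypothetical; the talk does not say how the gray zone was defined.

```python
def triage(scored_items, low=0.2, high=0.8):
    """Rank items by suspicion score; return (clearly suspicious, gray) item ids."""
    ranked = sorted(scored_items, key=lambda item: item[1], reverse=True)
    suspicious = [item_id for item_id, score in ranked if score >= high]
    gray = [item_id for item_id, score in ranked if low <= score < high]
    return suspicious, gray


scores = [("item-a", 0.95), ("item-b", 0.50), ("item-c", 0.05), ("item-d", 0.70)]
suspicious, gray = triage(scores)
# suspicious == ['item-a']; gray == ['item-d', 'item-b'], most suspicious first
```

The labels the reviewers assign to the gray items are exactly the examples the model is least sure about, which makes them the most valuable ones to feed back into re-training.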

Data Science Team in Ruten We are hiring!

Strongly Recommended Books

Final Remarks. "If you torture the data long enough, it will confess to anything." - Ronald H. Coase. Your data analytics results cannot go beyond the laws of nature. Working with domain experts is very important. Know the algorithms you use. Be eager for data.

Questions?

Thank You!

Social Network Services Part of Data Source from Cyberspace

Your Smart Home Part of Data Source from Physical World I am watching you! http://couturedigital.com/Wordpress/wp-content/uploads/2011/03/23-Zone-Smart-Home-Hertfordshire.jpg