(7) Data, Privacy and Metrics 数据,隐私和测量


1 (7) Data, Privacy and Metrics 数据,隐私和测量
The Networked Economy: Information Management, Strategy, and Innovation 网络经济:信息管理, 战略, 和创新 (7) Data, Privacy and Metrics 数据,隐私和测量

2 Agenda 议程 Role of data in decision making 数据在决策中的地位
Size and cost of storage 数据存储的规模和成本

3 River Nile 尼罗河

4 Notre Dame 巴黎圣母院

5 From Faith to Data 从盲从到数据
The Era of Faith
  Massive investments into cathedrals etc.
  Unclear ROI (return on investment)
  No feedback, or l_o_n_g feedback cycle
The Era of Data
  Massive investments into measuring, networking, communication, storage
  ROI measurable
  Short feedback cycle
  Experiments

6 Characteristics of our Era 当前时代特征
What do we do with data?
  Gather data
  Explore data
  Exploit data
  Publish data
  Archive data
What does it mean?
Too much of it…
Will it integrate with my systems?
How can I act on it?
Opportunities and challenges for marketers, publishers, agencies…

7 Measuring information and storage 信息度量和存储
Name | Abbr. | Bytes | Equals | Example
byte* | B | 10^0 | - | One character
kilobyte | kB | 10^3 | 1,000 bytes | A short email message
megabyte | MB | 10^6 | 1 million bytes | Text of a book; a high-resolution image
gigabyte | GB | 10^9 | 1,000 MB | A CD
terabyte | TB | 10^12 | 1,000 GB | Storage on laptops in this room
petabyte | PB | 10^15 | 1,000 TB | Size of web
exabyte | EB | 10^18 | 1 million TB | -
zettabyte | ZB | 10^21 | - | -
* Relationship between byte (B) and bit (b): 1 byte = 8 bits

8 Measuring information and storage 信息度量和存储
Name | Abbr. | Bytes | Equals | Example | Length scale | Comparison
byte | B | 10^0 | - | One character | 1 nm | Atom: 0.1 nm
kilobyte | kB | 10^3 | 1,000 bytes | A short email message | 1 μm | Hair: 50 μm thick
megabyte | MB | 10^6 | 1 million bytes | Text of a book; a high-resolution picture | 1 mm | Floppy disk: 1 mm thick
gigabyte | GB | 10^9 | 1,000 MB | A CD | 1 m | -
terabyte | TB | 10^12 | 1,000 GB | Storage of laptops in this room | 1 km | Mt Everest/Qomolangma: 8.8 km high
petabyte | PB | 10^15 | 1,000 TB | Size of web | 1,000 km | -
exabyte | EB | 10^18 | 1 million TB | - | 10^6 km | To moon: 0.4 x 10^6 km
zettabyte | ZB | 10^21 | - | - | 10^9 km | To sun: 0.2 x 10^9 km
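The table's decimal prefixes can be applied mechanically. The following is a minimal sketch, assuming SI units in steps of 1,000 as in the table; the function names `format_bytes` and `to_bits` are illustrative and not part of the slides.

```python
# Decimal (SI) prefixes as used in the table: each step is a factor of 1,000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def format_bytes(n):
    """Express a byte count using the largest prefix that keeps the value >= 1."""
    value = float(n)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1,000,
        # or at the last unit in the table (zettabyte).
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


def to_bits(n_bytes):
    """1 byte = 8 bits, as stated under the table."""
    return n_bytes * 8


print(format_bytes(1))             # 1 B
print(format_bytes(30 * 10**12))   # 30 TB
print(format_bytes(10**15))        # 1 PB
print(to_bits(1000))               # 8000
```

Binary prefixes (steps of 1,024) would give slightly different numbers; the slides use the decimal convention throughout.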

9 Internet
Surface web = static pages
  10 billion pages
  10 kB … 100 kB / page
  -> 100 TB … 1 PB total storage
Deep web
  10x size of surface web
Email
  3 billion accounts
  10 emails / day / account
  -> 30 billion emails / day
  1 kB / email
  -> 30 TB traffic per day
  -> 100 petabyte / year (at 1 kB per email, 30 TB / day x 365 days is only about 11 PB; the 100 PB figure corresponds to roughly 10 kB per email)
Storage cost (2008 ASW)
  1 petabyte = USD 100k
Usenet
  73 terabytes of Usenet per year
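The back-of-envelope estimates on this slide can be reproduced in a few lines. This is a sketch using only the slide's own inputs (10 billion pages, 10 to 100 kB per page, 3 billion accounts, 10 emails per day, 1 kB per email); the function names are illustrative. Note that with these inputs the yearly email volume comes to about 11 PB, so the slide's 100 PB per year appears to assume a larger average email.

```python
KB, TB, PB = 10**3, 10**12, 10**15


def surface_web_bytes(pages, bytes_per_page):
    """Total storage for a given number of static pages."""
    return pages * bytes_per_page


def email_traffic(accounts, emails_per_day, bytes_per_email):
    """Return (bytes per day, bytes per year) of email traffic."""
    per_day = accounts * emails_per_day * bytes_per_email
    return per_day, per_day * 365


low = surface_web_bytes(10 * 10**9, 10 * KB)
high = surface_web_bytes(10 * 10**9, 100 * KB)
per_day, per_year = email_traffic(3 * 10**9, 10, 1 * KB)

print(f"surface web: {low / TB:g} TB to {high / PB:g} PB")   # 100 TB to 1 PB
print(f"deep web (10x): {10 * low / PB:g} to {10 * high / PB:g} PB")
print(f"email: {per_day / TB:g} TB/day, {per_year / PB:g} PB/year")
```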

10 Turning behavior into data 将行为转换为数据
Revealed preferences 显示出的偏好 Music 音乐 Search  搜索 Online trading 在线交易 Online dating 网上交友

11 Additional sources of data about people 人类的其它数据来源
Movement 移动 Mobile phones 手机 GPS 全球卫星定位系统

12 "Pay as you drive" insurance

13 Everything can and will become data 任何东西都够能且一定会变成数据
Additional sources of data about people
  Movement
    Mobile phones, GPS
  Brain activity
    Neuromarketing
    fMRI analysis of response to stimuli
RFIDs (Radio Frequency Identifiers)
  Unique identifiers for objects, bridging physical and digital
Comment from Liu Jun: There are 3 billion base pairs, not just one billion, in the human genome.

14 RFIDs and e-business 电子标签技术和电子商务
Facts
  Price: 2 US cents
  Size: 2 mm
  It will happen: big business
Opportunities
  Inventory systems, supply chain
    Wal-Mart saves USD 8 billion per year by using RFIDs (preliminary estimate)
    Shipping screw-ups: 1 in 20
  Personalization
Fears
  Loss of privacy
  Abuse of data
-> Consumers need to be educated to make informed, conscious decisions about their data
This level of transparency is "native" in e-business

15 Aspects of privacy 隐私的不同方面
Information: name, address, hobbies…
Communication: phone calls, email, SMS…
Territory: privacy of your office, home, bedroom…
Bodily privacy: strip searches, drug tests…

16 Some privacy concerns 隐私顾虑
Collection and storage
  Extensive amounts of personally identifiable data are collected and stored
Unauthorized secondary use
  Information collected from individuals for one purpose is used for another purpose without authorization from the individuals
Improper access
  Data about individuals are available to people not authorized to view or work with these data
Combining data
  Personal data in disparate databases may be combined into larger databases
*Source: Smith, H.J., Milberg, S.J., Burke, S.J., "Information Privacy: Measuring Individuals' Concerns About Organizational Practices", MIS Quarterly, June 1996

17 Errors in personal data 个人数据错误
People worry that protections against errors in personal data are inadequate
  Errors may be accidental or deliberate
People increasingly demand access to their personal information
Revealed preferences often differ from stated preferences
Perception often matters more than objective facts
Question: Describe processes by which people can correct errors in data about themselves

18 Different people have different privacy concerns 不同的人有不同的隐私顾虑
[Chart: consumer segments plotted by profile revelation and identity revelation, each axis ranging from "Never" through "Under certain conditions" to "Always"]
Privacy fundamentalists (unwilling to reveal any information): 30%
Profiling averse (unwilling to reveal personal information): 26%
Identity concerned (unwilling to reveal identity): 20%
Marginally concerned (no concerns): 24%

19 Privacy becoming increasingly more relevant 隐私越来越重要
Personal information becomes ubiquitous with electronic transactions
Personal information is at the core of privacy
Privacy is a fundamental right that has been recognized by democratic societies across centuries and across geographies
Privacy is a proven customer concern
Privacy breaches are increasingly a relevant social cost (including for companies)
As companies have begun to treat customer information as an asset, people learn to consider their information as an asset

20 People trust less in way companies deal with their data 人们逐渐对公司如何利用他们的数据产生怀疑
Survey statement: "Most businesses handle the personal information they collect about consumers in a proper and confidential way."
[Chart: percentage who strongly or somewhat disagree, by year; the values are not preserved in this transcript]
*Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

21 People feel increasingly less protected 人们越来越觉得没有受到应有的保护
Survey statement: "Existing laws and organizational practices provide a reasonable level of protection for consumer privacy today."
[Chart: percentage who strongly or somewhat disagree, by year; the values are not preserved in this transcript]
*Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

22 Privacy backlash could have a considerable impact on a company's bottom line.
Survey question: "If you were to hear or read that a company with which you were a customer was collecting, sharing or using customers' personal information in a way you did not think was proper, which one of the following best describes what you would do?"
  Stop doing business with the company
  Reduce business with the company
  Continue doing business with the company because it does not matter to me
[Chart: percentage choosing each answer; the values are not preserved in this transcript]
*Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

23 Professionals are significantly less concerned about privacy issues when they are asked as professionals than when they are asked in private.*
[Chart: importance ratings on a scale from 1 = unimportant to 7 = extremely important; the values are not preserved in this transcript]
*Source: Esrock, S.L., Ferré, J.P., "A Dichotomy of Privacy: Personal and Professional Attitudes of Marketers", Business and Society Review, 104:1, 1999, pp

24 Privacy Principles by US Federal Trade Commission (FTC) 美国联邦贸易委员会(FTC)提出的5个原则来确保尊重隐私
Notice / Awareness
  Detailed advice to visitors on your policies with respect to the personal data you process: what data is collected by whom, who it is shared with, what it is used for, consequences of refusal to provide data, …
Choice / Consent
  Giving consumers options as to how information collected may be used, especially with respect to "secondary uses"; opt-in versus opt-out debate and granularity of the privacy choices given
Access / Participation
  Letting people about whom you have information find out what that information is, and contest its accuracy and completeness if they believe it is wrong
Enforcement / Redress
  Comply with the privacy laws in a country, subscribe to an industry code of practice or participate in a privacy seal program, ...
Integrity / Security
  Data must be accurate and secure. The data collector must use only reputable sources of data and cross-reference data against multiple sources, provide consumer access to data, and destroy untimely data or convert it into anonymous form.

25

26 Storage is free
Or: Money makes the world go round
Dramatic drop in price (2008: 1 GB costs 10 US cents)
Exponential increase in storage
[Charts: cost per gigabyte in USD, by year; hard disk storage capacity, by year, with axis marks at PB and 10 exabyte]

27 sina.com Oct 8, 1997 (web.archive.org) 新浪1997年10月8日

28 Internet Archive 互联网历史档案
Stores versions of surface web since 1996
Collected via opt-out
1 TB / day raw data
1 petabyte stored total

29 Explicit vs. implicit data collection: Why now?
[Chart: data collected per day over time, 1990 to 2010, explicit (surveys etc.) vs. implicit (clicks etc.)]
  Data collected implicitly: dramatic growth over time
  Data collected explicitly: amount constant over time
[Chart: cost over time, 1990 to 2010, for storage and communication]
Notes:
  What a great opportunity for marketing!
  Explicit: rate items. Self-personalize: myyahoo: only 20%.
  Plus processing power.
  Bottleneck used to be data and algorithms, now we need constraints! What actions are possible? What is trivial? What is useless? What is valuable?
  Communication -> fast feedback loop
  So, what's hard now? To make sense out of it!
  Time scales: even rabbits take a while

30 Explicit vs. implicit data collection: Why now?
[Charts: data collected per day, explicit (surveys etc.) vs. implicit (clicks etc.), 1990 to 2010; cost of storage and communication, 1990 to 2010]
Malthus's Law of Information:
  New information content is doubling every year
  Time spent on information consumption is constant

31 Why now? Communication
Malthus's Law of Information:
  New information content is doubling every year
  Time spent on information consumption is constant

32 Voice over IP (VOIP) 网络电话
IP := Internet Protocol
Traditional phones are on their way out
Example: skype
  skype -> skype: free
  skype -> phone: 1c / min
  Concurrent users (3/06): 5M
Why is it so inexpensive?
[Chart: number of local IP telephony users in the US, in millions, including forecast]

33 Nr of words transmitted vs cost of transmission (US 1960-1980) 传输量与传输成本(美国 1960-1980)
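[Chart: cost of transmission per 1,000 words (in 1972 US dollars) vs. words produced per year in the US (in trillions)]
Media shown: radio, television, newspapers, magazines, cable TV, direct mail, books, movies, education, telephone, mail, data communication, facsimile, telegraph, telex, mailgram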

34 Large e-business company: Amount of data created per year 大型电子商务公司年均数据产量
Level | New data per year
Customer | 100 MB
Orders | 10 GB
Session aggregates | 1 TB
Clicks | 100 TB
More: Site instrumentation
  JavaScript (mouse movement, scrolling)
Notes:
  Same as google logs: 100G / day
  Largest lab for people data
  Vision summary: e.g., to compute convergences; "Information age"; interaction effects
  Visit level: could talk about visit level (daily aggregates)
  Why? Add benefits

35 The iterative process of modeling and decision making 建模和决策的互动过程
Define: business metrics, objectives and baselines
Measure: collect, store, manage data
Describe: exploratory data analysis
Predict and evaluate: probabilistic models
Decide, act, and evaluate
[Cycle labels: Re-(design, analyze, generalize); Learn]
Notes:
  1) Baseline: what does it mean to do well?
  Instrument the site
  Control
  Me: obsessed with evaluation

36 1.Business metrics and objectives 1.业务度量和目标
Stock price
Profit
Number of items sold
Number of visits
Conversion rate
Customer acquisition
Customer retention
Customer satisfaction
Trade-offs between metrics
  Own inventory? Marketplace? New categories?
Discuss: number of clicks per visit
Notes:
  Writing papers
  To increase transparency of business (and, of course, return on investment)
  Customer delight

37 2. Measure
Customer behavior
  Orders
  Overall use of the site
    Buying vs selling
    Searching vs browsing
    Engagement: reviews, etc.
Customer-company interactions
  Customer service contacts: email, phone
  Surveys
    Satisfaction
    Intentions and goals
Company behavior
  Customer service response
    Resolution
    Free replacement, refund
  Delivery date: actual vs promised
  Number of items returned in a search
  Email campaigns and responses
Think what your company can collect! Different sources! Touchpoints
More: competitors' prices

38 Why is it hard? … and store
Even simple behavioral analysis requires significant infrastructure
Reporting: cost center
Behavioral analysis, predictive modeling and action (e.g., recommendations): profit center
Amazon's data production rate is comparable to that of satellite television

39 Business questions 商业问题
How many people are coming to my site?
Who are they?
Where are they coming from?
What are they doing?
Who's coming back and how frequently?
How is all of this changing over time?
What is the impact of a recent site change?

40 Twyman's Law
Any statistic that appears interesting is almost certainly a mistake
Validate "amazing" discoveries in different ways
  They are usually the result of a business process
Examples
  5% of customers were born on the exact same day (including year)
    11/11/11 is the easiest way to satisfy the mandatory birth date field
  For US Web sites, there will be a small sales increase on Oct 4, 2008, for European Nov xx
    For Oct 29, 2006 it's both Europe and the US. For starting DST, the dates are different.
Notes:
  Don't forget to change your batteries: more than 90 percent of homes in the United States have smoke detectors, but one-third are estimated to have dead or missing batteries.
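One way to catch the birth-date artifact described on this slide is to look for single values that account for an implausibly large share of a field. The following is a minimal sketch on made-up records; the function name `suspicious_values` and the 2% threshold are assumptions for illustration, not something the slides prescribe.

```python
from collections import Counter


def suspicious_values(values, threshold=0.02):
    """Return (value, share) pairs whose share of all records exceeds threshold."""
    counts = Counter(values)
    total = len(values)
    return [(v, c / total) for v, c in counts.most_common() if c / total > threshold]


# Hypothetical data: 95 distinct birth dates plus five customers who typed
# 11/11/11 to get past a mandatory field.
birth_dates = ["11/11/11"] * 5 + [
    f"{1 + i % 12:02d}/{1 + i % 28:02d}/{60 + i % 30:02d}" for i in range(95)
]

print(suspicious_values(birth_dates))  # [('11/11/11', 0.05)]
```

A flagged value is a prompt to check the business process that produced it (here, the form design), not proof of an error.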

41 Some experiences
Synchronize clocks from all data collection points
  Example: Some servers were set to GMT and others to Pacific time, leading to strange anomalies
  Even being a few minutes off can cause add-to-carts to appear "prior" to the search
Remove test data
  QA organizations constantly test the system
  Make sure the data can be identified and removed from analysis
Remove robots/bots/spiders
  5-40% of e-commerce site traffic is generated by crawlers from search engines (and students doing problem sets)
  These significantly skew results unless removed
Notes:
  Some people bought fairly expensive products for less than 5 cents. Note this is an example of a multi-variate anomaly. It is OK for some products (e.g. gum) to be 5 cents, but not for other products.
  26 different ways of spelling Mitsubishi! Use drop-down lists instead of free text fields.
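The three cleaning steps above can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the event fields (`server`, `ts`, `user_agent`, `is_test`), the server names, the fixed UTC offsets (which ignore daylight saving time), and the substring-based bot check, which a production system would replace with a maintained bot list.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from collection point to its clock's UTC offset in hours.
SERVER_UTC_OFFSET_HOURS = {"web-gmt": 0, "web-pacific": -8}
# Assumed user-agent substrings that mark crawlers.
BOT_MARKERS = ("bot", "crawler", "spider")


def to_utc(event):
    """Interpret the event's local timestamp in its server's offset, return UTC."""
    offset = timezone(timedelta(hours=SERVER_UTC_OFFSET_HOURS[event["server"]]))
    local = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=offset)
    return local.astimezone(timezone.utc)


def is_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def clean(events):
    """Drop test and bot traffic, then order the rest on a common clock."""
    kept = [e for e in events if not e["is_test"] and not is_bot(e["user_agent"])]
    return sorted(kept, key=to_utc)


events = [
    {"server": "web-pacific", "ts": "2006-01-15 13:02:00", "action": "add-to-cart",
     "user_agent": "Mozilla/4.0", "is_test": False},
    {"server": "web-gmt", "ts": "2006-01-15 21:00:05", "action": "search",
     "user_agent": "Mozilla/4.0", "is_test": False},
    {"server": "web-gmt", "ts": "2006-01-15 21:01:00", "action": "search",
     "user_agent": "ExampleBot/1.0", "is_test": False},
    {"server": "web-gmt", "ts": "2006-01-15 21:01:30", "action": "order",
     "user_agent": "Mozilla/4.0", "is_test": True},
]

# Sorting on the raw strings would put add-to-cart (13:02) before the search.
print([e["action"] for e in clean(events)])  # ['search', 'add-to-cart']
```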

42 Picking the right visualization is key to seeing patterns 选择正确的形象是识别特征的关键
Traffic by day
  Easy to see weekends
  Difficult to see other patterns
Heat map
  Shows traffic colored from green to yellow to red
  Utilizes cyclical nature of the week
  Note 9/3 (Labor Day) and 9/11
Notes:
  Explain the heat map. Note that Fridays are generally weaker.
  The next version of Office (Office 2007) has heat maps.
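The heat map relies on arranging daily traffic into one row per week and one column per weekday, so that weekly cycles line up vertically. The following is a minimal sketch of that arrangement with made-up visit counts; the function name `week_grid` is illustrative, and coloring the cells is left to whatever plotting tool is used.

```python
from datetime import date, timedelta


def week_grid(daily_counts):
    """Arrange {date: visits} into rows of weeks (Monday to Sunday).

    Days without data stay None.
    """
    days = sorted(daily_counts)
    start = days[0] - timedelta(days=days[0].weekday())  # Monday of the first week
    n_weeks = (days[-1] - start).days // 7 + 1
    grid = [[None] * 7 for _ in range(n_weeks)]
    for day, visits in daily_counts.items():
        offset = (day - start).days
        grid[offset // 7][offset % 7] = visits
    return grid


# Hypothetical traffic for seven days starting on a Wednesday.
first_day = date(2007, 8, 29)
traffic = {first_day + timedelta(days=i): 100 + i for i in range(7)}

for row in week_grid(traffic):
    print(row)
# [None, None, 100, 101, 102, 103, 104]
# [105, 106, None, None, None, None, None]
```

With the rows stacked this way, a holiday such as Labor Day shows up as one odd cell in an otherwise regular Monday column.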

43 Business-level lessons 业务层次上的经验
Collect operational business data
  Data usually not in web logs
    Searches
    Response times to return results
    Shopping cart events
    Registration forms
  External events
    Marketing promotions
    Site changes
Choose to collect as much data as you realistically can, because you do not know what might be relevant for a future question
Consider privacy issues
  Often aggregated or anonymous data suffices

44 Collection example – Form Errors 数据收集例子-格式错误
Here is a good example of data collection that was introduced without knowing a priori whether it would help: form errors
If a web form was filled in and a field did not pass validation, log the field and the value
This was the Bluefly home page when they went live
Looking at form errors, we saw thousands of errors every day on this page
Any guesses?
Notes:
  People filled in search keywords into the box that says "sign up for email." Easy to fix.
  By the way, search has to be on the home page. Amazon also made this mistake when it went live: there was no search box on the home page.
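The logging rule on this slide (when a field fails validation, record the field and the value) can be sketched as follows. The validator, the field names and the logger name are hypothetical; the slides do not describe how the site implemented it. Logged values can contain personal data, so the privacy considerations from earlier slides apply to such a log as well.

```python
import logging
from collections import Counter

logger = logging.getLogger("form_errors")  # hypothetical logger name


def validate_form(form, validators, error_counts):
    """Return True if every field passes; log field and value for each failure."""
    ok = True
    for field, is_valid in validators.items():
        value = form.get(field, "")
        if not is_valid(value):
            logger.warning("form_error field=%s value=%r", field, value)
            error_counts[field] += 1
            ok = False
    return ok


# Hypothetical sign-up box with a deliberately crude email check.
validators = {"email": lambda v: "@" in v}
counts = Counter()

validate_form({"email": "red dress"}, validators, counts)      # search keywords
validate_form({"email": "black boots"}, validators, counts)    # search keywords
validate_form({"email": "ann@example.com"}, validators, counts)

print(counts.most_common())  # [('email', 2)]
```

Reading the logged values, not just the counts, is what reveals that visitors mistook the box for a search field.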

45 Summary
Think about the problem end-to-end
  Collection
  Transformations
  Reporting
  Visualizations
  Modeling
  Taking action
Agree on terminology
  How do you define a session?
  How do you define a customer (e.g., did every customer make a purchase)?
Beware of hidden variables when concluding causality
  E.g., Simpson's paradox
Conduct controlled experiments (A/B tests) when possible: our intuition is poor
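Simpson's paradox, mentioned in the summary, is easiest to see with numbers. The sketch below uses made-up conversion counts for two site versions observed on different mixes of visitors (not a randomized test): version A converts better within each segment, yet B looks better overall because the segment is a hidden variable.

```python
# Hypothetical (conversions, visits) per site version and visitor segment.
data = {
    "A": {"new": (81, 87), "returning": (192, 263)},
    "B": {"new": (234, 270), "returning": (55, 80)},
}


def rate(conversions, visits):
    return conversions / visits


def overall(version):
    """Conversion rate with the segments pooled together."""
    conversions = sum(c for c, _ in data[version].values())
    visits = sum(v for _, v in data[version].values())
    return rate(conversions, visits)


for segment in ("new", "returning"):
    a, b = rate(*data["A"][segment]), rate(*data["B"][segment])
    print(f"{segment}: A {a:.1%} vs B {b:.1%}")
print(f"overall: A {overall('A'):.1%} vs B {overall('B'):.1%}")
```

Randomly assigning visitors to A or B, as in a controlled experiment, balances the segment mix between versions and removes this reversal.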

46 Weblog entry
[29/Jun/2006:13:38: ] "GET / HTTP/1.1" 200 17497 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
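An entry like this can be split into fields with a regular expression. The sketch below assumes the common "combined" access log layout (host, identity, user, timestamp, request, status, size, referrer, user agent). The slide's entry is missing its host and the end of its timestamp, so the sample line fills those with clearly hypothetical values (a documentation IP address and an invented seconds and time zone part).

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical complete line: the host and ":00 -0700" are invented.
sample = (
    '192.0.2.1 - - [29/Jun/2006:13:38:00 -0700] "GET / HTTP/1.1" 200 17497 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

entry = parse_line(sample)
print(entry["path"], entry["status"], entry["size"])  # / 200 17497
```

The "-" referrer means the visitor arrived without following a link, and the user agent string is what bot filtering would inspect.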

47 APPENDIX 附录

48 Books and Libraries 书和图书馆
30 million books
  1 MB per book (text)
    100 characters per line
    100 lines per page
    100 pages per book
  -> 30 TB for text
1 petabyte = $1 million
Are books the right medium for archiving?
  Digital storage cheap: $10k for all
  Scanning expensive: $10 per book
  But: $300M in one shot, then done forever!
  Scanning all books is half a year of the Library of Congress's budget
Books in print: 3.2 million
Books sold in US in 1999: 1.1 billion
Notes:
  Fits in a box. Gets you a house… well, a garage in SF.

49 Internet
Web
  Surface web = static pages
    10 billion pages
    10 kB … 100 kB / page
    -> 100 TB … 1 PB total storage
  Deep web = underlying databases
    10x size of surface web
    -> 1 … 10 petabyte
Email
  1 billion accounts
  10 emails / day / account
  -> 10 billion emails / day
  1 kB / email
  -> 10 TB traffic per day
  -> 30 petabyte / year (at 1 kB per email, 10 TB / day x 365 days is only about 3.7 PB; the 30 PB figure corresponds to roughly 10 kB per email)
Comparison (banner ads)
  4 billion ads / day served by DoubleClick
Usenet
  73 terabytes of Usenet per year

50 Information production 信息产量
Surveillance
  30 exabyte / year
    30M cameras
    3 frames / sec -> 100M pics / sec
    10 kB / pic -> 1 TB / sec
    100k secs / day -> 100 petabyte / day
  One day of production of surveillance cameras
    = 1 year of all email traffic
    = 100+ years of data stored by Amazon.com
Mail (US only)
  197 billion pieces
  733 pieces per year per person
  150 petabytes per year (counting duplicates)
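The surveillance estimate chains four multiplications, and it is worth checking that the rounded steps still land near the headline number. This sketch uses the slide's inputs but the exact 86,400 seconds per day instead of the rounded 100k; the function name is illustrative.

```python
PB, EB = 10**15, 10**18


def surveillance_bytes(cameras, frames_per_sec, bytes_per_pic, secs_per_day=86_400):
    """Return (pictures per second, bytes per day, bytes per year)."""
    pics_per_sec = cameras * frames_per_sec
    per_day = pics_per_sec * bytes_per_pic * secs_per_day
    return pics_per_sec, per_day, per_day * 365


pics, per_day, per_year = surveillance_bytes(
    cameras=30 * 10**6, frames_per_sec=3, bytes_per_pic=10 * 10**3
)

print(f"{pics / 10**6:g}M pictures per second")   # 90M (slide rounds to 100M)
print(f"{per_day / PB:.1f} PB per day")           # 77.8 (slide rounds to 100)
print(f"{per_year / EB:.1f} EB per year")         # 28.4 (slide: 30)
```

The unrounded result of about 28 EB per year agrees with the slide's 30 exabyte figure, so the rounding at each intermediate step roughly cancels out.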

51 Information production 信息产量
20 exabyte / year flow through telephone, internet, radio, TV
  Telephone
    10^13 minutes per year (worldwide, 2005 estimate)
    10 exabyte per year
5 exabyte / year of new data was produced and stored (the year is missing in this transcript)
  Corresponds to 1 GB per person (worldwide) per year
  Corresponds to 10 meters of printed books per person per year
After removing duplicates, 1 exabyte of new information per year
  Corresponds to one million new libraries per year…
  … or one large library per minute
US is 40% of world total

52 Information
Digital: 80…95%
  Mainly magnetic (hard disks)
  Very little CD, DVD ("optical")
Non-digital: 5…20%
  Mainly film
  Little paper (0.01%)

53 Most information is produced by individuals 大部分信息由个人生产
Most information is created by individuals, not institutions
  Telephone calls, email, printouts, photos
  We don't know how to organize it
Note: Paper consumption is growing, but most is printed off digital media
  Office documents and mail outnumber books, newspapers and journals
  North Americans consume 24 reams (11,916 sheets) of paper annually; the European Union consumes 15 reams (7,280 sheets); the world average is 1,500 sheets each
  In the US, at least half of this paper is used to produce office documents, mostly computer printouts

54 Film
Photos
  80 billion shots per year
  2,700 shots per second
  80 percent of US households have a camera
  15 percent of Chinese households
  China is 2nd largest market
  70% of US purchasers say they will buy digital next time
Movies
  4,250 movies per year worldwide
X-rays
  2 billion X-rays per year
Film is less than 10% of total
Estimates based on sales of film materials

55 Paper
Books
  1 million titles per year (UNESCO)
Newspapers
  23,000 published per year (25 terabytes)
Scholarly journals (Ulrich's)
  40,000 published per year
Magazines (Ulrich's)
  80,000 published per year
Printing overview
  Paper is less than 0.01% of total
  Office documents are mainly printouts of digital
  … growing!

