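Correlation coefficient: describes the linear association between two variables X and Y. Example: height and weight in data1. How can this linear relationship be quantified?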


Presentation transcript:

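Correlation coefficient: describes the linear association between two variables X and Y. Example: height and weight in data1. How can this linear relationship be quantified? Correlation! Linear correlation!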

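Correlation coefficient. By definition, the correlation between X and Y is $\rho = \mathrm{Cov}(X,Y) / (\sigma_X \sigma_Y)$. Its estimate, Pearson's correlation coefficient, is $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \big/ \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$.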

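Correlation coefficient. r > 0: positively correlated; r < 0: negatively correlated; r = 0: no linear correlation. r = 0 does not mean that X and Y are unrelated; their relationship may simply not be linear, so plotting the data is still necessary.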

Correlation coefficient. R code: cor(x, y, method = c("pearson", "kendall", "spearman")). x: a numeric vector or matrix. y: a numeric vector; may be omitted when x is a matrix.

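Correlation coefficient. To go further and test $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$, the test statistic is $t = r\sqrt{n-2} / \sqrt{1-r^2}$, which follows a t distribution with n-2 degrees of freedom under $H_0$. A 95% confidence interval for $\rho$ can also be computed.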

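Correlation coefficient. R code: cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), exact = NULL, conf.level = 0.95, continuity = FALSE, ...). x: a numeric vector. y: a numeric vector. exact: TRUE or FALSE, whether to compute an exact p-value. continuity: whether to apply a continuity correction. Conclusion from the output: height and weight have a statistically significant positive correlation.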
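A minimal sketch of the two calls above. data1 is not available here, so the height and weight vectors are simulated stand-ins (the names and numbers are assumptions, not the course data). The last lines recompute the t statistic and a Fisher z interval by hand, which is how cor.test builds its interval for Pearson's r.

```r
# Simulated stand-in for data1 (assumption: not the course data)
set.seed(1)
n <- 50
height <- rnorm(n, mean = 165, sd = 8)            # cm
weight <- 0.9 * height - 90 + rnorm(n, sd = 6)    # kg, linearly related to height

r <- cor(height, weight, method = "pearson")      # Pearson's r
res <- cor.test(height, weight, alternative = "two.sided",
                method = "pearson", conf.level = 0.95)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% interval from Fisher's z transform, back-transformed with tanh
ci_manual <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

print(res)
```

Comparing t_manual and ci_manual with res$statistic and res$conf.int is a quick way to see where the printed numbers come from.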

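Practice. Draw the scatter plot of liver against clot in the Surgical data. Can you see the relationship between liver and clot from the plot? Q: Beyond the strength of the correlation, can we see how the variables affect each other? Regression!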

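Linear Regression. Is blood pressure linearly related to weight? How can that linear relationship be described? How can the relationship between blood pressure and weight, sex, and so on be described? Y: response variable, dependent variable (say, bp). X: covariate, explanatory variable, independent variable (say, weight).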

Linear Regression. Q: how does X affect Y? Can we fit a line in the scatter plot? In fact, we should say $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon$ is called the error; $\varepsilon$ is normal with zero mean and variance $\sigma^2$.

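Linear regression using R. R code: lm(formula, data, ...). formula: y ~ x, where y is the response and x is the covariate. In the coefficient table, each t value is the estimate divided by its standard error, e.g. 3.943 = 70.8432/17.9663.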

Linear regression. Confidence interval of $\beta_0$ and $\beta_1$? Use the t distribution with df = n-2. Testing if the coefficient $\beta_1 = 0$? If $\beta_0 = 0$? Use t with df = n-2. An increase of 1 kg in Weight leads to an increase of 0.7167 in Bp. If someone weighs 70 kg, then his/her bp is estimated (with the slope rounded to 0.72) by 70.84 + 0.72 × 70 = 121.24; this is interpolation.
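The steps above can be sketched in R as follows. The blood pressure data are not reproduced in the transcript, so bp and weight below are simulated (an assumption, and the fitted numbers will not match 70.84 and 0.7167).

```r
# Simulated stand-in for the blood pressure data (assumption: not the course data)
set.seed(2)
n <- 40
weight <- runif(n, min = 50, max = 95)        # kg
bp <- 70 + 0.7 * weight + rnorm(n, sd = 8)    # response
dat <- data.frame(bp = bp, weight = weight)

fit <- lm(bp ~ weight, data = dat)
tab <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate / standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# 95% confidence intervals from the t distribution with n - 2 df
ci_manual <- tab[, "Estimate"] +
  outer(tab[, "Std. Error"], c(-1, 1) * qt(0.975, df = n - 2))

# Estimated mean bp at 70 kg (inside the observed range, so interpolation)
pred70 <- predict(fit, newdata = data.frame(weight = 70))
```

confint(fit) returns the same intervals as ci_manual, and predict can also take interval = "confidence" when an interval around the estimate is wanted.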

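Practice. In the Surgical data, we want to know how clot affects liver. Build a regression model of liver on clot. How would you interpret this model? Is the effect of clot on liver significant?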

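Homework. In the Surgical data, we want to know how enzyme affects SVtime. Build a regression model of SVtime on enzyme. How would you interpret this model? Is the effect of enzyme on SVtime significant?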

How good is the regression? How well does the line explain all the variation in y? How well does the fitted correlation of (X, Y) explain Y? Because SSTO = SSE + SSR, define the coefficient of determination $R^2 = SSR/SSTO = 1 - SSE/SSTO$, the percentage of variation explained by the regression line. In simple linear regression, $R^2 = r^2$, where r is Pearson's correlation coefficient. $SSTO = \sum_i (y_i - \bar{y})^2$: total deviation in responses around the grand mean. $SSR = \sum_i (\hat{y}_i - \bar{y})^2$: deviation of fitted values around the grand mean. $SSE = \sum_i (y_i - \hat{y}_i)^2$: deviation of observations around the fitted line.

Example: $R^2$ = 0.4149

ANOVA table of regression: the sum-of-squares column gives SSR and SSE.
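A small sketch of the decomposition, on simulated x and y (assumed data, chosen only for illustration): it computes SSTO, SSR and SSE directly and reads the same numbers off the ANOVA table.

```r
set.seed(3)
x <- rnorm(30)
y <- 1 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)

ssto <- sum((y - mean(y))^2)             # total variation around the grand mean
ssr  <- sum((fitted(fit) - mean(y))^2)   # fitted values around the grand mean
sse  <- sum(resid(fit)^2)                # observations around the fitted line

r2 <- ssr / ssto                         # coefficient of determination
anova(fit)   # "Sum Sq" column: SSR in row x, SSE in row Residuals
```

In this simple regression r2 equals both summary(fit)$r.squared and cor(x, y)^2.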

Practice. In the Surgical data, with the model liver ~ clot, what is the coefficient of determination?

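Diagnostics. Basic assumptions: the residuals have mean 0, constant variance, and are uncorrelated with one another. Look at the distribution of the residuals. Look at the residuals against the index (there should be no relationship). The residuals should be unrelated to the explanatory variables. The residuals should be unrelated to the fitted values.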

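Diagnostics. If the residual plot shows: residuals moving from negative to positive, the model may not be proper; is there a time effect (if x = time)? Residuals randomly scattered around zero: this is the desired pattern. Constant variance is violated, with larger variance at larger X: try adding other X's or weighted LS? Linearity is violated: try a polynomial or transform X?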

Example

Diagnostics in R
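The slide's own code did not survive in the transcript. One way to draw the residual plots described above, sketched on simulated data (an assumption, not the course example):

```r
set.seed(4)
x <- runif(60, 0, 10)
y <- 2 + 0.5 * x + rnorm(60)
fit <- lm(y ~ x)
e <- resid(fit)

op <- par(mfrow = c(2, 2))
plot(e, xlab = "index", ylab = "residual"); abline(h = 0)            # vs. index
plot(x, e, ylab = "residual"); abline(h = 0)                         # vs. predictor
plot(fitted(fit), e, xlab = "fitted", ylab = "residual"); abline(h = 0)
qqnorm(e); qqline(e)                                                 # normality
par(op)
```

Least squares residuals always have mean zero and zero linear correlation with x and with the fitted values, so those numbers cannot reveal a problem; it is the shape of the scatter (trend, funnel, curvature) that the plots are for.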

Diagnostics: plots to examine. (1) The linear effect of each predictor. (2) Constant variance. (3) Independence of samples. (4) Normality assumption: Q-Q plot. (5) Other important predictors? (6) Are there outliers: scatter plot, ... If so, examine whether the point is a true outlier or a gross error. If it is a true outlier, collect more data near this point. If not, delete the data point before regression analysis. Suggested order: 6 → fitted model → 2 → 3 → 1 → 4 → 5.

Multiple linear regression. Extension of SLR that includes more than one predictor in the model. Linear? Linear? Difference?

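Multiple linear regression. Model: $Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$, $i = 1, \dots, n$. $\beta_0, \dots, \beta_{p-1}$: regression coefficients. $X_{i1}, \dots, X_{i,p-1}$: observed data. The errors $\varepsilon_i$ are independent $N(0, \sigma^2)$. In matrix form: $Y = X\beta + \varepsilon$.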

Multiple linear regression. Which terms can be put into X? Predictors: such as weight, age, sex in the example. Transformations of predictors. Polynomials: e.g. $X$ and $X^2$. Dummy variables and factors. Interactions and other combinations of predictors: e.g. $X_1 X_2$.

Example

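Inference of regression coefficients. As in SLR, use the least squares method: $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. The least squares estimators satisfy the Gauss-Markov theorem.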

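Inference of regression coefficients. As in SLR, to build a confidence interval for $\beta$ or to carry out a test, we first need to estimate $\sigma^2$. Recall that in SLR $\hat{\sigma}^2 = MSE = SSE/(n-2)$. Here $\hat{Y} = X\hat{\beta} = HY$ with $H = X(X^\top X)^{-1}X^\top$; H is called the hat matrix. SST = SSE + SSR, and $\hat{\sigma}^2 = MSE = SSE/(n-p)$. There are p-1 covariates in the regression model. There are n observations and p parameters.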
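The matrix formulas can be checked numerically against lm. This is a sketch on simulated weight, age and bp values (assumed data); the variable names only echo the predictors mentioned earlier.

```r
set.seed(5)
n <- 50
weight <- rnorm(n, 70, 10)
age <- rnorm(n, 45, 12)
bp <- 60 + 0.6 * weight + 0.3 * age + rnorm(n, sd = 5)

X <- cbind(1, weight, age)                          # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, bp))   # solves (X'X) b = X'y
H <- X %*% solve(crossprod(X)) %*% t(X)             # hat matrix
p <- ncol(X)                                        # number of parameters

sse <- sum((bp - H %*% bp)^2)
sigma2_hat <- sse / (n - p)                         # MSE

fit <- lm(bp ~ weight + age)
```

Here sum(diag(H)) equals p, and sqrt(sigma2_hat) is the residual standard error that summary(fit) reports.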

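Inference of regression coefficients. To judge how well the whole model fits: under $H_0: \beta_1 = \dots = \beta_{p-1} = 0$, $E(MSR) = \sigma^2$; otherwise $E(MSR) > \sigma^2$. Define $F = MSR/MSE$, with df = (p-1, n-p). Under $H_0$, F should be close to 1, so if F is much larger than 1 we tend to reject $H_0$. What is $H_1$?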

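Inference of regression coefficients. For a particular $X_j$, to know whether it has a linear relationship with Y, test $H_0: \beta_j = 0$. Under $H_0$, $t = \hat{\beta}_j / se(\hat{\beta}_j)$ follows a t distribution with n-p degrees of freedom, so reject $H_0$ if $|t| > t_{1-\alpha/2,\, n-p}$. Can you derive the confidence interval of $\beta_j$ from this?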

Example

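Practice. In the Surgical data, we want to know which factors affect survival time (SVtime); take the natural log of survival time. The factors of interest are clot, prog, enzyme, and age. Write down this regression model. Is the coefficient of prog 0? Is this model significant?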

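Homework. The bodyfat data contain 4 variables (Y, X1, X2, X3). Draw the scatter plots of Y against X1, X2, and X3 separately. Is Y linearly related to X1, X2, X3? Test separately whether the regression coefficients of X1, X2, X3 are 0. Is this model significant?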

Inference of regression coefficient. Estimation: least squares estimator; normal assumption for interval estimates. Testing: for the overall model, F-test; for a single $\beta_j$, t-test.
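A sketch of both tests on simulated data (assumed, not the Surgical data): the overall F statistic is built from MSR and MSE, and the per-coefficient t-tests and adjusted R-squared are read from summary.

```r
set.seed(6)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)          # x3 has no effect by construction
fit <- lm(y ~ x1 + x2 + x3)
p <- length(coef(fit))                          # parameters, including the intercept

sse <- sum(resid(fit)^2)
ssr <- sum((fitted(fit) - mean(y))^2)
F_manual <- (ssr / (p - 1)) / (sse / (n - p))   # MSR / MSE
p_overall <- pf(F_manual, p - 1, n - p, lower.tail = FALSE)

s <- summary(fit)
s$coefficients        # one t-test per coefficient, df = n - p
s$adj.r.squared       # adjusted R-squared
```

F_manual matches s$fstatistic["value"], the number printed on the last line of summary(fit).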

Example

Example

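Practice. In the Surgical data, we want to know which factors affect survival time (SVtime); take the natural log of survival time. The factors of interest are clot, prog, enzyme, and age. Write down this regression model. What is the adjusted $R^2$ of this model? Is the coefficient of prog 0? Is this model significant?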

Simultaneous tests for partial coefficients. Testing several parameters simultaneously is equivalent to comparing two regression models, one containing all covariates and the other containing fewer covariates. Use "extra sums of squares" to distinguish the two models.

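Extra sums of squares. For two regression models, model A: $Y = \beta_0 + \beta_1 X_1 + \varepsilon$ and model B: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$, the SSEs will be different. The difference is defined as the extra sum of squares, $SSR(X_2 \mid X_1) = SSE(X_1) - SSE(X_1, X_2)$: the effect of adding X2 to a model that already contains X1. Similarly, the extra sum of squares $SSR(X_3 \mid X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3)$.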

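What if using SSR? From SSTO = SSR + SSE, thus $SSR(X_2 \mid X_1) = SSR(X_1, X_2) - SSR(X_1)$. Similarly, the extra sum of squares $SSR(X_3 \mid X_1, X_2) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2)$: the effect of adding X3 to a model that already contains X1 and X2.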

Example. SSE decreases by 320.77 and SSR increases by 320.77; this is the extra sum of squares SSR(X4 | X1, X2, X3).

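Use extra sums of squares to test a partial set of coefficients. The test statistic is $F^* = \dfrac{[SSE(R) - SSE(F)] / (df_R - df_F)}{SSE(F) / df_F}$, where R is the reduced model and F the full model. Ex: p-value = 0.0083. The numerator can be written using the extra sum of squares. When only 1 coefficient is considered in the test (as in this case), 9.44 = (3.073)^2, i.e. $F^* = t^2$.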
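The partial F-test can be sketched as a comparison of two nested lm fits. The data are simulated (an assumption); anova on the two fits reports the same extra sum of squares and F.

```r
set.seed(7)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 0.8 * x2 + rnorm(n)

fitA <- lm(y ~ x1)           # reduced model
fitB <- lm(y ~ x1 + x2)      # full model
sseA <- sum(resid(fitA)^2)
sseB <- sum(resid(fitB)^2)

extra_ss <- sseA - sseB                                   # SSR(X2 | X1)
F_star <- (extra_ss / 1) / (sseB / df.residual(fitB))     # one added coefficient
t_x2 <- summary(fitB)$coefficients["x2", "t value"]

cmp <- anova(fitA, fitB)     # same extra sum of squares and F
cmp
```

With a single added coefficient, F_star equals t_x2^2, which is the F* = t^2 relation noted on the slide.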

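Practice. In the Surgical data, we want to know which factors affect survival time (SVtime); take the natural log of survival time. Model A contains clot, prog, enzyme, and gender. If the interactions of gender with prog and with enzyme are added, should the interactions be included in the model?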

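Example. 2008 stray dog data for the counties and cities of Taiwan, including Penghu County: stray dog handling figures for each county/city from 1999 to 2008, plus other indicators for each county/city in 2008.
city: name of the county/city
captured: cumulative number of stray dogs captured
adoptedR: proportion of captured stray dogs that were adopted
killedR: proportion of captured stray dogs that were euthanized
unknownR: proportion of captured stray dogs whose status is unknown
population: population of the county/city in 2008
graduate: number of people with a graduate degree in the county/city
farmArea: proportion of the county/city's total area that is farmland
divorced: proportion of divorced persons in the county/city
unemployed: unemployment rate of the county/city
crimeR: number of criminal cases per 100,000 people in the county/city
oldR: proportion of annual expenditure spent on elderly welfare
computerR: average number of computers per 100 households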

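Example. Use adoptedR (adoption proportion) as the response. First drop two variables: city (county/city name) and unknownR (unknown-status proportion), which is highly correlated with the response adoptedR. Then put all explanatory variables into the model:
    model0 = lm(adoptedR ~ captured + killedR + population + graduate + farmArea + divorced + unemployed + crimeR + oldR + computerR, data = dogs2, x = TRUE)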

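Stepwise regression, method 1: use the step function with the AIC criterion to select variables stepwise. Syntax: step(lm object, direction = "both", k = 2). direction can be "forward", "backward", or "both"; "both" means that a variable added to the model may still be removed later, and a removed variable may still be added back later. k = 2 uses the model's AIC as the selection criterion; with k = log(n), where n is the sample size, BIC is used instead.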

Stepwise regression, method 2: use the step function with the BIC criterion to select variables stepwise.
    summary(step(model0, k = log(nrow(dogs2)), direction = "both"))
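Since the stray dog data are not included in the transcript, here is a sketch of both step calls on a simulated data frame d (assumed data in which only x1 and x2 truly matter). trace = 0 only suppresses the step-by-step printout.

```r
set.seed(8)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)     # only x1 and x2 matter

full <- lm(y ~ x1 + x2 + x3 + x4, data = d)
by_aic <- step(full, direction = "both", k = 2, trace = 0)             # AIC
by_bic <- step(full, direction = "both", k = log(nrow(d)), trace = 0)  # BIC

formula(by_aic)
formula(by_bic)
```

step returns the selected lm object, so summary(by_bic) gives the usual coefficient table for the chosen model.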

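Stepwise regression, method 3: use the regsubsets function from the leaps package. Syntax: regsubsets(X, y, nbest = k1, nvmax = k2, method). X is the matrix containing all explanatory variables; y is the response vector; nbest = k1 keeps the k1 best models among candidate models with the same number of explanatory variables; nvmax = k2 limits candidate models to at most k2 explanatory variables. method can be "forward", "backward", "exhaustive" (all possible subsets), or "seqrep".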

Stepwise regression (A) Forward selection
    library(leaps)
    out.forward = regsubsets(as.matrix(dogs2[-2]), y = dogs2$adoptedR, nbest = 1, method = "forward")
    s.forwd = summary(out.forward)
    # List R2 (rsq), SSE (rss), adjusted R2 (adjr2), Cp, and BIC for the candidate models
    round(cbind(s.forwd$which, rsq = s.forwd$rsq, adjr2 = s.forwd$adjr2, rss = s.forwd$rss, cp = s.forwd$cp, bic = s.forwd$bic), 2)

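Stepwise regression. In the output above, each row is one candidate model; a 1 under a column means that column's explanatory variable appears in that row's candidate model, and a 0 means it does not. By the smallest BIC, -21.8, the best forward selection model is model 3, with explanatory variables killedR, unemployed, and crimeR; its R2 is about 77.53%.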

Stepwise regression (B) Backward selection
    out.backward = regsubsets(as.matrix(dogs2[-2]), y = dogs2$adoptedR, nbest = 1, method = "backward")
    s.back = summary(out.backward)
    # List R2 (rsq), SSE (rss), adjusted R2 (adjr2), Cp, and BIC for the candidate models
    round(cbind(s.back$which, rsq = s.back$rsq, adjr2 = s.back$adjr2, rss = s.back$rss, cp = s.back$cp, bic = s.back$bic), 2)

Stepwise regression. By the smallest BIC, -25.74, the best backward selection model is model 6, with explanatory variables captured, graduate, farmArea, crimeR, oldR, and computerR; its R2 is about 87.43%.

(C) All possible subset selection
    out.all = regsubsets(as.matrix(dogs2[-2]), y = dogs2$adoptedR, nbest = 1, method = "exhaustive")
    s.all = summary(out.all)
    # List R2 (rsq), SSE (rss), adjusted R2 (adjr2), Cp, and BIC for the candidate models
    round(cbind(s.all$which, rsq = s.all$rsq, adjr2 = s.all$adjr2, rss = s.all$rss, cp = s.all$cp, bic = s.all$bic), 2)

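All possible subset selection. By the smallest BIC, -25.74, the best model from all possible subset selection is model 6, with explanatory variables captured, graduate, farmArea, crimeR, oldR, and computerR; its R2 is about 87.43%. The identify function can be used with a Cp plot of the all-possible-subsets results to click on the best model interactively. The x-axis of the Cp plot is the number of regression coefficients in each candidate model, and the y-axis is the corresponding Cp value.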

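All possible subset selection
    q = as.vector(rowSums(s.all$which))   # number of regression coefficients
    q
    plot(q, s.all$cp, xlim = c(0, 8), ylim = c(0, 18))
    abline(a = 0, b = 1)
    identify(q, s.all$cp)
The Cp value of candidate model 4 lies closest to the 45-degree line, so it has the best Cp value.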

Stepwise regression (D) Sequential replacement
    out.step = regsubsets(as.matrix(dogs2[-2]), y = dogs2$adoptedR, nbest = 1, method = "seqrep")
    s.step = summary(out.step)
    # List R2 (rsq), SSE (rss), adjusted R2 (adjr2), Cp, and BIC for the candidate models
    round(cbind(s.step$which, rsq = s.step$rsq, adjr2 = s.step$adjr2, rss = s.step$rss, cp = s.step$cp, bic = s.step$bic), 2)

Stepwise regression. By the smallest BIC, -24.06, the best model chosen by sequential replacement is model 7, with explanatory variables captured, graduate, farmArea, divorced, crimeR, oldR, and computerR; its R2 is about 84%.
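To make the selection criteria concrete without the leaps package, the following brute-force sketch fits every subset of three simulated predictors (assumed data) and computes Mallows' Cp as SSE/MSE_full - (n - 2q), with q the number of coefficients, together with BIC from the stats function BIC. The BIC values may be on a different scale from the bic column printed by regsubsets.

```r
set.seed(9)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 1.5 * d$x1 + d$x2 + rnorm(n)
preds <- c("x1", "x2", "x3")

mse_full <- summary(lm(y ~ ., data = d))$sigma^2   # MSE of the full model

# All non-empty subsets of the predictors
subsets <- unlist(lapply(seq_along(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)

res <- do.call(rbind, lapply(subsets, function(v) {
  fit <- lm(reformulate(v, response = "y"), data = d)
  q <- length(coef(fit))                           # coefficients incl. intercept
  sse <- sum(resid(fit)^2)
  data.frame(model = paste(v, collapse = "+"), q = q,
             cp = sse / mse_full - (n - 2 * q), bic = BIC(fit))
}))
res[order(res$bic), ]
```

Plotting res$q against res$cp with abline(a = 0, b = 1) gives the same kind of Cp plot as above; the full model always sits exactly on the line.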

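Outliers: leverage. The h_ii values help detect outliers with respect to the explanatory variables x: if h_ii exceeds 2p/n, observation i may be an outlier, where p-1 is the number of explanatory variables. The hatvalues function computes the model's leverages. Code:
    2*7/23
    (hii = hatvalues(model1))
    which(as.vector(hii) > 2*7/23)
The result shows that observation 22 may be an outlier with respect to the explanatory variables.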

Outliers: Cook's distance. If the Cook's D value d_i exceeds 1 (Cook and Weisberg, 1982), observation i may be an outlier. The cooks.distance function computes d_i. Code:
    cooks.distance(model1)
    which(as.vector(cooks.distance(model1)) > 1)
By Cook's D, no d_i exceeds 1, so there are no outliers.

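Outliers: studentized residuals. Code:
    (root.MSE = summary(model1)$sigma)   # MSE^0.5
    hii = hatvalues(model1)
    student.residual = resid(model1)/(root.MSE*sqrt(1 - hii))
    which(as.vector(abs(student.residual)) > 2.5)
From the residuals, observations 1, 3, 9, 16, and 20 have studentized residuals larger than 2.5 in absolute value and may be outliers.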

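Outliers: studentized residuals. You can also plot the studentized residuals against the fitted values and use the identify function to click on possible outliers interactively.
    plot(fitted(model1), student.residual)
    abline(h = 0)
    identify(fitted(model1), student.residual)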

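Outliers: studentized deleted residuals. If there are no outliers, this statistic follows a t(n-p-1) distribution (p = number of regression coefficients, including beta0). For the stray dog data, n = 23 and p = 7, so n-p-1 = 23-7-1 = 15. If the absolute value of a studentized deleted residual exceeds t(0.95, 15), the observation may be an outlier. In addition, because outlier detection usually checks all observations at once, the Bonferroni-corrected level α* is generally recommended: α*/2 = α/(2n), where n is the sample size. Code:
    jackknife.residual = rstudent(model1)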

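Outliers: studentized deleted residuals. (1) First use α = 0.1 without the Bonferroni correction:
    qt(0.95, 15)
    which(as.vector(abs(jackknife.residual)) > qt(0.95, 15))
This flags observations 9, 15, 17, and 20 as possible outliers. (2) With the Bonferroni correction:
    0.1/(2*23)
    qt(0.1/(2*23), 15, lower.tail = FALSE)
    which(as.vector(abs(jackknife.residual)) > 3.354188)
With the Bonferroni correction, no residual exceeds the cutoff. Note that the corrected α is inversely proportional to the sample size n, so the critical value becomes larger and the detection becomes conservative: an observation needs a very large residual to be flagged as an outlier.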

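Outliers: studentized deleted residuals. Plot the studentized deleted residuals against the fitted values and use the identify function to click on possible outliers interactively.
    plot(fitted(model1), jackknife.residual)
    abline(h = 0)
    identify(fitted(model1), jackknife.residual)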
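The outlier checks above can be tried end to end on simulated data with one planted outlier (assumed data; model1 and the stray dog data are not available in the transcript). The hand-computed studentized residual is the quantity that rstandard returns.

```r
set.seed(10)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[5] <- y[5] + 8                     # planted outlier in the response
fit <- lm(y ~ x)
p <- length(coef(fit))               # regression coefficients, incl. intercept

hii <- hatvalues(fit)
root_mse <- summary(fit)$sigma
student_res <- resid(fit) / (root_mse * sqrt(1 - hii))   # studentized residuals
jack_res <- rstudent(fit)                                # studentized deleted residuals

alpha <- 0.1
cut_plain <- qt(1 - alpha / 2, df = n - p - 1)                         # no correction
cut_bonf  <- qt(alpha / (2 * n), df = n - p - 1, lower.tail = FALSE)   # Bonferroni

which(abs(jack_res) > cut_plain)
which(abs(jack_res) > cut_bonf)
```

The Bonferroni cutoff is always the larger of the two, which is why it flags fewer points.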

Influential observations. Influential points can be detected with leverages, Cook's D, DFBETAS, DFFITS, and similar measures. Leverages (h_ii):
    2*7/23
    (hii = hatvalues(model1))
    which(as.vector(hii) > 2*7/23)
The result shows that observation 22 may be an influential point.

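Influential observations: Cook's D. Cook's D measures whether removing each observation changes the regression coefficient estimates substantially. The F(0.5, p, n-p) quantile is commonly used as the threshold.
    cooks.distance(model1)
    qf(0.5, 7, 16)
    which(as.vector(cooks.distance(model1)) > qf(0.5, 7, 16))
No observation has a Cook's D above 0.9457994, so there are no influential points with respect to changes in the coefficient estimates.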

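Influential observations: DFBETAS. Like Cook's D, DFBETAS uses the size of the change in the regression coefficient estimates to detect influential points. If a DFBETAS value is larger than 2, the corresponding observation may be an influential point.
    dfbetas(model1)
    which(dfbetas(model1) > 2)
DFBETAS finds no influential points.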

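Influential observations: DFFITS. DFFITS measures the effect of removing an observation on the fitted value of the response. If a DFFITS value is larger than 2, the corresponding observation may be an influential point.
    dffits(model1)
    which(dffits(model1) > 2)
Observation 20 is an influential point.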

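Influential observations. On the plot, click the red circles with the mouse to show the corresponding observation numbers.
    influencePlot(model1)
Observations 15 and 20 are picked out as influential points.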
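A base R sketch of the four influence measures on simulated data with one planted high-leverage point (assumed data). Unlike the calls above, DFBETAS and DFFITS are compared in absolute value here, so that large negative values are flagged as well; that is a choice made for this sketch.

```r
set.seed(11)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
x[n] <- 8; y[n] <- -10               # planted point: high leverage, far off the line
fit <- lm(y ~ x)
p <- length(coef(fit))

cd <- cooks.distance(fit)
which(hatvalues(fit) > 2 * p / n)            # leverage rule 2p/n
which(cd > qf(0.5, p, n - p))                # Cook's D vs. F(0.5, p, n - p)
which(abs(dffits(fit)) > 2)                  # DFFITS
which(apply(abs(dfbetas(fit)), 1, max) > 2)  # largest DFBETAS per observation
```

The planted observation has by far the largest Cook's D, which is what an interactive plot such as influencePlot would show as the biggest circle.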

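Collinearity. If the explanatory variables are severely collinear, some of them will have large VIF values. A common rule is that a VIF above 10 indicates a possible collinearity problem. The vif function in the car package computes the VIF of each explanatory variable in the model:
    library(car)
    vif(model1)
    mean(vif(model1))
Although the VIFs of graduate, oldR, and computerR exceed the mean VIF, no explanatory variable has a VIF above 10, so collinearity in this model is not severe.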
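For intuition, the VIF of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below computes it by hand in base R on simulated predictors (assumed data, with x2 built to be strongly correlated with x1).

```r
set.seed(12)
n <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)     # strongly correlated with x1
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
which(vif_manual > 10)                  # the rule of thumb used above
```

The same values appear on the diagonal of the inverse correlation matrix, diag(solve(cor(X))).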