Multiple Regression and Model Building


Statistics (統計學): Multiple Regression Model Building

Lecture outline:
The quadratic regression model.
Dummy variables.
Using transformations in regression models.
Collinearity among the independent variables.
Model building.
Pitfalls in multiple regression and ethical considerations.

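Linear multiple regression model. It describes a linear relationship between one variable and several other variables:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i$
Here $\beta_0$ is the population Y-intercept, $\beta_1, \ldots, \beta_p$ are the population slopes, $\varepsilon_i$ is the random error, $Y$ is the dependent (response) variable, and $X_1, \ldots, X_p$ are the independent (explanatory) variables.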

Population multiple regression model (bivariate model). Figure: an observed value of Y compared with its expected value.

Sample multiple regression model (bivariate model).

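Interpretation of the estimated coefficients.
1. The k-th slope coefficient ($\hat{\beta}_k$): with all other X variables held fixed, the average of Y changes by $\hat{\beta}_k$ when $X_k$ changes by one unit. Example: if $\hat{\beta}_1 = 2$, then sales (Y) is expected to increase by 2 for each 1-unit increase in advertising ($X_1$), with the number of sales ($X_2$) held fixed.
2. The Y-intercept ($\hat{\beta}_0$): the average value of Y when all $X_k = 0$.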

The quadratic regression model. The relationship between one response variable and one or more explanatory variables is a quadratic polynomial function. It is useful when the scatter diagram indicates a non-linear relationship. In the quadratic model the second explanatory variable is the square of the first: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$.

Quadratic regression model (continued). Quadratic models may be considered when the scatter diagram takes on one of the following shapes. Figure: four panels of Y against $X_1$, two with $\beta_2 > 0$ and two with $\beta_2 < 0$, where $\beta_2$ is the coefficient of the quadratic term.

Testing for significance: quadratic model.
Testing for an overall relationship: similar to the test for the linear model, with F test statistic = MSR / MSE.
Testing the quadratic effect: compare the quadratic model with the linear model.
Hypotheses: $H_0: \beta_2 = 0$ (no 2nd-order polynomial term); $H_1: \beta_2 \neq 0$ (the 2nd-order polynomial term is needed).
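The quadratic-effect test can be written out as follows. This is a sketch in standard textbook notation rather than the slide's own typesetting: $b_2$ is the sample estimate of $\beta_2$, $S_{b_2}$ its standard error, and $n$ the number of observations, assuming a model with a single predictor and its square.

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i \\
H_0 &: \beta_2 = 0 \qquad H_1 : \beta_2 \neq 0 \\
t &= \frac{b_2}{S_{b_2}}, \qquad \text{reject } H_0 \text{ if } |t| > t_{\alpha/2,\; n-3}
\end{aligned}
```

The degrees of freedom are $n - 3$ because three coefficients are estimated. With additional predictors in the model, the degrees of freedom drop by one for each extra coefficient.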

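Ad size and response, example 1. You work in the advertising department of the Ming Chuan Times (銘傳時報). You want to find the effect of ad size (square centimeters) on the number of reader responses (in hundreds). You have collected the following data:
Response   Ad size   Circulation
1          1         2
4          8         8
1          3         1
3          5         7
2          6         4
4          10        6
Is this model specified correctly? What other variables could be used (color, photo, etc.)?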

Ad size and response, example 1: residual analysis. Figure: observed values compared with expected values. The residuals show no discernible pattern.

Ad size and response, example 1: t test for the quadratic model. Testing the quadratic effect: compare the quadratic model in size with the linear model. Hypotheses: $H_0: \beta_2 = 0$ (no quadratic term in size); $H_1: \beta_2 \neq 0$ (the quadratic term in size is needed).

Ad size and response, example 1: conclusion. Is a quadratic term in size needed to model newspaper replies? Test at $\alpha = 0.05$.
$H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$; df = 3.
Critical values: $t = \pm 3.182$ (rejection regions of area 0.025 in each tail).
Test statistic: $t = 6.2 \times 10^{-15}$.
Decision: do not reject $H_0$ at $\alpha = 0.05$.
Conclusion: there is not sufficient evidence of the need to include a quadratic effect of size on replies.

Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the 廣告與回應1 example.

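Ad size and response, example 2. You work in the advertising department of the Ming Chuan Times (銘傳時報). You want to find the effect of ad size (square centimeters) on the number of reader responses (in hundreds). You have collected the following data:
Response   Ad size   Circulation
1          1         2
4          8         8
1          3         1
3          5         7
2          6         4
4          10        6
5          28        9
Is this model specified correctly? What other variables could be used (color, photo, etc.)?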

Ad size and response, example 2: residual analysis. Figure: observed values compared with expected values. The residuals show a discernible pattern.

Ad size and response, example 2: t test for the quadratic model. Testing the quadratic effect: compare the quadratic model in size with the linear model. Hypotheses: $H_0: \beta_2 = 0$ (no quadratic term in size); $H_1: \beta_2 \neq 0$ (the quadratic term in size is needed).

Ad size and response, example 2: solution. Is a quadratic term in size needed to model newspaper replies? Test at $\alpha = 0.05$.
$H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$; df = 4.
Critical values: $t = \pm 2.776$ (rejection regions of area 0.025 in each tail).
Test statistic: $t = -2.848$.
Decision: reject $H_0$ at $\alpha = 0.05$.
Conclusion: there is sufficient evidence of the need to include a quadratic effect of size on replies.
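The two ad-size tests can be reproduced outside PHStat. The following is a minimal sketch, assuming NumPy is available and that the model regresses response on size and size squared only, which matches the reported degrees of freedom (3 and 4). The function name `quadratic_t_test` is made up for illustration, and the critical values are copied from the slides rather than computed.

```python
import numpy as np

def quadratic_t_test(size, response):
    """Fit response = b0 + b1*size + b2*size^2 by least squares.

    Returns (b2, t statistic for b2, residual degrees of freedom).
    """
    x = np.asarray(size, dtype=float)
    y = np.asarray(response, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    df = len(y) - X.shape[1]                 # n - 3
    mse = resid @ resid / df
    cov = mse * np.linalg.inv(X.T @ X)       # covariance matrix of the estimates
    return b[2], b[2] / np.sqrt(cov[2, 2]), df

# Example 1 data (six ads); example 2 adds a seventh ad with size 28.
size1, resp1 = [1, 8, 3, 5, 6, 10], [1, 4, 1, 3, 2, 4]
size2, resp2 = size1 + [28], resp1 + [5]

b2_1, t1, df1 = quadratic_t_test(size1, resp1)
b2_2, t2, df2 = quadratic_t_test(size2, resp2)

# Critical values as quoted on the slides (t with 3 and 4 df, alpha = 0.05).
reject1 = abs(t1) > 3.182
reject2 = abs(t2) > 2.776
print(df1, reject1, df2, reject2)
```

In example 1 the quadratic coefficient is zero up to floating-point error, which is why the slide reports a test statistic on the order of $10^{-15}$. The exact tiny value will vary by software.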

Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the 廣告與回應2 example.

Heating oil, temperature, and insulation example. Determine whether a quadratic model is needed for estimating the heating oil used by a single-family home in the month of January, based on average temperature (°F) and the amount of insulation in inches.

Heating oil example: residual analysis (continued). Figure: two residual plots. One suggests there may be some non-linear relationship; the other shows no discernible pattern.

Heating oil example: t test for the quadratic model (continued). Testing the quadratic effect: compare the quadratic model in insulation with the linear model. Hypotheses: $H_0: \beta_3 = 0$ (no quadratic term in insulation); $H_1: \beta_3 \neq 0$ (the quadratic term in insulation is needed).

Heating oil example: solution. Is a quadratic term in insulation needed to model monthly consumption of heating oil? Test at $\alpha = 0.05$.
$H_0: \beta_3 = 0$; $H_1: \beta_3 \neq 0$; df = 11.
Critical values: $t = \pm 2.2010$ (rejection regions of area 0.025 in each tail).
Test statistic: $t = 1.6611$.
Decision: do not reject $H_0$ at $\alpha = 0.05$.
Conclusion: there is not sufficient evidence of the need to include a quadratic effect of insulation on oil consumption.

Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the heatingoil example.

Using dummy-variable models. A categorical explanatory variable (dummy variable) has two or more levels, for example yes or no, on or off, male or female, and is coded as 0 or 1. Only the intercepts differ; the model assumes equal slopes across categories. The number of dummy variables needed is (number of levels - 1). The regression model has the same form: $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \varepsilon_i$.

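Dummy-variable-only model example. The Ming-Tong supermarket chain (銘統連鎖超級市場) wants to know whether shelf location affects sales of a pet toy. Within a store, display locations are classified as front, middle, and rear. Six stores are drawn at random from the chain's 18 stores. The same pet toy is placed at a different location in each selected store and moved to another location after one month. Each store runs the trial for three months, and total sales for each month (in units of 10,000 dollars) are recorded. See the file 複迴歸位置影響.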

Dummy-variable-only model example. Given: Y = sales; $X_1$ = front aisle indicator (F = 1 if front aisle, F = 0 otherwise); $X_2$ = middle aisle indicator (M = 1 if middle aisle, M = 0 otherwise).
Front aisle: $X_1 = 1$, $X_2 = 0$. Middle aisle: $X_1 = 0$, $X_2 = 1$. Rear aisle: $X_1 = 0$, $X_2 = 0$, so the intercept is the mean of the rear aisle.

Dummy-variable-only model example: detailed walkthrough using PHStat. PHStat | Regression | Multiple Regression … Excel spreadsheet for the 複迴歸位置影響 example.

Dummy-variable-only model example, diagram 1 (continued). Figure: sales plotted by location.

Dummy-variable-only model example, diagram 2 (continued). Figure: Y (sales) against location. The intercepts differ: front is $b_0 + b_1$, rear is $b_0$, and middle is $b_0 + b_2$.

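Dummy-variable-only model: interpretation. Parameter estimates (unit: 10,000 dollars): $b_0 = 3.733$; $b_1 = 2.333$; $b_2 = -1.667$.
Mean sales for the rear location (the baseline for comparison): $b_0 = 3.733$.
Mean sales for the front location: $b_0 + b_1 = 6.066$.
Mean sales for the middle location: $b_0 + b_2 = 2.066$.
This result agrees with the analysis of variance. The difference between the front and rear means is significant, and so is the difference between the middle and rear means.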
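The link between dummy coefficients and group means can be checked numerically. The sketch below assumes NumPy and uses hypothetical sales figures, since the lecture's data file 複迴歸位置影響 is not reproduced here. It only illustrates that, with rear as the baseline, the intercept is the rear mean and each dummy coefficient is a difference from that mean.

```python
import numpy as np

# Hypothetical monthly sales (10,000 dollars), six observations per location.
front = [6.0, 7.0, 5.5, 6.5, 6.0, 5.0]
middle = [2.0, 2.5, 1.5, 2.0, 3.0, 1.0]
rear = [4.0, 3.5, 4.5, 3.0, 4.0, 3.5]

y = np.array(front + middle + rear)
F = np.array([1] * 6 + [0] * 6 + [0] * 6, dtype=float)  # 1 if front aisle
M = np.array([0] * 6 + [1] * 6 + [0] * 6, dtype=float)  # 1 if middle aisle

# Design matrix: intercept, front dummy, middle dummy (rear is the baseline).
X = np.column_stack([np.ones_like(y), F, M])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print("rear mean   =", b0)
print("front mean  =", b0 + b1)
print("middle mean =", b0 + b2)
```

Only two dummies are used for three locations. Adding a third dummy for the rear aisle would make the columns sum to the intercept column and leave the coefficients undetermined.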

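Model with dummy and quantitative variables. Given: Y = assessed value of the house; $X_1$ = square footage of the house; $X_2$ = desirability of the neighborhood (1 if desirable, 0 if undesirable).
Desirable ($X_2 = 1$): $\hat{Y} = (b_0 + b_2) + b_1 X_1$. Undesirable ($X_2 = 0$): $\hat{Y} = b_0 + b_1 X_1$. The slopes are the same.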

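Model with dummy and quantitative variables: diagram. Figure: Y (assessed value) against $X_1$ (square footage). Two parallel lines with the same slope and different intercepts: $b_0 + b_2$ for the desirable location and $b_0$ for the undesirable location.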

Model with dummy and quantitative variables: detailed walkthrough using PHStat. PHStat | Regression | Multiple Regression … Excel spreadsheet for the 房價與大小鄰居 example.

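Model with dummy and quantitative variables: interpreting the coefficients, 1. It has been reported that male university graduates entering the workforce receive a starting salary about 2,000 dollars (元) higher than comparable female graduates.
Y: salary of university graduates (in thousands of dollars).
Years of service: salary rises by 1.5 per additional year.
Gender dummy: 0 = female, 1 = male.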

Model with dummy and quantitative variables: interpreting the coefficients, 2.

Model with dummy and quantitative variables: interpreting the coefficients, 2 (continued). With the same square footage, a split-level home will have an estimated average assessed value 18.84 thousand dollars higher than a condo. With the same square footage, a ranch home will have an estimated average assessed value 23.53 thousand dollars higher than a condo.

Interaction regression model. It hypothesizes an interaction between pairs of X variables: the response to one X variable varies at different levels of another X variable. The model contains two-way cross-product terms and can be combined with other models, e.g., the dummy-variable model.

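Effect of interaction. Given: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$.
Without the interaction term, the effect of $X_1$ on Y is measured by $\beta_1$. With the interaction term, the effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$, so the effect changes as $X_2$ increases.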

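Interaction example: model and coefficients. $Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$.
With $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$.
With $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$.
Figure: the two lines plotted for $X_1$ from 0 to 1.5 and Y up to 12. The effect (slope) of $X_1$ on Y depends on the value of $X_2$.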

Creating the cross-product term for the interaction regression model. Multiply $X_1$ by $X_2$ to get $X_1X_2$. Run the regression with Y, $X_1$, $X_2$, and $X_1X_2$.
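The cross-product step can be sketched as follows, assuming NumPy. The data are generated without noise from the example equation $Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$, so the fit should simply recover those coefficients. The grid of $X_1$ and $X_2$ values is an arbitrary choice for illustration.

```python
import numpy as np

# X1 takes four values at each level of the 0/1 variable X2.
x1 = np.tile([0.0, 0.5, 1.0, 1.5], 2)
x2 = np.repeat([0.0, 1.0], 4)
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2     # noiseless example model

# Multiply X1 by X2 to create the interaction column, then regress.
x1x2 = x1 * x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

slope_when_x2_is_0 = b[1]           # effect of X1 when X2 = 0
slope_when_x2_is_1 = b[1] + b[3]    # effect of X1 when X2 = 1
print(np.round(b, 6), slope_when_x2_is_0, slope_when_x2_is_1)
```

The two slopes, 2 and 6, match the lines $1 + 2X_1$ and $4 + 6X_1$ in the example.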

Example of a dummy-variable model with cross-product terms.
MALE = 0 if female, 1 if male.
MARRIED = 1 if married, 0 if not.
DIVORCED = 1 if divorced, 0 if not.
MALE•MARRIED = 1 if male and married, 0 otherwise (MALE times MARRIED).
MALE•DIVORCED = 1 if male and divorced, 0 otherwise (MALE times DIVORCED).

Example of a dummy-variable model with cross-product terms (continued).

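Dummy-variable model with cross-product terms: interpretation. Figure: fitted equations for females (single, married, divorced) and for males (single, married), with the male-female difference.
Main effects: MALE, MARRIED, and DIVORCED. Interaction effects: MALE•MARRIED and MALE•DIVORCED.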

Testing the interaction term. Hypothesize an interaction between pairs of independent variables; the model contains 2-way product terms. Hypotheses: $H_0: \beta_3 = 0$ (no interaction between $X_1$ and $X_2$); $H_1: \beta_3 \neq 0$ ($X_1$ interacts with $X_2$).

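Combined application example. The Ming Chuan career counseling center (銘傳就業輔導中心) wants to understand graduates' salaries. A survey yields the salary of 12 alumni together with their years of service and gender. Test for and build a suitable prediction model.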

Combined application example: diagram.

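Combined application: building the model and interpreting the coefficients.
Y: salary, in dollars (元).
年資 (years of service): quantitative variable.
性別 (gender): dummy variable; 1 for male, 0 for female.
年資性別 (years of service times gender): interaction term; 0 for females, the years of service for males.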

Combined application example: detailed walkthrough using PHStat. PHStat | Regression | Multiple Regression … Excel spreadsheet for the 薪資與年資性別 example.

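Combined application example: summary. $b_0 = 18593$; $b_1 = 969$; $b_2 = 867$; $b_3 = 260$.
Average starting salary for women: about 18,593 dollars.
Annual raise for women: about 969 dollars.
Average starting salary for men: about 18593 + 867 = 19,460 dollars.
Annual raise for men: about 969 + 260 = 1,229 dollars.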
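The four figures in the summary follow directly from the fitted equation. Below is a small sketch using the reported estimates. The function name `predicted_salary` is made up for illustration, and 260 is taken to be the coefficient of the years-times-gender interaction term.

```python
# Reported estimates: intercept, years of service, gender dummy, interaction.
B_INTERCEPT, B_YEARS, B_MALE, B_INTERACTION = 18593, 969, 867, 260

def predicted_salary(years, male):
    """Predicted salary in dollars; male is 1 for men and 0 for women."""
    return (B_INTERCEPT + B_YEARS * years
            + B_MALE * male + B_INTERACTION * years * male)

female_start = predicted_salary(0, 0)
male_start = predicted_salary(0, 1)
female_raise = predicted_salary(1, 0) - predicted_salary(0, 0)
male_raise = predicted_salary(1, 1) - predicted_salary(0, 1)
print(female_start, female_raise, male_start, male_raise)
```

Setting the dummy to 0 removes both the gender term and the interaction term, which is why the women's line uses only the intercept and the years-of-service slope.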

Testing the interaction term. Hypothesize an interaction between pairs of independent variables; the model contains 2-way product terms. Hypotheses: $H_0: \beta_3 = 0$ (no interaction between $X_1$ and $X_2$); $H_1: \beta_3 \neq 0$ ($X_1$ interacts with $X_2$).

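Combined application example: testing the interaction. Using $\alpha = 0.05$, test whether gender and years of service interact, that is, whether the annual raise (slope) is the same for men and women.
$H_0: \beta_3 = 0$; $H_1: \beta_3 \neq 0$; df = 8.
Critical values: $t = \pm 2.306$ (rejection regions of area 0.025 in each tail).
Test statistic: $t = 2.988$.
Decision: reject $H_0$ at $\alpha = 0.05$.
Conclusion: there is sufficient evidence that the annual raise (slope) differs between men and women.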

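Combined application model: diagram. Figure: Y (salary) against $X_1$ (years of service). The line for men has intercept $b_0 + b_2$ and slope $b_1 + b_3$; the line for women has slope $b_1$. The slopes also differ, and the difference is $b_3$.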

Using transformations to fit a linear regression. Non-linear models that can be expressed in linear form can be estimated by least squares once in that form, and they require data transformation. Either or both of the independent and dependent variables may be transformed. The choice of transformation can be based on theory, logic, or scatter diagrams.

Log-log transformation of a model that is multiplicative in the independent variables (transformed multiplicative model). Figure: Y against $X_1$; similarly for $X_2$.

Square root transformation. Figure: curve shapes for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$. The transformation turns one of the above models into one that appears linear. It is often used to overcome heteroscedasticity.

Linear-logarithmic transformation. Figure: curve shapes for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$. Transformed from an original multiplicative model.

Log-linear transformation of exponential data (exponential transformation). Figure: the original model for $\beta_1 > 0$ and $\beta_1 < 0$, and the linear form it is transformed into.
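A minimal sketch of the log-linear idea, assuming NumPy and assuming the exponential model has the common single-predictor form $Y = e^{\beta_0 + \beta_1 X_1}\varepsilon$ (the slide's own equations are not available). Taking logs gives $\ln Y = \beta_0 + \beta_1 X_1 + \ln\varepsilon$, which ordinary least squares can fit. The data are hypothetical and noiseless so that the recovered coefficients are easy to check.

```python
import numpy as np

# Hypothetical data generated from Y = exp(0.5 + 0.3 * X) with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(0.5 + 0.3 * x)

# Transform the dependent variable, then fit a straight line to ln(Y).
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

print(round(b0, 6), round(b1, 6))
# Predictions on the original scale come from exponentiating the fitted line.
y_hat = np.exp(b0 + b1 * x)
```

The same pattern applies to the other transformations: build the transformed columns first, then run the usual linear regression on them.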

Interpreting coefficients after transformation, 1.
When the dependent variable is logged (natural log), the coefficient of the independent variable $X_k$ can be approximately interpreted as follows: a 1-unit change in $X_k$ multiplies the estimated average of Y by $\exp(b_k)$.
When the independent variable is logged, the coefficient can be approximately interpreted as follows: a 100 percent change in $X_k$ (a doubling) leads to an estimated change of $b_k \log(2)$ units in the average of Y, with the logarithm in the same base as the transformation.

Interpreting coefficients after transformation, 2 (continued). When both the dependent and independent variables are logged, the coefficient of the independent variable $X_k$ can be approximately interpreted as follows: a 1 percent change in $X_k$ leads to an estimated $b_k$ percent change in the average of Y. Therefore $b_k$ is the elasticity of Y with respect to a change in $X_k$.
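These three readings follow from short derivations. The sketch below uses natural logs and a single predictor $X_k$ with all other variables held fixed; $\hat{Y}$ denotes the fitted value, notation assumed here rather than taken from the slides.

```latex
\begin{aligned}
\text{Logged } Y:\quad & \ln \hat{Y} = b_0 + b_k X_k
  \;\Rightarrow\; \frac{\hat{Y}(X_k + 1)}{\hat{Y}(X_k)} = e^{b_k} \\
\text{Logged } X:\quad & \hat{Y} = b_0 + b_k \ln X_k
  \;\Rightarrow\; \hat{Y}(2X_k) - \hat{Y}(X_k) = b_k \ln 2 \\
\text{Both logged}:\quad & \ln \hat{Y} = b_0 + b_k \ln X_k
  \;\Rightarrow\; \frac{d \ln \hat{Y}}{d \ln X_k} = b_k
\end{aligned}
```

The last line is the definition of elasticity, which is why a 1 percent change in $X_k$ corresponds to roughly a $b_k$ percent change in Y. The approximation is good for small percentage changes.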

Interpreting coefficients after transformation, 3 (continued). If both Y and $X_k$ are measured in standardized form, $Z_Y = (Y - \bar{Y}) / S_Y$ and $Z_{X_k} = (X_k - \bar{X}_k) / S_{X_k}$, the $b_k$ are called standardized coefficients. They indicate the estimated number of standard deviations by which Y will change, on average, when $X_k$ changes by one standard deviation.

Collinearity (multicollinearity).
1. High correlation between explanatory variables.
2. The coefficient of multiple determination measures the combined effect of the correlated explanatory variables.
3. It leads to unstable coefficients (sign changes, large standard errors).
4. It is almost always present; the question is only one of degree.
5. Example: using both age and height in the same model.

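Detecting multicollinearity.
1. Examine the correlation matrix: collinearity is indicated when the correlation between a pair of X variables is stronger than their correlations with Y.
2. Compute the variance inflation factor (VIF): if $VIF_j > 5$, multicollinearity is present.
3. Some remedies: collect new sample data, or remove one of the correlated X variables.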

Correlation matrix (SAS output).
Correlation Analysis
Pearson Corr Coeff / Prob > |R| under H0: Rho = 0 / N = 6
            RESPONSE    ADSIZE      CIRC
RESPONSE    1.00000     0.90932     0.93117
            0.0         0.0120      0.0069
ADSIZE      0.90932     1.00000     0.74118
            0.0120      0.0         0.0918
CIRC        0.93117     0.74118     1.00000
            0.0069      0.0918      0.0
Annotations: $r_{Y1} = 0.90932$, $r_{Y2} = 0.93117$, $r_{12} = 0.74118$; the diagonal entries are all 1.

Variance inflation factors: computer output.
Variable    DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|   Variance Inflation
INTERCEP    1    0.0640               0.2599           0.246               0.8214     0.0000
ADSIZE      1    0.2049               0.0588           3.656               0.0399     2.2190
CIRC        1    0.2805               0.0686           4.089               0.0264     2.2190
$VIF_1 < 5$

Venn diagrams and collinearity. Figure: overlapping circles for Oil, Temp, and Insulation. A large overlap between Temp and Insulation reflects collinearity between them. The large overlap in the variation of Temp and Insulation is used in explaining the variation in Oil but NOT in estimating $\beta_1$ and $\beta_2$.

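Detecting collinearity with the variance inflationary factor (VIF). The VIF is used to measure collinearity: $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing $X_j$ on all the other explanatory variables. If $VIF_j > 5$, $X_j$ is highly correlated with the other explanatory variables.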
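With only two explanatory variables, $R_j^2$ is just the squared correlation between them, so the VIF can be checked by hand. The sketch below uses only the Python standard library and the ad-size and circulation columns from the advertising example. The helper name `pearson_r` is made up for illustration.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

ad_size = [1, 8, 3, 5, 6, 10]
circulation = [2, 8, 1, 7, 4, 6]

r12 = pearson_r(ad_size, circulation)
vif = 1 / (1 - r12 ** 2)   # same value for both variables when there are only two
print(round(r12, 5), round(vif, 3))
```

The result agrees with the correlation of 0.74118 and the variance inflation of about 2.219 in the SAS output, which is below the threshold of 5.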

Detecting collinearity using PHStat. PHStat | Regression | Multiple Regression … Check the "variance inflationary factor (VIF)" box. Excel spreadsheet for the heatingoil example. Since there are only two explanatory variables, only one VIF is reported in the Excel spreadsheet. No VIF is greater than 5, so there is no evidence of collinearity.

Building a multiple regression model. The goal is to develop a good model with the fewest explanatory variables: it is easier to interpret and has a lower probability of collinearity.
Stepwise regression procedure: provides a limited evaluation of alternative models.
Best-subsets approach: uses the $C_p$ statistic and selects a model with a small $C_p$ near $p + 1$.
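The slide does not define the $C_p$ statistic. One common form is Mallows' $C_p$, sketched below under the assumption that $p$ counts the explanatory variables in the candidate model, $SSE_p$ is that model's error sum of squares, $MSE_{\text{full}}$ is the mean square error of the model containing all candidate variables, and $n$ is the sample size.

```latex
C_p = \frac{SSE_p}{MSE_{\text{full}}} - \bigl(n - 2(p + 1)\bigr)
```

If the candidate model has little bias, $SSE_p / MSE_{\text{full}}$ is close to $n - (p + 1)$, so $C_p$ is close to $p + 1$. This is the reason for looking for values near $p + 1$.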

Model building flowchart.
1. Choose $X_1, X_2, \ldots, X_p$.
2. Run a regression to find the VIFs.
3. Is any VIF > 5? If yes and more than one VIF exceeds 5, remove the variable with the highest VIF; if only one exceeds 5, remove that X. Then return to step 2.
4. If no VIF exceeds 5, run a best-subsets regression to obtain the "best" models in terms of $C_p$.
5. Do a complete analysis.
6. Add a curvilinear term and/or transform variables as indicated.
7. Perform predictions.
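The VIF screening loop in the flowchart might be sketched as follows, assuming NumPy. The function names `vifs` and `drop_collinear`, the threshold argument, and the test data are hypothetical. The data are built so that `x2` is nearly twice `x1`. This covers only the collinearity steps, not best-subsets selection.

```python
import numpy as np

def vifs(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (resid @ resid / sst))   # 1 / (1 - R_j^2)
    return np.array(out)

def drop_collinear(X, names, threshold=5.0):
    """Repeatedly remove the variable with the highest VIF while it exceeds the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vifs(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Hypothetical predictors: x2 is almost an exact multiple of x1.
x1 = np.arange(1.0, 9.0)
x2 = 2 * x1 + np.array([0.1, -0.1] * 4)
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
X_all = np.column_stack([x1, x2, x3])

X_kept, kept = drop_collinear(X_all, ["x1", "x2", "x3"])
print(kept, np.round(vifs(X_kept), 3))
```

Recomputing the VIFs after each removal matters, because dropping one variable changes the $R_j^2$ of every remaining variable.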

Overall considerations for multiple regression models, 1. To avoid pitfalls and address ethical issues: understand that the interpretation of each estimated regression coefficient is made holding all other independent variables constant; evaluate residual plots for each independent variable; evaluate interaction terms.

Overall considerations for multiple regression models, 2. To avoid pitfalls and address ethical issues: obtain the VIF for each independent variable and remove variables that exhibit high collinearity with other independent variables before performing a significance test on each independent variable; examine several alternative models using best-subsets regression; use other methods when the assumptions necessary for least-squares regression have been seriously violated.

Lecture summary. Described the quadratic regression model. Addressed dummy variables. Discussed using transformations in regression models. Described collinearity. Discussed model building. Addressed pitfalls in multiple regression and ethical considerations.