Statistics: Multiple Regression Model Building
Outline: the quadratic regression model; dummy variables; using transformations in regression models; collinearity; model building; pitfalls in multiple regression and ethical considerations.
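The linear multiple regression model: a linear relationship between one variable and several other variables. $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i$, where $Y$ is the dependent (response) variable, $X_1, \dots, X_p$ are the independent (explanatory) variables, $\beta_0$ is the population Y-intercept, $\beta_1, \dots, \beta_p$ are the population slopes, and $\varepsilon_i$ is the random error.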
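Population multiple regression model (bivariate model): the observed value is $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ and the expected value is $E(Y_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$.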
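Sample multiple regression model (bivariate model): $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\varepsilon}_i$, with fitted value $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}$.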
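Interpretation of the estimated coefficients. 1. The k-th slope coefficient $\hat{\beta}_k$: with all other X variables held fixed, the average of Y changes by $\hat{\beta}_k$ when $X_k$ changes by one unit. Example: if $\hat{\beta}_1 = 2$, then sales (Y) are expected to increase by 2 for each 1-unit increase in advertising ($X_1$), with the number of sales ($X_2$) held fixed. 2. The Y-intercept $\hat{\beta}_0$: the average value of Y when all $X_k = 0$.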
The Quadratic Regression Model: the relationship between one response variable and one or more explanatory variables is a quadratic polynomial function. It is useful when the scatter diagram indicates a non-linear relationship. Quadratic model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$. The second explanatory variable is the square of the first variable.
Quadratic Regression Model (continued): quadratic models may be considered when the scatter diagram takes one of the following shapes. [Figure: four panels of Y against $X_1$, two with $\beta_2 > 0$ and two with $\beta_2 < 0$.] $\beta_2$ = the coefficient of the quadratic term.
Testing for Significance: Quadratic Model. Testing for the overall relationship is similar to the test for the linear model: F test statistic = MSR / MSE. Testing the quadratic effect compares the quadratic model with the linear model. Hypotheses: $H_0: \beta_2 = 0$ (no 2nd order polynomial term); $H_1: \beta_2 \neq 0$ (2nd order polynomial term is needed).
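Ad size and response, Example 1: You work in the advertising department of the Ming Chuan Times (銘傳時報). You want to find the effect of ad size (square centimeters) on the number of reader responses (in hundreds). The data you collected are as follows:
Response   Ad size   Circulation
1          1         2
4          8         8
1          3         1
3          5         7
2          6         4
4          10        6
Is this model specified correctly? What other variables could be used (color, photo, etc.)?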
Ad size and response, Example 1: Residual Analysis. Comparison of observed and expected values. [Figure: residual plot.] No discernible pattern.
Ad size and response, Example 1: t Test for Quadratic Model. Testing the quadratic effect compares the quadratic model in size with the linear model. Hypotheses: $H_0: \beta_2 = 0$ (no quadratic term in size); $H_1: \beta_2 \neq 0$ (quadratic term is needed in size).
Ad size and response, Example 1 conclusion: Is a quadratic model in size needed for replies to the newspaper? Test at $\alpha = 0.05$. $H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$; df = 3. Critical values: $t = \pm 3.182$ (0.025 in each rejection tail). Test statistic: $t = 6.2 \times 10^{-15}$. Decision: do not reject $H_0$ at $\alpha = 0.05$. Conclusion: there is not sufficient evidence of the need to include a quadratic effect of size on replies.
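The near-zero test statistic can be checked directly from the six ad-size and response values listed for Example 1. The sketch below is only an illustration using NumPy's `polyfit` (the slides themselves use PHStat); it fits the quadratic model in size and shows that the estimated quadratic coefficient is zero up to floating-point noise, so the fitted curve is just the straight line.

```python
import numpy as np

# Example 1 data from the slides: ad size (cm^2) and responses (hundreds).
size = np.array([1, 8, 3, 5, 6, 10], dtype=float)
response = np.array([1, 4, 1, 3, 2, 4], dtype=float)

# np.polyfit returns coefficients from the highest power down: [b2, b1, b0].
b2, b1, b0 = np.polyfit(size, response, deg=2)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.2e}")
```

With a quadratic coefficient this close to zero, the t statistic for $\beta_2$ is also essentially zero, which is why the test does not reject $H_0$.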
Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for 廣告與回應1 (Advertising and Response 1).
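Ad size and response, Example 2: You work in the advertising department of the Ming Chuan Times (銘傳時報). You want to find the effect of ad size (square centimeters) on the number of reader responses (in hundreds). The data you collected are as follows:
Response   Ad size   Circulation
1          1         2
4          8         8
1          3         1
3          5         7
2          6         4
4          10        6
5          28        9
Is this model specified correctly? What other variables could be used (color, photo, etc.)?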
Ad size and response, Example 2: Residual Analysis. Comparison of observed and expected values. [Figure: residual plot.] Discernible pattern.
Ad size and response, Example 2: t Test for Quadratic Model. Testing the quadratic effect compares the quadratic model in size with the linear model. Hypotheses: $H_0: \beta_2 = 0$ (no quadratic term in size); $H_1: \beta_2 \neq 0$ (quadratic term is needed in size).
Ad size and response, Example 2 solution: Is a quadratic model in size needed for replies to the newspaper? Test at $\alpha = 0.05$. $H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$; df = 4. Critical values: $t = \pm 2.776$ (0.025 in each rejection tail). Test statistic: $t = -2.848$. Decision: reject $H_0$ at $\alpha = 0.05$. Conclusion: there is sufficient evidence of the need to include a quadratic effect of size on replies.
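The t statistic for the quadratic term can be reproduced by hand from the Example 2 data. The following is a minimal sketch, assuming the seventh observation is (response 5, size 28) as read from the data slide and that the fitted model contains only size and size squared, which is what df = 4 implies; the variable names are illustrative and the slides obtain the same quantity from PHStat.

```python
import numpy as np

# Example 2 data: the six Example 1 points plus the extra observation (size 28, response 5).
size = np.array([1, 8, 3, 5, 6, 10, 28], dtype=float)
response = np.array([1, 4, 1, 3, 2, 4, 5], dtype=float)

# Design matrix with columns: intercept, size, size^2.
X = np.column_stack([np.ones_like(size), size, size**2])
n, k = X.shape

# Least-squares estimates b = (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ response

# Residual variance on n - k degrees of freedom, then coefficient standard errors.
resid = response - X @ b
df = n - k
mse = resid @ resid / df
se = np.sqrt(mse * np.diag(XtX_inv))

# t statistic for H0: beta_2 = 0 (the quadratic term).
t_quadratic = b[2] / se[2]
print(f"df = {df}, b2 = {b[2]:.5f}, t = {t_quadratic:.3f}")
```

The magnitude of the statistic is then compared with the critical value 2.776 quoted on the slide.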
Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for 廣告與回應2 (Advertising and Response 2).
Heating Oil Example (heating oil use, temperature, and insulation): Determine whether a quadratic model is needed for estimating the heating oil used by a single-family home in the month of January, based on average temperature (°F) and amount of insulation in inches.
Heating Oil Example: Residual Analysis (continued). [Figure: two residual plots, one annotated "may be some non-linear relationship" and the other "no discernible pattern".]
Heating Oil Example: t Test for Quadratic Model (continued). Testing the quadratic effect compares the quadratic model in insulation with the linear model. Hypotheses: $H_0: \beta_3 = 0$ (no quadratic term in insulation); $H_1: \beta_3 \neq 0$ (quadratic term is needed in insulation).
Heating Oil Example: Solution. Is a quadratic model in insulation needed for monthly consumption of heating oil? Test at $\alpha = 0.05$. $H_0: \beta_3 = 0$; $H_1: \beta_3 \neq 0$; df = 11. Critical values: $t = \pm 2.2010$ (0.025 in each rejection tail). Test statistic: $t = 1.6611$. Decision: do not reject $H_0$ at $\alpha = 0.05$. Conclusion: there is not sufficient evidence of the need to include a quadratic effect of insulation on oil consumption.
Detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the heatingoil example.
Dummy Variable Models: a categorical explanatory variable (dummy variable) with two or more levels (yes or no, on or off, male or female) is coded as 0 or 1. Only the intercepts are different; the model assumes equal slopes across categories. The number of dummy variables needed is (number of levels - 1). The regression model has the same form: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i$.
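Dummy-variable-only example: Dummy-Variable Models. The 銘統 (Ming-Tong) supermarket chain wants to know whether the location where goods are displayed affects sales of pet dolls. Display locations within a store are classified as Front, Middle, and Rear. Six stores are randomly selected from the chain's 18 stores. The same pet doll is placed at different locations in each selected store, the location is changed after one month, and each store runs the experiment for three months, recording total sales for each month (in units of 10,000 NT dollars). See file: 複迴歸位置影響.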
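Dummy-variable-only example: Dummy-Variable Models. Given: Y = sales; $X_1$ = Front Aisle (F = 1 if front aisle, F = 0 otherwise); $X_2$ = Middle Aisle (M = 1 if middle aisle, M = 0 otherwise). Coding: Front Aisle ($X_1 = 1$, $X_2 = 0$); Middle Aisle ($X_1 = 0$, $X_2 = 1$); Rear Aisle ($X_1 = 0$, $X_2 = 0$). In the fitted model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, $b_0$ is the mean of the rear aisle.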
Dummy-variable-only example, detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the 複迴歸位置影響 example.
Dummy-variable-only example, illustration 1: Dummy-Variable Models (continued). [Figure: horizontal axis is location.]
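Dummy-variable-only example, illustration 2: Dummy-Variable Models (continued). [Figure: Y (sales) against location; the levels are $b_0 + b_1$ for Front, $b_0 + b_2$ for Middle, and $b_0$ for Rear.] The intercepts are different.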
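Dummy-variable-only example, interpretation: Dummy-Variable Models. Parameter estimates (in units of 10,000 NT dollars): $b_0 = 3.733$; $b_1 = 2.333$; $b_2 = -1.667$. Average sales for the rear section (the baseline for comparison): $b_0 = 3.733$. Average sales for the front section: $b_0 + b_1 = 6.066$. Average sales for the middle section: $b_0 + b_2 = 2.066$. This result agrees with the analysis of variance. The difference between the front and rear means is significant, and so is the difference between the middle and rear means.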
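The link between dummy-variable coefficients and group means can be seen on a small made-up data set. The sketch below does not use the supermarket data (that file is not reproduced in the slides); the sales figures are hypothetical and chosen only so the group means are round numbers. It codes Front and Middle as dummies with Rear as the baseline and fits the model by least squares with NumPy.

```python
import numpy as np

# Hypothetical monthly sales by display location (NOT the data from the slides).
sales = {
    "front": [6.0, 7.0, 8.0],   # mean 7
    "middle": [1.0, 2.0, 3.0],  # mean 2
    "rear": [3.0, 4.0, 5.0],    # mean 4 (baseline category)
}

rows, y = [], []
for location, values in sales.items():
    for value in values:
        # Columns: intercept, X1 = 1 if front, X2 = 1 if middle; rear is (0, 0).
        rows.append([1.0, float(location == "front"), float(location == "middle")])
        y.append(value)

b0, b1, b2 = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)[0]

print(f"rear mean   = b0      = {b0:.3f}")
print(f"front mean  = b0 + b1 = {b0 + b1:.3f}")
print(f"middle mean = b0 + b2 = {b0 + b2:.3f}")
```

Each coefficient on a dummy is the difference between that group's mean and the baseline mean, which is why the regression agrees with a one-way analysis of variance on the same data.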
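Model with dummy and quantitative variables. Given: Y = assessed value of house; $X_1$ = square footage of house; $X_2$ = desirability of neighborhood (0 if undesirable, 1 if desirable). Fitted model: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$. Desirable ($X_2 = 1$): $\hat{Y} = (b_0 + b_2) + b_1 X_1$. Undesirable ($X_2 = 0$): $\hat{Y} = b_0 + b_1 X_1$. Same slopes.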
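Model with dummy and quantitative variables, illustration. [Figure: Y (assessed value) against $X_1$ (square footage); two parallel lines, the desirable-location line with intercept $b_0 + b_2$ and the undesirable-location line with intercept $b_0$.] Same slopes, different intercepts.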
Model with dummy and quantitative variables, detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the 房價與大小鄰居 (house price, size, and neighborhood) example.
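Interpreting coefficients in a model with dummy and quantitative variables, 1: It is reported that male college graduates entering the workforce receive starting salaries about 2,000 NT dollars higher than those of comparable female graduates. Y: salary of college graduates (in thousands of NT dollars). Years of service: salary increases by 1.5 per year. Gender: 0 = female, 1 = male.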
Interpreting coefficients in a model with dummy and quantitative variables, 2
Interpreting coefficients in a model with dummy and quantitative variables, 2 (continued): With the same square footage, a split-level home will have an estimated average assessed value 18.84 thousand dollars higher than a condo. With the same square footage, a ranch home will have an estimated average assessed value 23.53 thousand dollars higher than a condo.
Interaction Regression Model: hypothesizes interaction between pairs of X variables, meaning that the response to one X variable varies at different levels of another X variable. The model contains two-way cross-product terms. It can be combined with other models, e.g., the dummy variable model.
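Effect of Interaction. Given: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$. Without the interaction term, the effect of $X_1$ on Y is measured by $\beta_1$. With the interaction term, the effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$, so the effect changes as $X_2$ increases.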
Interaction Example: $Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$. When $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$. When $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$. [Figure: the two lines plotted as Y against $X_1$.] The effect (slope) of $X_1$ on Y does depend on the value of $X_2$.
Creating the interaction (cross-product) term: multiply $X_1$ by $X_2$ to get $X_1X_2$, then run the regression of Y on $X_1$, $X_2$, and $X_1X_2$.
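The two steps above can be sketched in a few lines. This illustration generates noise-free values from the interaction example equation $Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$ at a handful of arbitrarily chosen points, builds the cross-product column, and recovers the coefficients with NumPy's least-squares solver; real data would of course include random error.

```python
import numpy as np

# Arbitrary design points: X1 on a small grid, X2 as a 0/1 variable.
x1 = np.array([0.0, 0.5, 1.0, 1.5, 0.0, 0.5, 1.0, 1.5])
x2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Noise-free response from the example equation Y = 1 + 2*X1 + 3*X2 + 4*X1*X2.
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Step 1: multiply X1 by X2 to get the interaction column.
x1x2 = x1 * x2

# Step 2: regress Y on X1, X2, and X1*X2 (plus an intercept column).
X = np.column_stack([np.ones_like(x1), x1, x2, x1x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# The slope of X1 depends on X2: b1 when X2 = 0, b1 + b3 when X2 = 1.
slope_when_x2_is_0 = b[1]
slope_when_x2_is_1 = b[1] + b[3]
print(b, slope_when_x2_is_0, slope_when_x2_is_1)
```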
Dummy variables with interaction terms, example: MALE = 1 if male, 0 if female. MARRIED = 1 if married, 0 if not. DIVORCED = 1 if divorced, 0 if not. MALE•MARRIED = MALE times MARRIED = 1 if male and married, 0 otherwise. MALE•DIVORCED = MALE times DIVORCED = 1 if male and divorced, 0 otherwise.
Dummy variables with interaction terms, example (continued)
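Dummy variables with interaction terms, interpretation: group values are given for females (single, married, divorced) and for males (single, married), together with the difference. Main effects: MALE, MARRIED, and DIVORCED. Interaction effects: MALE•MARRIED and MALE•DIVORCED.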
Testing the interaction term: hypothesize interaction between pairs of independent variables; the model contains 2-way product terms. Hypotheses: $H_0: \beta_3 = 0$ (no interaction between $X_1$ and $X_2$); $H_1: \beta_3 \neq 0$ ($X_1$ interacts with $X_2$).
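Comprehensive example: The Ming Chuan career services center wants to understand the salaries of students after graduation. A survey of 12 alumni recorded each one's salary, years of service, and gender. Test for and build an appropriate prediction model.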
Comprehensive example, illustration
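Comprehensive example, model building and coefficient interpretation: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2$, where Y is salary (in NT dollars); $X_1$ is years of service, a quantitative variable; $X_2$ is gender, a dummy variable (male = 1, female = 0); and $X_1 X_2$ is the years-by-gender interaction, equal to 0 for females and to years of service for males.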
Comprehensive example, detailed walkthrough using PHStat: PHStat | Regression | Multiple Regression … Excel spreadsheet for the 薪資與年資性別 (salary, years of service, and gender) example.
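Comprehensive example, summary: $b_0 = 18593$; $b_1 = 969$; $b_2 = 867$; $b_3 = 260$. The average starting salary for females is about 18,593 NT dollars. The annual raise for females is about 969 NT dollars. The average starting salary for males is about 18,593 + 867 = 19,460 NT dollars. The annual raise for males is about 969 + 260 = 1,229 NT dollars.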
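The arithmetic in this summary follows from plugging the gender dummy into the fitted equation. The sketch below hard-codes the four reported estimates (intercept 18593, years-of-service slope 969, gender coefficient 867, interaction coefficient 260); the function name `predicted_salary` is illustrative and not part of the original material.

```python
def predicted_salary(years: float, male: int) -> float:
    """Fitted salary from the reported estimates; male is 1 for men, 0 for women."""
    intercept = 18593       # average starting salary for women
    b_years = 969           # annual raise for women
    b_male = 867            # extra starting salary for men
    b_interaction = 260     # extra annual raise for men (years x gender term)
    return intercept + b_years * years + b_male * male + b_interaction * years * male


female_start = predicted_salary(0, male=0)
male_start = predicted_salary(0, male=1)
female_raise = predicted_salary(1, male=0) - female_start
male_raise = predicted_salary(1, male=1) - male_start

print(female_start, female_raise)  # intercept and slope of the female line
print(male_start, male_raise)      # intercept and slope of the male line
```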
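Comprehensive example, interaction test: at $\alpha = 0.05$, test whether gender and years of service interact, that is, whether the annual raise (slope) is the same for men and women. $H_0: \beta_3 = 0$; $H_1: \beta_3 \neq 0$; df = 8. Critical values: $t = \pm 2.306$ (0.025 in each rejection tail). Test statistic: $t = 2.988$. Decision: reject $H_0$ at $\alpha = 0.05$. Conclusion: there is sufficient evidence that the annual raise (slope) does differ between men and women.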
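Comprehensive example, illustration of the model. [Figure: Y (salary) against $X_1$ (years of service); the male line has intercept $b_0 + b_2$ and slope $b_1 + b_3$, the female line has intercept $b_0$ and slope $b_1$.] The slopes also differ, by $b_3$.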
Using Transformations (transforming data to suit linear regression): either or both of the independent and dependent variables may be transformed. The choice can be based on theory, logic, or scatter diagrams. Non-linear models that can be expressed in linear form can be estimated by least squares in that linear form, but they require data transformation.
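Transformed Multiplicative Model (Log-Log): the original multiplicative model $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$ is transformed into $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$. [Figure: curves of Y against $X_1$; similarly for $X_2$.]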
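Square Root Transformation: $Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$. [Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$.] Transforms one of the above models to one that appears linear. Often used to overcome heteroscedasticity.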
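Linear-Logarithmic Transformation: $Y_i = \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \varepsilon_i$. [Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$.] Transformed from an original multiplicative model.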
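Exponential Transformation (Log-Linear). Original model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$. [Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$.] Transformed into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$.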
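A log-linear fit can be sketched on synthetic data to show that least squares on the transformed response recovers the exponential model's parameters. The values 0.5 and 0.3 below are hypothetical, the data contain one explanatory variable and no error term, and `np.polyfit` stands in for whatever regression tool is actually used.

```python
import numpy as np

# Hypothetical exponential model Y = exp(beta0 + beta1 * X), without noise.
beta0, beta1 = 0.5, 0.3
x = np.arange(6, dtype=float)
y = np.exp(beta0 + beta1 * x)

# Taking logs makes the model linear in X: ln Y = beta0 + beta1 * X.
b1, b0 = np.polyfit(x, np.log(y), deg=1)

# On the original scale, a 1-unit increase in X multiplies Y by exp(b1).
growth_factor = np.exp(b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, factor per unit of X = {growth_factor:.3f}")
```

The multiplicative factor `exp(b1)` is the quantity used when interpreting the coefficient of a model whose dependent variable is logged.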
Interpretation of coefficients after transformation, 1. When the dependent variable is logged, the coefficient of the independent variable $X_k$ can be interpreted approximately as follows: a 1-unit change in $X_k$ multiplies the estimated average of Y by $\exp(b_k)$. When the independent variable is logged, the coefficient can be interpreted approximately as follows: a 100 percent change in $X_k$ leads to an estimated change of $b_k \log(2)$ units in the average of Y.
Interpretation of coefficients after transformation, 2 (continued). When both the dependent and independent variables are logged, the coefficient of the independent variable can be interpreted approximately as follows: a 1 percent change in $X_k$ leads to an estimated $b_k$ percent change in the average of Y. Therefore $b_k$ is the elasticity of Y with respect to a change in $X_k$.
Interpretation of coefficients after transformation, 3 (continued). If both Y and $X_k$ are measured in standardized form, $Y_i^* = (Y_i - \bar{Y}) / S_Y$ and $X_{ki}^* = (X_{ki} - \bar{X}_k) / S_{X_k}$, the resulting $b_k^*$ are called standardized coefficients. They indicate the estimated number of standard deviations by which Y will change, on average, when $X_k$ changes by one standard deviation.
Collinearity (Multicollinearity): 1. High correlation between explanatory variables. 2. The coefficient of multiple determination measures the combined effect of the correlated explanatory variables. 3. Leads to unstable coefficients (sign changes, large standard errors). 4. Almost always present; it is a matter of degree. 5. Example: using both age and height in the same model.
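Detecting Multicollinearity: 1. Examine the correlation matrix: a warning sign is a pair of X variables more highly correlated with each other than with Y. 2. Variance inflation factor (VIF): if $VIF_j > 5$, multicollinearity exists. 3. Some remedies: collect new sample data, or delete one of the correlated X variables.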
Correlation matrix (SAS output). Correlation Analysis: Pearson Corr Coeff / Prob>|R| under HO: Rho=0 / N=6
            RESPONSE   ADSIZE     CIRC
RESPONSE    1.00000    0.90932    0.93117
            0.0        0.0120     0.0069
ADSIZE      0.90932    1.00000    0.74118
            0.0120     0.0        0.0918
CIRC        0.93117    0.74118    1.00000
            0.0069     0.0918     0.0
Here $r_{Y1} = 0.90932$, $r_{Y2} = 0.93117$, and $r_{12} = 0.74118$; the values of 1.00000 are the diagonal entries.
Variance Inflation Factors: Computer Output
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|   Variance Inflation
INTERCEP   1    0.0640               0.2599           0.246               0.8214     0.0000
ADSIZE     1    0.2049               0.0588           3.656               0.0399     2.2190
CIRC       1    0.2805               0.0686           4.089               0.0264     2.2190
$VIF_1 = 2.2190 < 5$
Venn Diagrams and Collinearity. [Figure: Venn diagram of the variation in Oil, Temp, and Insulation.] A large overlap reflects collinearity between Temp and Insulation. The large overlap in the variation of Temp and Insulation is used in explaining the variation in Oil but NOT in estimating $\beta_1$ and $\beta_2$.
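Detecting collinearity (Variance Inflationary Factor): the VIF is used to measure collinearity. $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination from regressing $X_j$ on all the other explanatory variables. If $VIF_j > 5$, $X_j$ is highly correlated with the other explanatory variables.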
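With only two explanatory variables, the $R_j^2$ from regressing one on the other is simply their squared correlation, so the single VIF is $1 / (1 - r_{12}^2)$. The sketch below applies this to the ad size and circulation columns of the six-observation advertising example; the helper names `pearson_r` and `vif_two_predictors` are illustrative, and the two-variable shortcut does not generalize to models with more predictors.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two explanatory variables."""
    r12 = pearson_r(x1, x2)
    return 1.0 / (1.0 - r12**2)


# Ad size and circulation from the six-observation advertising example.
ad_size = [1, 8, 3, 5, 6, 10]
circulation = [2, 8, 1, 7, 4, 6]

r12 = pearson_r(ad_size, circulation)
vif = vif_two_predictors(ad_size, circulation)
print(f"r12 = {r12:.5f}, VIF = {vif:.4f}")
```

The result can be compared with the correlation of 0.74118 and the VIF of 2.2190 shown in the SAS output for the same data.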
Detecting collinearity with PHStat: PHStat | Regression | Multiple Regression … Check the "variance inflationary factor (VIF)" box. Excel spreadsheet for the heatingoil example. Since there are only two explanatory variables, only one VIF is reported in the Excel spreadsheet. No VIF is > 5, so there is no evidence of collinearity.
Model Building: the goal is to develop a good model with the fewest explanatory variables, which is easier to interpret and has a lower probability of collinearity. The stepwise regression procedure provides a limited evaluation of alternative models. The best-subsets approach uses the $C_p$ statistic and selects a model with a small $C_p$ near $p + 1$.
Model Building Flowchart: (1) Choose $X_1, X_2, \dots, X_p$. (2) Run a regression to find the VIFs. (3) If any VIF > 5: when more than one exceeds 5, remove the variable with the highest VIF; when only one does, remove that X; then run the regression again. (4) If no VIF > 5, run best-subsets regression to obtain the "best" models in terms of $C_p$. (5) Do a complete analysis. (6) Add curvilinear terms and/or transform variables as indicated. (7) Perform predictions.
Overall considerations in multiple regression, 1. To avoid pitfalls and address ethical issues: understand that the interpretation of each estimated regression coefficient holds all other independent variables constant; evaluate residual plots for each independent variable; evaluate interaction terms.
Overall considerations in multiple regression, 2. To avoid pitfalls and address ethical issues: obtain the VIF for each independent variable and remove variables that exhibit high collinearity with other independent variables before performing significance tests on each independent variable; examine several alternative models using best-subsets regression; use other methods when the assumptions necessary for least-squares regression have been seriously violated.
Summary of this lecture: described the quadratic regression model; addressed dummy variables; discussed using transformations in regression models; described collinearity; discussed model building; addressed pitfalls in multiple regression and ethical considerations.