Dr. Baokun Li 经济实验教学中心 商务数据挖掘中心 优化输出和创建新变量 Dr. Baokun Li 经济实验教学中心 商务数据挖掘中心
优化 SAS 输出 TITLE 语句 – 为Proc程序块的输出作标记 LABEL 语句 – 为变量名称作标记 FORMAT 语句 – 为变量的数值作标记
Number of Patients by Clinic 标准输出 The FREQ Procedure Cumulative Cumulative clinic Frequency Percent Frequency Percent ----------------------------------------------------------- A 18 18.00 18 18.00 B 29 29.00 47 47.00 C 36 36.00 83 83.00 D 17 17.00 100 100.00 优化输出 Number of Patients by Clinic Clinical Center clinic Frequency Percent Frequency Percent ---------------------------------------------------------------- Birmingham 18 18.00 18 18.00 Chicago 29 29.00 47 47.00 Minneapolis 36 36.00 83 83.00 Pittsburgh 17 17.00 100 100.00 Output from programs 5 and 6.
Patient Report Headaches 标准输出 The FREQ Procedure Cumulative Cumulative sebl_6 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 70 70.00 70 70.00 2 23 23.00 93 93.00 3 6 6.00 99 99.00 4 1 1.00 100 100.00 优化输出 Patient Report Headaches ------------------------------------------------------------- None 70 70.00 70 70.00 Mild 23 23.00 93 93.00 Moderate 6 6.00 99 99.00 Severe 1 1.00 100 100.00
TITLE 语句 PROC FREQ DATA=tdata; TABLES clinic group sex educ sebl_1 sebl_6; TITLE 'Distribution of Selected Variables'; TITLE2 'on the TOMHS Dataset' ; RUN; TITLE 语句可以放在程序的任何地方。最好是放在过程 步里面 title可随时改变 使用格式为:TITLE(n)‘文字'
Label 语句可以放在数据步或过程步内的任何地方(但不可在中间)。 LABEL clinic = 'Clinical Center'; LABEL group = 'Drug Treatment Group'; LABEL educ = 'Highest Education Attained'; LABEL sebl_1 = 'Patient Report Drowsiness'; LABEL sebl_6 = 'Patient Report Headaches'; Label 语句可以放在数据步或过程步内的任何地方(但不可在中间)。 If in datastep will apply to all procedures. If only a PROC then will apply only to that PROC.
FORMAT brthdate mmddyy10. ; FORMAT clinic $clinic. ; FORMAT group group. ; FORMAT fever headache se. ; 告诉SAS 根据format(格式)显示变量的数值 Format 语句可以放在数据步或过程步内的任何地方 既有内置format (例如日期),也有用户定义的format. 一个format 可用于多个变量 Format以句号结束 (.) 字符型format以$开始
VALUE group 1 = 'Beta Blocker' 2 = 'Calcium Channel Blocker' 如何定义FORMAT PROC FORMAT; VALUE group 1 = 'Beta Blocker' 2 = 'Calcium Channel Blocker' 3 = 'Diuretic' 4 = 'Alpha Blocker' 5 = 'ACE Inhibitor' 6 = 'Placebo'; VALUE gender 1 = 'Men' 2='Women' ; VALUE se 1 = 'None' 2 = 'Mild’ 3 = 'Moderate' 4 = 'Severe'; Format名称并不必是数据集合内一个变量的名称 Format名 Need to distinguish the format name with the variables that are to be formatted.
PROC FORMAT; VALUE $clinic 'A' = 'Birmingham' 'B' = 'Chicago' 'C' = 'Minneapolis' 'D' = 'Pittsburgh' ; 请不要混淆format和将要被格式化的变量 Sas单从PROC FORMAT不能知道你想要格式化哪个变量。你需要用format语句对变量进行格式化。
* 定义的Format没有使用; PROC FREQ; TABLES clinic sebl_6; RUN; ========================================== * 使用了format; FORMAT clinic $clinic. sebl_6 se. ;
INFILE 'd:\data\tomhs3.dat'; Program 6 PROC FORMAT; ... DATA tdata ; INFILE 'd:\data\tomhs3.dat'; INPUT ptid $ clinic $ group sex educ eversmk alcbl sebl_1 sebl_6 ; run; LABEL clinic = 'Clinical Center'; LABEL group = 'Drug Treatment Group'; LABEL educ = 'Highest Education Attained'; LABEL sebl_1 = 'Patient Report Drowsiness'; LABEL sebl_6 = 'Patient Report Headaches'; LABEL alcbl = 'Alcoholic Drinks Per Week'; LABEL eversmk = 'Ever Smoke Cigarettes'; PROC FREQ DATA=tdata; TABLES clinic sebl_6; FORMAT clinic $clinic. sebl_6 se. ;
PROC MEANS DATA=tdata N MEAN STD; VAR alcbl; CLASS eversmk; ; FORMAT eversmk smoke. ; TITLE 'PROC MEANS With Variable and Value Labels'; RUN; The MEANS Procedure Analysis Variable : alcbl Alcoholic Drinks Per Week Ever Smoke N Cigarettes Obs N Mean Std Dev ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Smoker 48 47 5.3829787 6.4892995 Non-smoker 52 52 3.5384615 4.6292401
需掌握的要点 Format在使用之前需先定义 (PROC FORMAT). FORMAT 语句可应用定义的format. 数据步里的Label 和 format 语句可用于其后的 PROCs 过程步里面的Label 和 format 语句只能用于该过程步 PROC本身
为网络或Word/Excel文件创建HTML 表 ODS LISTING CLOSE; ODS HTML FILE='educ.html'; PROC FREQ DATA=tdata ; TABLES educ; FORMAT educ educ.; TITLE 'HTML Output From PROC FREQ'; RUN; ODS HTML CLOSE; The output will go to the html file rather than the .lst file. This can then be viewed on the web or inserted into word. Can also create RTF files. Note: Run statement is needed here, and ODS statement must be after run.
插入Word的html文件
教育级别变量educ的解释 创建一个新变量 grad1= 1 如果大学毕业 = 2 否则
Program 7 DATA tdata; INFILE '/home/ph5420/data/tomhs.data' ; INPUT @ 1 ptid $10. @ 49 educ 1. @123 sbp12 3. ; * 这样的话,缺失值被变换为2; if educ < 7 then grad1 = 2 ; else if educ >=7 then grad1 = 1 ; * 下面的两种方法结果一样且是正确的; if educ < 7 and educ ne . then grad2 = 2; else if educ >=7 then grad2 = 1; * IN 是SAS的一个特殊函数; if educ IN(1,2,3,4,5,6) then grad3 = 2; else if educ IN(7,8,9) then grad3 = 1;
PROC FORMAT; VALUE grad 1-6 = '< College Graduate' 7-9 = 'College Graduate'; VALUE sbpcat LOW - 119 = '< 120' 120-139 = '120-139' 140-HIGH = '140+' ; 注意: LOW 不包括缺失值
最初的变量值为 1-9 缺失值被变成2 PROC FREQ DATA=tdata; TABLES grad1 grad2 grad3 educ ; FORMAT educ grad.; RUN; grad1 Frequency Percent ---------------------------------- 1 43 43.00 2 57 57.00 grad2 Frequency Percent 1 43 43.43 2 56 56.57 Frequency Missing = 1 grad3 Frequency Percent educ Frequency Percent ----------------------------------------------- < College Graduate 56 56.57 College Graduate 43 43.43 最初的变量值为 1-9 缺失值被变成2
* 把变量sbp12 变换成 3 个级别; if sbp12 = . then sbp12c = . ; else if sbp12 < 120 then sbp12c = 1 ; else if sbp12 < 140 then sbp12c = 2 ; else if sbp12 >=140 then sbp12c = 3 ; With if-then-else definitions SAS stops reading after the first true statement
PROC FREQ DATA=tdata; TABLES sbp12c sbp12; FORMAT sbp12 sbpcat.; RUN; OUTPUT The FREQ Procedure Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ------------------------------------------------------------ 1 36 39.13 36 39.13 2 43 46.74 79 85.87 3 13 14.13 92 100.00 Frequency Missing = 8 sbp12 Frequency Percent Frequency Percent < 120 36 39.13 36 39.13 120-139 43 46.74 79 85.87 140+ 13 14.13 92 100.00
if sbp12 = . then sbp12c = . ; else * 代价惨重的简单错误; if sbp12 = . then sbp12c = . ; else if sbp12 < 120 then sbp12c = 1 ; else if sbp12 < 140 then sbp12 = 2 ; else if sbp12 >=140 then sbp12c = 3 ; PROC FREQ DATA=tdata; TABLES sbp12c; RUN; The FREQ Procedure Cumulative Cumulative sbp12cat Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 73.47 36 73.47 3 13 26.53 49 100.00 Frequency Missing = 51 How come no values of 2 and why so many missing?
创建新变量时需知道: 新变量的初始值是缺失值; 2. 缺失值 < 任何值; 3. 数值变量缺失值的指代值为.; 4. 类别变量缺失值的指代值为‘ ’; if sbp = . then ... if clinic = ' ' then ...
如何检查新建立的变量是对的? 显示新变量和原变量 PROC PRINT DATA=temp (OBS=20); VAR educ college ; 在新变量和原变量上运行PROC MEANS,确保两个变量的缺失值个数相同. PROC MEANS DATA=temp; VAR educ college; 在新变量和原变量上运行PROC FREQ. PROC FREQ DATA=temp; TABLES educ college;