tress

Hey everyone, I’m relatively new to data science and currently working on a project that involves a dataset with over 60 columns. Many of these columns are categorical, with more than 100 unique values each. My issue arises when I try to apply one-hot encoding to these categorical columns. It seems like I’m running into the curse of dimensionality problem, and I’m not quite sure how to proceed from here. I’d really appreciate some advice or guidance on how to effectively handle high-dimensional data in this context. Are there alternative encoding techniques I should consider? Or perhaps there are preprocessing steps I’m overlooking? Any insights or tips would be immensely helpful. Thanks in advance!
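
One alternative to one-hot encoding for high-cardinality columns is frequency encoding, which replaces each level with its relative frequency so that every categorical column stays a single numeric column. A minimal sketch in SAS, assuming a hypothetical dataset `have` with one categorical column `city` (both names, and the output names `city_freq` and `want`, are made up for illustration):

```sas
/* Hypothetical input: dataset HAVE with a high-cardinality column CITY. */

/* Step 1: count how often each level occurs. The OUT= dataset holds
   one row per level with the variables COUNT and PERCENT. */
proc freq data=have noprint;
  tables city / out=city_freq(keep=city percent);
run;

/* Step 2: attach the relative frequency (0-1 scale) to every row, so the
   model sees one numeric column instead of 100+ dummy columns. */
proc sql;
  create table want as
  select a.*, b.percent / 100 as city_freq
  from have as a
  left join city_freq as b
    on a.city = b.city;
quit;
```

Grouping rare levels into a single "Other" category before encoding, or target encoding, are other options; which one works best depends on the model and the data.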

Arjayita

This code checks the change in the likelihood ratio between two models:

/*Model 1*/
proc logistic data=dormant_pca_cc;
  class BU_419_DQ E1_B_07;
  model FCS_bad_9m_Jun19 = BU_419_DQ BU_26_AB_NUM BU_345_HN_NUM RR_CHR_105 BU_162_GG_NUM RR_CHR_169 BU_799_TEB_NUM RR_CHR_111 BU_301_PL_NUM BU_1552_SHC_NUM BU_1577_RIC_NUM RR_CHR_122 E1_B_07 bu_670_uz_num;
  ods output GlobalTests=Globaltests_full;
run;

data _null_;
  set globaltests_full;
  if test = "Likelihood Ratio" then do;
    call symput("ChiSq_full", ChiSq);
    call symput("DF_full", DF);
  end;
run;

/*Model 2*/
proc logistic data=dormant_pca_cc2;
  class BU_419_DQ E1_B_07;
  model FCS_bad_9m_Jun19 = BU_419_DQ BU_26_AB_NUM BU_345_HN_NUM RR_CHR_105 BU_162_GG_NUM RR_CHR_169 BU_799_TEB_NUM RR_CHR_111 BU_301_PL_NUM BU_1552_SHC_NUM BU_1577_RIC_NUM RR_CHR_122 E1_B_07;
  ods output GlobalTests=Globaltests_reduced;
run;

data _null_;
  set globaltests_reduced;
  if test = "Likelihood Ratio" then do;
    call symput("ChiSq_reduced", ChiSq);
    call symput("DF_reduced", DF);
  end;
run;

data = LRT_result;
  LR = (&ChiSq_full - &ChiSq_reduced);
  DF = (&DF_full - &DF_reduced);
  p = 1 - probchi(chiSq,DF);
run;

I am getting the error below:

1    The SAS System    10:46 Sunday, April 28, 2024

1          ;*';*";*/;quit;run;
2          OPTIONS PAGENO=MIN;
3          %LET _CLIENTTASKLABEL='Test2';
4          %LET _CLIENTPROCESSFLOWNAME='Model Macro';
5          %LET _CLIENTPROJECTPATH='C:\Users\8522019\OneDrive - Lloyds Banking Group\Desktop\Likelihood test1.egp';
6          %LET _CLIENTPROJECTPATHHOST='MMD014713504257';
7          %LET _CLIENTPROJECTNAME='Likelihood test1.egp';
8          %LET _SASPROGRAMFILE='';
9          %LET _SASPROGRAMFILEHOST='';
10
11         ODS _ALL_ CLOSE;
12         OPTIONS DEV=SVG;
13         GOPTIONS XPIXELS=0 YPIXELS=0;
14         %macro HTML5AccessibleGraphSupported;
15         %if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
16         %mend;
17         FILENAME EGHTML TEMP;
18         ODS HTML5(ID=EGHTML) FILE=EGHTML
19         OPTIONS(BITMAP_MODE='INLINE')
20         %HTML5AccessibleGraphSupported
21         ENCODING='utf-8'
22         STYLE=HtmlBlue
23         NOGTITLE
24         NOGFOOTNOTE
25         GPATH=&sasworklocation
26         ;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
27
28         data = LRT_result;
           ____
           180
ERROR 180-322: Statement is not valid or it is used out of proper order.
29         LR =(&ChiSq_full - &ChiSq_reduced);
           __
           180
ERROR 180-322: Statement is not valid or it is used out of proper order.
30         DF = (&DF_full - &DF_reduced);
           __
           180
ERROR 180-322: Statement is not valid or it is used out of proper order.
31         p=1 - probchi(chiSq,DF);
           _
           180
ERROR 180-322: Statement is not valid or it is used out of proper order.
32         RUN;
33
34         %LET _CLIENTTASKLABEL=;
35         %LET _CLIENTPROCESSFLOWNAME=;
36         %LET _CLIENTPROJECTPATH=;
37         %LET _CLIENTPROJECTPATHHOST=;
38         %LET _CLIENTPROJECTNAME=;
39         %LET _SASPROGRAMFILE=;
40         %LET _SASPROGRAMFILEHOST=;
41
42         ;*';*";*/;quit;run;
43         ODS _ALL_ CLOSE;
44
45
46         QUIT; RUN;
47

Can anyone help me correct the code? That would be great!
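
The ERROR 180-322 messages most likely come from `data = LRT_result;`. A DATA step starts with `data <name>;`, without an equals sign, so SAS never opens a DATA step and then rejects each assignment statement that follows. A possible correction, sketched under the assumption that both PROC LOGISTIC steps have already run and created the four macro variables; it also passes `LR` to `probchi`, since no variable named `chiSq` exists in that step:

```sas
/* Likelihood ratio test: full model versus reduced model */
data LRT_result;
  LR = &ChiSq_full - &ChiSq_reduced;   /* difference in LR chi-square statistics */
  DF = &DF_full - &DF_reduced;         /* difference in degrees of freedom */
  p  = 1 - probchi(LR, DF);            /* upper-tail chi-square p-value */
run;
```

The comparison is only meaningful if both models are fitted to the same observations. The code above reads `dormant_pca_cc` for Model 1 and `dormant_pca_cc2` for Model 2, so it is worth checking that the two datasets contain identical rows.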

Robin_moon

I constructed an MMRM model as below:

proc mixed data = have;
  class id treatment week strata;
  model chg = base treatment week treatment*week treatment*strata / ddfm = KR;
  repeated week / subject = id type = UN;
  lsmeans treatment*week / cl alpha = 0.05 diff;
  ods output Tests3=tests3 lsmeans=lsmeans diffs=diffs;
run;

(1) id: the patient id;
(2) treatment: contains 2 groups, "drug" and "placebo";
(3) chg: change from baseline in HbA1c;
(4) base: the HbA1c value at baseline;
(5) week: contains 3 levels, "week8", "week16" and "week24";
(6) strata: stratification based on the median of baseline HbA1c, i.e. group1 = patients with baseline HbA1c <= the median; group2 = patients with baseline HbA1c > the median.

In general, I'd like to know how to test the interaction between strata and treatment groups. I am not sure I constructed the model correctly: when I looked at the results (as below), the differences in LS means were pretty close between strata, but the p-value for interaction was significant. (The differences in LS means come from the output dataset diffs, and the p-values for interaction come from the output dataset tests3.) I also checked the differences in LS means at week8 and week16. They were all pretty close between the 2 strata, but the p-values for interaction were all significant (p-value < 0.0001). So I am not sure I am interpreting the results correctly. Which results should look clearly different between strata if the p-value for interaction is statistically significant? Thanks.
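
The treatment-by-strata interaction asks whether the drug-minus-placebo difference is the same in both strata. One way to look at that directly, sketched with the variable names from the post, is to include the `strata` main effect alongside `treatment*strata` and request the LS-means of the interaction. The ESTIMATE coefficients assume the levels sort as drug, placebo and group1, group2; that ordering is an assumption and should be checked against the Class Level Information table. The label text is made up for illustration.

```sas
proc mixed data=have;
  class id treatment week strata;
  model chg = base treatment week strata
              treatment*week treatment*strata / ddfm=kr;
  repeated week / subject=id type=un;
  /* Treatment-by-strata cell means and their pairwise differences,
     averaged over weeks */
  lsmeans treatment*strata / cl diff;
  /* Difference of differences: (drug - placebo) in group1 minus
     (drug - placebo) in group2. Assumes the level order noted above. */
  estimate 'trt diff: group1 - group2'
           treatment*strata 1 -1 -1 1 / cl;
  ods output Tests3=tests3 lsmeans=lsmeans diffs=diffs;
run;
```

If the two within-stratum treatment differences are close, this estimate should be near zero. In a model that contains `treatment*strata` but no `strata` main effect, the Type 3 test for `treatment*strata` tests strata within each treatment, so a small p-value there may reflect the overall strata effect, not a difference in treatment effect between strata.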

Robin_moon

I saw some loop code the other day that generates files named output_&i (i.e. output_1, output_2, ..., output_n). At the end, it row-binds all the files:

data all_results;
  set output_:;
run;

It seems that we can list the files output_1, output_2, ..., output_n in a simple way by writing output_: with a ":", so I tried to row-bind my files, which all end with "_output" (i.e. 1_output, 2_output, 3_output), in a similar way:

data all_results;
  set :_output;
run;

But it failed. I'd like to know how to write this code correctly. Thanks!
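
The colon in `set output_:;` is a name prefix list: it matches every dataset whose name begins with `output_`. SAS has no equivalent suffix form, which is why `set :_output;` fails. A possible workaround is to build the list from `dictionary.tables`. The sketch assumes the datasets live in the WORK library, and `ds_list` is a made-up macro variable name. `nliteral` is used because names that start with a digit, such as `1_output`, are only valid as name literals under `options validmemname=extend`.

```sas
/* Collect every WORK dataset whose name ends in _output */
proc sql noprint;
  select catx('.', libname, nliteral(memname))
    into :ds_list separated by ' '
  from dictionary.tables
  where libname = 'WORK'
    /* case-insensitive match on the suffix, ignoring trailing blanks */
    and prxmatch('/_output\s*$/i', memname) > 0;
quit;

/* Stack all matching datasets */
data all_results;
  set &ds_list;
run;
```

Renaming the datasets so that the common part comes first (for example `output_1` in place of `1_output`) would let the simpler `set output_:;` form work.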

DingTao

I have a dataset with repeated measurements, and a mixed model for repeated measures (MMRM) is used for the analysis. The standard code to get the average change from baseline (note that here we are comparing baseline with the mean of the last 4 visits) and the associated p-value is posted below. Based on this, how can I test non-inferiority using a margin of -1, i.e. that the change from baseline is greater than -1? The null hypothesis is that this change is smaller than or equal to -1. Can anyone suggest how to write the SAS code for this test?

proc mixed data=mydata;
  class id visit sex(ref='F');
  model change = age sex baseline visit*baseline / ddfm=kr fullx;
  repeated visit / subject = id type = un;
  lsmeans visit;
  estimate 'Average' Intercept 1 age 1 sex 0.5 0.5
           visit 0 0 0 0 0 0.25 0.25 0.25 0.25
           base &basemean
           visit*base 0 0 0 0 0 &basemean1 &basemean1 &basemean1 &basemean1 / cl;
run;
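
A non-inferiority comparison against a margin of -1 can be framed as a one-sided t test of H0: average change <= -1 against H1: average change > -1, built from the estimate and standard error that the ESTIMATE statement already produces. A sketch, assuming `ods output Estimates=est;` has been added inside the PROC MIXED step above (`est`, `ni_test`, `margin`, `t_ni` and `p_ni` are hypothetical names):

```sas
data ni_test;
  set est;
  margin = -1;
  /* Shift the null value from 0 to the margin */
  t_ni = (Estimate - margin) / StdErr;
  /* One-sided p-value for H1: change > margin, using the DF
     reported by PROC MIXED for this estimate */
  p_ni = 1 - probt(t_ni, DF);
run;

proc print data=ni_test;
  var Label Estimate StdErr DF t_ni p_ni;
run;
```

Equivalently, with `/ cl alpha=0.05` on the ESTIMATE statement, a lower confidence limit above -1 corresponds to rejecting H0 at the one-sided 0.025 level.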