Suppose I sample 30 widgets from a manufacturing line over the course of a 24 hour day. The sampling times are not necessarily random or evenly distributed. Each widget is inspected and either passes or fails. So I have data like:
data have ;
input time time5. fail ;
format time time5. ;
cards ;
00:45 0
01:00 1
02:45 0
03:15 1
03:45 1
05:30 1
05:45 0
06:00 1
06:30 0
06:45 1
07:00 0
09:00 1
10:15 1
12:30 1
14:00 0
16:00 0
17:15 0
17:30 0
17:45 0
19:00 0
20:45 0
21:30 0
21:45 0
22:00 0
22:15 0
22:30 0
22:45 0
23:00 0
23:15 0
23:30 0
;
run ;
If I plot the data I get:
And looking at the plot, I see two clusters. There was a high failure rate between 0:00 and 12:30, and after 14:00 the failure rate was zero. Of course usually the pattern isn't as clear. I could have data like:
At first it looks like there may be three clusters, with a low failure rate clusters between 10:00 and 16:00. But then when you think about it there are only four data points in that period. If the failure rate was constant during the day (.3), it would not be unusual to have four consecutive successes (.7**4=.24). So maybe the correct interpretation is that there are no clusters in that data. Or maybe there are two clusters: 00:00-09:30 with a high failure rate and 10:00-24:00 with a lower failure rate.
I think what I'm looking for is a way to group the data into clusters (by time) where consecutive clusters have a failure rate significantly different than the failure rate of neighboring clusters.
Or maybe another thing I might want is to identify any clusters (by time) where I can conclude that the failure rate is, say, <.1 (with 95% confidence).
Would appreciate any thoughts/suggestions. I realize this is not a well-framed question. I feel like I want some sort of cluster analysis? ChatGPT recommended looking into "temporal scan" methods which apparently create different time window sizes and then compare them. But I didn't find much on that approach that seemed helpful.
... View more
Hello
I want to calculate for each numeric variable-
number of rows with null values
number of rows with zero values
number of rows with positive values
number of rows with negative values that not null
I want that the output will be out to a data set
I don't see the desired output
I also want to ask about the user format I have defined.
Is null value not included in both groups??
. ='Missing' Low-<0='Neg'
I want that null be only in first group (. ='Missing')
data have;
input x y z R $;
cards;
1 2 0 a
3 0 -2 b
. 3 3 c
. . . r
4 4 4 e
4 4 0 f
-1 -1 -1 g
;
run;
proc format;
value numeric_Fmt
. ='Missing'
0='Zero'
0<-High='POS'
Low-<0='Neg'
;
run;
proc freq data=have;
format _NUMERIC_ numeric_Fmt.;
tables _NUMERIC_ /out=want missing missprint nocum nopercent;
run;
... View more
Hello,
In want to calculate statistics for all numeric variables.
There are some statistics that I don't know how to calculate for each numeric variable -
Number of rows with zero value (nr_Zero)
Number of rows with missing value (nr_Null)
Number of rows with positive value (nr_POS)
Number of rows with negative value that is not null (nr_Zero)
What is the way to calculate it automatically for each numeric var?
proc means data=sashelp.class sum min max q1 median q3 P1 p5 p95 P99 P75 P25 cv SUM Range N mean std STACKODSOUTPUT maxdec=2;
var _numeric_;
ods output summary=Want1(Rename=(Variable=Var_Name));
run;
ods output nlevels=LEVELS;
proc freq data=sashelp.class nlevels;
tables _numeric_ / noprint ;
run;
Nr_Null
Nr_Zero
Nr_POS
Nr_Neg
... View more
Hello
I want to calculate the following statistics for each numeric variable:
nr,nr_POS,nr_Neg,nr_Nulls,nr_0,nr_distinct_values,AVG,STD,UCL,LCL,Sum,Min,Max,Q1,Median,
Q3,P1,P5,P95,P99,P75,P25,CV,Range
I want to identify automatically the numeric variables.
In this code I perform wide to Long change of data structure.
My question- Is there a better more efficient way to calculate the required statistics? May you show code?
In real life the data set has 2 million rows and 200 numeric variables .
I am not sure if this method of work (Long to wide) is efficient in this case
proc sql;
create table numeric_Vars as
select libname, memname, name, type, length,
format, informat, label
from dictionary.columns
where libname = 'SASHELP' and memname='CLASS' and upcase(type)='NUM'
;
quit;
proc sql noprint;
select name into : Numeric_Vars_List SEPARATED by ' '
from numeric_Vars
;
quit;
%put &Numeric_Vars_List.; /**Age Height Weight**/
proc sql noprint;
select count(*) as nr_numeric_Vars into :nr_numeric_Vars
from numeric_Vars
;
quit;
%put &nr_numeric_Vars.;
data Long_Structure_Numeric_Vars_Data(DROP=&Numeric_Vars_List. J);
set sashelp.class(KEEP=&Numeric_Vars_List.);
array vv{&nr_numeric_Vars.} &Numeric_Vars_List.; /* _NUMERIC_ */
do j=1 to dim(vv);
Var_Name = vname(vv(j));
Var_Value =vv(j);
output;
end;
run;
proc sql;
create table Want1 as
select Var_Name,
count(*) as nr,
sum(case when Var_Value>0 then 1 else 0 end ) as nr_POS,
sum(case when Var_Value<0 AND Var_Value=. then 1 else 0 end ) as nr_Neg,
sum(case when Var_Value=. then 1 else 0 end ) as nr_Nulls,
sum(case when Var_Value=0 then 1 else 0 end ) as nr_0,
count(distinct Var_Value) as nr_distinct_values,
mean(Var_Value) as AVG format=8.2,
STD(Var_Value) as STD format=8.2,
calculated AVG+calculated STD *3 as UCL,
calculated AVG-calculated STD *3 as LCL
from Long_Structure_Numeric_Vars_Data
group by Var_Name
;
quit;
proc means data=sashelp.class sum min max q1 median q3 P1 p5 p95 P99 P75 P25 cv SUM Range /**N mean std**/ STACKODSOUTPUT maxdec=2;
var _numeric_;
ods output summary=Want2(Rename=(Variable=Var_Name));
run;
data want_summary_Report;
merge Want1 Want2;
by Var_Name;
Run;
... View more
Hello
Let's say that I want to create a data set based on sas_help.class data set.
The new data set will have only 2 columns: Var_name , Var_Value
What is the way to do it?
I expect that the new data set will have 95 rows (because in sas_help.class have 19 rows with 5 columns so 19*5=95)
thanks
... View more