The LITI rule for Text Analytics that you didnât know you neededâŚuntil now!
Recent Library Articles
Recently in the SAS Community Library: Customer complaint call transcripts can end up being quite verbose. SAS' @PeterChristie reveals how to distill relevant info using SAS Text Analytics.
proc sql;
create table t1
as
select *
from
t2
where
name like 'PNB%'
and rn in
(select distinct rn from
t4)
and
datepart(DATE) <= '7Dec2002:00:00:00'd
;quit; I have a very big dataset and its taking forever to run and many times connections disrupts and I need to wait for a long again. can you please suggest a better way to do it. Here table T2 is like very big taking almost an hour to complete the query if everything goes well or need to restart many times . Thank you in advance kajal
... View more
The data we have on hand often contains the information we need, but not sequenced in the order required for processing. Reporting, comparing the data in one table to the data in another, and conducting merges or joins all require properly sequenced data. The technique that produces the desired result most quickly varies depending on factors such as how and where the data is stored, the size of the data, and the number of times ordered data will be read.
Data Sources
SAS data sets store data in the order it is written to the disk file. Many SAS processes, like DATA step and PROC PRINT, read and process the data in the order in which it was stored. Consider this data set and the desired report:
Crime data
Row
DATE_OCC
TIME_OCC
AREA
1
01-01-2024
0:00:05
3
2
01-01-2024
0:30:00
4
3
01-08-2024
0:00:16
5
4
01-16-2024
0:32:10
9
5
01-18-2024
0:33:35
7
6
01-26-2024
0:31:40
1
7
01-27-2024
0:13:20
6
8
01-27-2024
0:32:30
8
9
01-30-2024
0:26:40
2
Desired Report
AREA
DATE_OCC
TIME_OCC
1
01-26-2024
0:31:40
2
01-30-2024
0:26:40
3
01-01-2024
0:00:05
4
01-01-2024
0:30:00
5
01-08-2024
0:00:16
6
01-27-2024
0:13:20
7
01-18-2024
0:33:35
8
01-27-2024
0:32:30
9
01-16-2024
0:32:10
Note that the data is stored chronologically, but the report must be written in AREA order. No matter what technique we choose, sorting of the data will be required. So, what difference does it make if we choose one technique over the other? Today, we will look at several techniques we can use to produce this report, and their impact on resources and elapsed time. Some of these techniques hold the data in memory, and will be more useful when the data small enough to fit into memory all at once. Other techniques use significant amounts of disk space and disk I/O. These will be more useful when the data is too large to fit into available memory.
PROC SORT
If you need to read the sorted version of the data multiple times, you might use PROC SORT to create sorted a copy of the data stored in the specified order. For example, this code sorts the data âin placeâ, that is, it reproduces the original dataset stored in the new order:
Code:
proc sort data=la.crime_sample;
by area;
run;
proc print data=la.crime_sample noobs;
var area date_occ time_occ;
run;
Log:
NOTE: PROCEDURE SORT used (Total process time):
real time 0.10 seconds
cpu time 0.01 seconds
âŚ
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.07 seconds
cpu time 0.01 seconds
The process uses a total of 0.17 sec of elapsed time. Here are the simplified steps that happened during processing:
The dataset crime_sample was read into memory from disk.
The data was sorted.
The sorted data was written to disk using a temporary name, for example _temp1.
The original dataset, crime_sample, was then deleted, and
The temporary dataset was renamed crime_sample.
NOTE:
The entire data set is read into memory then written back out to disk. Disk I/O is usually the slowest part of the process, so we expect this to take longer than in-memory techniques.
Because the sorted data is stored on disk, we can subsequently access the data in this order many times without having to sort the data again.
At the end of step 3, we need enough disk space to hold 2 full copies of the data.
You could improve the process a little by using the OUT= option to write the results to a separate library instead of sorting in place. This eliminates the need to rename the output and delete the original. And if the other library is located on a separate disk, this can often improve I/O speed.
One drawback is the need to read the data from disk again each time the data is accessed. If you have enough memory, you can overcome this using the SASFILE statement. With the SASFILE statement, you can load the entire dataset into memory and then use the in-memory copy for subsequent processing without incurring additional disk I/O.
Code:
proc sort data=la.crime_sample
out=work.crime_sample;
by area;
run;
/* Use SASFILE to hold the data in memory */
sasfile work.crime_sample load;
proc print data=work.crime_sample noobs;
var area date_occ time_occ;
run;
Log:
NOTE: PROCEDURE SORT used (Total process time):
real time 0.09 seconds
cpu time 0.01 seconds
âŚ
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds
The process uses a total of 0.14 sec of elapsed time. The time required to sort was about the same, but it took less time for PROC PRINT to read the data because you are using an in-memory copy. Any subsequent step that reads this data will also benefit, because you donât have to get the data from disk each time you use it. But donât forget to close the file to release the memory when you are finished processing!
sasfile work.crime_sample close;
PROC SQL/FedSQL ORDER BY
If you only need the sorted values once, consider using SQL to avoid having to pre-sort the data. PROC SQL and PROC FedSQL will automatically sort the result set as specified by the ORDER BY clause, without the need to write an intermediate copy of the sorted data to disk.
Code:
proc sql;
select area
,date_occ
,time_occ
from la.crime_sample as cs
order by area
;
quit;
Log:
NOTE: PROCEDURE SQL used (Total process time):
real time 0.07 seconds
cpu time 0.00 seconds
The total elapsed time for this process was only 0.07 seconds â about half of that required for the PROC SORT / PROC PRINT combination.
DATA Step Hash Object
Now, we donât often think of using a DATA step to sort data, but itâs possible. Because weâre using a HASH object, weâll once again need enough memory to hold the entire dataset.
Code:
data _null_;
if (_n_ = 1) then do;
declare hash myhash(dataset: "la.crime_sample" /*Input dataset*/
,multidata: "yes" /*Allow duplicate keys*/
,ordered:"ascending"); /*Sort the data by key*/
rc = myhash.definekey('area'); /*Define the key*/
/*List the variables you want in the output dataset*/
rc = myhash.definedata('area','date_occ','time_occ');
myhash.definedone();
call missing(area, date_occ, time_occ); /*precludes a warning*/
end;
/*Write out the output dataset*/
rc = myhash.output(dataset:"work.crime_sample");
run;
Log:
NOTE: DATA statement used (Total process time):
real time 0.10 seconds
cpu time 0.01 seconds
The time required to sort the data is about the same as when using PROC SORT, and PROC SORT has more options and flexibility, so while it can be done, I wouldnât consider using this technique unless I needed the hash object for some other process in the DATA step.
Now, one of the primary reasons I've had to sort data in the past is to perform a DATA step merge. When a BY statement is present in SAS code (except for PROC SORT of course), SAS assumes you have pre-sorted the data, and an error is generated if an out-of-sequence value is encountered. Did you know that one of the joys of using SAS/ACCESS to work with database tables in SAS is that you never have to pre-sort database data? Data is retrieved from the database using generated SQL and, if a BY statement is found in the SAS code, an appropriate ORDER BY clause is automatically included. Pretty slick, no? Can you name other conditions where a BY statement doesn't require pre-sorted data? I can think of at least one đ
Until next time, may the SAS be with you! Mark
Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Part 1 â Append
... View more
Can some one please help to get rid off completely table of contents in proc ods pdf ?
ods pdf file="c:\test.pdf" contents=no;
proc report data=table contents=' ' headline nowindows;by id;
still it is showing left side table of contents as ID and values
I need ID in by variable od proc report but not in table of conents .
... View more
Hello, I am running a two-component mixture model via PROC FMM, and want to compare the coefficient of a variable from the first component to the coefficient of the same variable in the second component. I noticed PROC FMM does not have an ESTIMATE or CONTRAST statement available, and I am wondering if it is otherwise possible to implement this comparison. Using the example below, I would like to compare the coefficient for 'dose' in component 1 to the coefficient for 'dose' in component 2 (and ideally get a confidence interval as well). I appreciate any suggestions. data assay;
label dose = 'Dose of quinoline (microg/plate)'
num = 'Observed number of colonies';
input dose @;
logd = log(dose+10);
do i=1 to 3; input num@; output; end;
datalines;
0 15 21 29
10 16 18 21
33 16 26 33
100 27 41 60
333 33 38 41
1000 20 27 42
;
run;
proc fmm data=assay;
model num = dose logd / dist=Poisson k = 2;
run;
... View more
IND_F = ID1 >= ID2 >= ID3 ; One of the program says as below can someone please help what does that mean? They are all date values. I am seeing value of IND_F = 1 for few records not sure how ? Regards kajal
... View more