18
Jul

How to sample textual data with SAS

Last year, when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In it, the author points out the common problems caused by traditional sampling methods and proposes four sampling methods for textual data. Recently my team has been working on a project with a huge volume of documents from a specific field, and we need linguists and domain experts to analyze the textual data and annotate ground truth. Our first question, then, is which documents we should start with to get a panoramic view of the data with minimal effort. Frankly, I don’t have a state-of-the-art method for extracting representative documents and measuring its effect, so why not try this innovative technique?

The paper proposes four sampling methods. I tried only the first one, which uses cluster memberships as strata. Before we step into the details of the SAS program, let me introduce the steps of this method.

  • Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
  • Step 2: Generate term-by-document matrix
  • Step 3: Cluster documents through k-means algorithm
  • Step 4: Get top k terms of each cluster
  • Step 5: Do stratified sampling by cluster

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final cluster result, you can tune the parameters of any step and rerun that step and the steps after it. Now let's see how to do this using SAS Viya to extract samples from a movie review data set.

The movie review data has 11,855 observations and 200,963 tokens. After removing stop words, there are 18,976 terms. In this example, I set the dimension size of the term-by-document matrix to 3000, which means that I use the 3000 terms with the highest TF-IDF values in the document collection as its dimensions. Then I use k-means clustering to group the documents into K clusters, and I set the maximum K to 50 with the kClus action in CAS. The dataSegment action can cluster documents directly, but it cannot choose the best K; you need to run it with different K values and choose the best K yourself. In contrast, the kClus action chooses the best K automatically among the K values between the minimum K and maximum K, so I use the kClus action in my implementation.
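For reference, the weight that the getTfidf macro in the appendix assigns to a term t in a document d can be written as follows. Here tf(t, d) corresponds to the `_tf_` column (the raw term count), df(t) to `_NumDocs_` (the number of documents containing the term), and N to `totalDocs`; the symbols themselves are my own notation, not the paper's.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\!\left(\frac{N}{\mathrm{df}(t)}\right)
```

As a purely hypothetical illustration, a term that occurs twice in a review and appears in 100 of the 11,855 reviews would get a weight of 2 × ln(11855 / 100) ≈ 9.55. The DocToVectors macro then sums these weights per term over all documents and keeps the dimSize terms with the largest sums as matrix dimensions.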

After running the program (full code at the end of this post), I got 39 clusters. Table-1 shows the top 10 terms of the first cluster.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents, and each document contains either the term "predictable" or the term "emotional."

Samples from cluster

I set sampPct to 5, which means that 5% of the data is randomly selected from each cluster. In the end I got 582 sample documents. Let's check the sample distribution across clusters.
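For readers without access to CAS, the same idea (draw a fixed percentage at random from every cluster) can be sketched in Base SAS with PROC SURVEYSELECT from SAS/STAT. This is only an illustration, not the code behind the numbers above: the toy table work.doc_clusters and its random cluster assignments are made up, and per-cluster rounding may differ from the CAS stratified action used in the appendix.

```sas
/* Toy input (hypothetical): 1,000 documents assigned at random to 5 clusters */
data work.doc_clusters;
   call streaminit(12345);
   do _document_ = 1 to 1000;
      _cluster_id_ = ceil(5 * rand('uniform'));
      output;
   end;
run;

/* The STRATA statement expects the input sorted by the strata variable */
proc sort data=work.doc_clusters;
   by _cluster_id_;
run;

/* Simple random sample of 5% within each cluster */
proc surveyselect data=work.doc_clusters
                  out=work.doc_sample
                  method=srs
                  samprate=0.05
                  seed=12345;
   strata _cluster_id_;
run;

/* Sample size per cluster */
proc freq data=work.doc_sample;
   tables _cluster_id_;
run;
```

Because the rate is applied inside each cluster, small clusters still contribute documents, which is the point of using cluster membership as strata.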

Donut chart of cluster samples

This clustering method helped us intelligently select a small set of documents from a large collection, and most importantly it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to give them a try and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=&stemming
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
quit;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=2500, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=10, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
 
/* Browse topk terms of the first cluster */
proc cas;
fetch / 
   table={name="cluster_topk_terms",
          where="_cluster_id_=1"};
run;
quit;
 
/* Stratified sampling by cluster      */
%strSampleByCluster(
   docDS=&myData, 
   docClusterDS=doc_clusters, 
   docVar=_document_, 
   clusterVar=_cluster_id_, 
   seed=12345,   
   sampPct=5,
   dsOut=doc_sample_by_cls
);
 
/* Browse sample documents of the first cluster */
proc cas;
fetch / 
   table={name="doc_sample_by_cls",
          where="_cluster_id_=1"};
run;
quit;

How to sample textual data with SAS was published on SAS Users.

8
Jun

SAS Studio: A new way to program in SAS

SAS Studio is the latest way to access SAS. This newer interface lets users reach SAS through a web browser and offers a number of unique ways to get more out of SAS. At SAS Global Forum 2018, Lora Delwiche (SAS) and Susan J Slaughter (Avocet Solutions) gave the presentation, “SAS Studio: A New Way to Program in SAS.” This post reviews the paper and offers insights into how to improve your SAS Studio programming.

This new interface is a popular one, as it is included in Base SAS and used for SAS University Edition and SAS OnDemand for Academics. It can be considered a self-service system: you write programs in SAS Studio itself, SAS processes them, and the results are delivered back to you. Because it is easy to reach from a range of computers, it is in high demand, which is why you should learn how to make the most of it.

How to operate

A SAS server processes your code and returns the results to your browser. In Programmer mode you can view Code, Log, and Results. You write your code on the right side of the screen, and the toolbar gives you access to the many tools on offer.

Libraries are used to access your SAS data sets, and they also let you see the variables contained in each data set. You can create your own libraries and set the path to your folder through SAS Studio.
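In code, creating a library comes down to a LIBNAME statement. The sketch below is a minimal example; the libref mylib and the folder path are placeholders I made up, so replace them with a folder your SAS server can actually reach.

```sas
/* Hypothetical path: point this at a folder your SAS server can read */
libname mylib "/home/your-user-id/mydata";

/* List the data sets in the library without per-variable detail */
proc contents data=mylib._all_ nods;
run;

/* Show the variables of one data set; SASHELP ships with SAS */
proc contents data=sashelp.class;
run;
```

Once the LIBNAME statement has run, the library should appear under Libraries in the navigation pane alongside the built-in ones.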

The navigation pane can also be used to view each data set: right-click the data set name and select “Open.” The data set view can be adjusted in a number of ways: columns can be rearranged by dragging their headings, column widths can be resized, the arrows in the top right corner show more information, and clicking a column heading sorts the data by that column.

Filters make it easy to narrow down your data. To apply one, right-click the column heading and select the filter that best fits your needs.

How to successfully code

A unique feature of SAS Studio is its code editor, which can automatically format your code. Clicking the format icon puts each statement on its own line and indents it properly. In addition, syntax help pops up as you type and suggests possible syntax; you can turn this on or off in the Preferences window.

One particularly useful tool is the snippet tool, which lets you save frequently used code and paste it where you need it.

Implementing and Results

After the code is written and run, the Log tab helps you review how your code ran, while the Results tab shows the output your code produced. The Results tab gives you shareable items that can be saved or printed for analysis.

Conclusion

These insights offer just a glimpse of all of the capabilities in programming through SAS Studio. Through easy browser access, your code can be shared and analyzed with a few clicks.

Additional Resources

Additional SAS Global Forum Proceedings
SAS Studio Videos
SAS Studio Courses
SAS Studio Programming Starter Guide
SAS Studio Blogs
SAS Studio Community

Other SAS Global Forum Programming Papers of Interest

Code Like It Matters: Writing Code That's Readable and Shareable
Paul Kaefer

Identifying Duplicate Variables in a SAS ® Data Set
Bruce Gilsen

Macros I Use Every Day (And You Can, Too!)
Joe DeShon

Merge with Caution: How to Avoid Common Problems when Combining SAS Datasets
Joshua M. Horstman

SAS Studio: A new way to program in SAS was published on SAS Users.

1
Jun

Is there a “Big Red Button” to use The SAS Platform?

For software users and SAS administrators, the question often becomes how to streamline their approach into the easiest-to-use system that most effectively completes the task at hand. At SAS Global Forum 2018, the idea of a “Big Red Button” got audience members asking: is there a way to have just a few clicks complete all the stages of the software administration lifecycle? In this article, we review Sergey Iglov’s SAS Global Forum paper “A ‘Big Red Button’ for SAS Administrators: Myth or Reality?” to get a better understanding of what this could look like, and how it could change administrators’ jobs for the better. Iglov is a director at SASIT Limited.

What is a “Big Red Button?”

With the many different ways the SAS Platform can be utilized, there is a question as to whether there is a single process that can control “infrastructure provisioning, software installation and configuration, maintenance, and decommissioning.” It has been believed that each of these steps has a different process; however, as Iglov concluded, there may be a way to integrate these steps together with the “Big Red Button.”

This mystery “button” that Iglov talked about would allow administrators to easily add or delete parts of the system and automate changes throughout; thus, the entire program could adapt to the administrator’s needs with a simple click.

Software as a System – SAS Viya and cloud-based technologies

Right now, SAS Viya is compatible with a centralized, automated deployment process. Through insights easily created and shared on the cloud, SAS Viya stands out, as users can access a centrally hosted control panel instead of needing individual installations.

Using CloudFormation by Amazon Web Services

At this point, the “Big Red Button” points toward systems such as CloudFormation. CloudFormation allows users of Amazon Web Services to lay out the infrastructure needed for their product visually, and easily make changes that will affect the software. As Iglov said, “Once a template is deployed using CloudFormation it can be used as a stack to simplify resources management. For example, when a stack is deleted all related resources are deleted automatically as well.”

Conclusion

Connected to SAS Viya, CloudFormation can install and configure the system and make changes to it. This would help SAS administrators adapt the product to their needs in order to derive intelligence from data. While the potential for a one-click button exists for many different platforms, cloud-based software and programs such as CloudFormation already enable users to go through each step of the SAS Platform’s administration lifecycle efficiently and effectively.

Additional Resources

SAS Viya Brochure
Sergey Iglov: "A 'Big Red Button' for SAS administrators: Myth or Reality?"

Additional SAS Global Forum 2018 talks of interest for SAS Administrators

A Programming Approach to Implementing SAS® Metadata-Bound Libraries for SAS® Data Set Encryption
Deepali Rai, SAS Institute Inc.

Command-Line Administration in SAS® Viya®
Danny Hamrick, SAS

External Databases: Tools for the SAS® Administrator
Mathieu Gaouette, Prospective MG inc.

SAS® Environment Manager – A SAS® Viya® Administrator’s Swiss Army Knife
Michelle Ryals, Trevor Nightingale, SAS Institute Inc.

Troubleshooting your SAS® Grid Environment
Jason Hawkins, Amadeus Software Limited

Multi-Factor Authentication with SAS® and Symantec VIP
Jody Steadman, Mike Roda, SAS Institute Inc.

OpenID Connect Opens the Door to SAS® Viya® APIs
Mike Roda, SAS Institute Inc.

Understanding Security for SAS® Visual Analytics 8.2 on SAS® Viya®
Antonio Gianni, Faisal Qamar, SAS Institute Inc.

Latest and Greatest: Best Practices for Migrating to SAS® 9.4
Alec Fernandez, Leigh Fernandez, SAS Institute Inc.

Planning for Migration from SAS® 9.4 to SAS® Viya®
Don B. Hayes, DLL Consulting Inc.; Spencer Hayes, Cached Consulting LLC; Michael Shealy, Cached Consulting LLC; Rebecca Hayes, Green Peach Consulting Inc.

SAS® Viya®: Architect for High Availability Now and Users Will Thank You Later
Jerry Read, SAS Institute Inc.

Taming Change: Bulk Upgrading SAS® 9.4 Environments to a New Maintenance Release
Javor Evstatiev, Andrey Turlov

Is there a “Big Red Button” to use The SAS Platform? was published on SAS Users.

29
May

Top 10 tips for SAS Enterprise Miner based on 20 years’ experience

SAS Enterprise Miner has been a leader in data mining and modeling for over 20 years. The system offers over 80 different nodes that help users analyze, score and model their data. With a wide range of functionalities, there can be a number of different ways to produce the results you want.

At SAS® Global Forum 2018, Principal Systems Engineer Melodie Rush spoke about her experience with SAS® Enterprise Miner™, and compiled a list of hints that she believes will help users of all levels. This article previews her full presentation, Top 10 Tips for SAS Enterprise Miner Based on 20 Years’ Experience. The paper includes images and further details of each of the tips noted below; I’d encourage you to check it out to learn more.

Top Ten Tips for Enterprise Miner

Tip 1: How to find the node you’re looking for

If you struggle finding the node that best fits what you need, there’s a system that can simplify it.

Nodes are organized by Sample, Explore, Modify, Model, and Assess. Find which of these best describes what you are trying to do, and scroll across each node alphabetically for a description.

Tip 2: Add node from diagram workspace

Double-click any node on the toolbar to see its properties. An example of the results this presents is shown below:

Top Ten Tips for Enterprise Miner

Tip 3: Clone a process flow

Highlight the process flow by dragging your mouse across it, right-click and copy (or press CTRL+C), and then paste (or press CTRL+V) where you want to insert the process flow.

Tip 4: New features

  • There’s a new tab, HPDM (High-Performance Data Mining), which contains several new nodes that cover data mining and machine learning algorithms.
  • There are two new nodes under Utility that incorporate Open Source and SAS Viya.
  • The Open Source Integration node allows you to use R language code in SAS Enterprise Miner diagrams.
  • A SAS Viya Code node now incorporates code that will be used in SAS Viya and CAS, and algorithms from SAS Visual Data Mining and Machine Learning.
  • To save and share your results, there are now the Register Model and Save Data nodes under Utility.
  • You can now register models to the SAS Metadata Server to score or compare easily.
  • A Save Data node lets you save training, validation, test, score, or transaction data as SAS, JMP, Excel, CSV or tab-delimited files.

Tip 5: The unknown node

The Reporter node under Utility lets you easily document your Enterprise Miner process flow diagrams. It creates a .pdf or .rtf file with an image of the process flow.

Tip 6: The node that changes everything

The Metadata node, on the Utility tab, allows you to change metadata information and values in your diagram. You also can capture settings to then apply to data in another diagram.

Tip 7: How to generate a scorecard

A scorecard emphasizes which variables and values from your model are important. Values are reported on a 0 to 1,000 scale, with higher values meaning the event you’re measuring is more likely to occur. To do this, have the Reporter node follow a Score node, and then change the Nodes property to Summary under the Reporter node properties.

Tip 8: How to override the 512 level limit

If you see the error message “Maximum target levels of 512 exceeded,” your target has more than 512 distinct levels. To get around this, you need to change EM_TRAIN_MAXLEVELS to another value. To do so, either change the macro value in properties or change the macro value in project start code.
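If you go the project start code route, the change is a single macro assignment. The snippet below is a minimal sketch; the value 1024 is only an example I picked, so choose a limit that fits the number of levels in your own data.

```sas
/* Project start code: raise the level limit before the nodes run.
   1024 is only an example value. */
%let EM_TRAIN_MAXLEVELS = 1024;
```

Project start code runs when the project opens, so the new limit applies to every diagram in that project.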

Tip 9: Which variable selection method should I use?

Instead of choosing just one variable selection method, you can combine different ones such as Decision Trees, Forward, Chi-Square, and others. The results can be combined using different selection properties, such as None (no changes made from original metadata), Any (reject a variable if any previous variable selection nodes reject it), All (reject a variable if all of the previous variable selection nodes reject it), and Majority (reject a variable if the majority of the variable selection nodes reject it).

Tip 10: Interpreting neural network

Decision trees can be produced to interpret neural networks by setting the Prediction variable as your Target and setting the original Target variable to rejected.

Conclusion

With so many options to create models that best suit your preferences, these tips will help sharpen your focus and allow you to use SAS Enterprise Miner more efficiently and effectively. This presentation was one in a series of talks on the Enterprise Miner tool presented at SAS® Global Forum 2018.

Additional Resources

SAS Enterprise Miner
SAS Enterprise Learning Tutorials
Getting Started With SAS Enterprise Miner Tutorial Videos

Additional SAS Enterprise Miner talks from Global Forum 2018

A Case Study of Mining Social Media Data for Disaster Relief: Hurricane Irma
Bogdan Gadidov, Linh Le, Analytics and Data Science Institute, Kennesaw State University

A Study of Modelling Approaches for Predicting Dropout in a Business College
Xuan Wang, Helmut Schneider, Louisiana State University

Analysis of Nokia Customer Tweets with SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
Vaibhav Vanamala MS in Business Analytics, Oklahoma State University

Analysis of Unstructured Data: Topic Mining & Predictive Modeling using Text
Ravi Teja Allaparthi

Association Rule Mining of Polypharmacy Drug Utilization Patterns in Health Care Administrative Data Using SAS® Enterprise Miner™
Dingwei Dai, Chris Feudtner, The Children’s Hospital of Philadelphia

Bayesian Networks for Causal Analysis
Fei Wang and John Amrhein, McDougall Scientific Ltd.

Classifying and Predicting Spam Messages Using Text Mining in SAS® Enterprise Miner™
Mounika Kondamudi, Oklahoma State University

Image Classification Using SAS® Enterprise Miner 14.1

Model-Based Fiber Network Expansion Using SAS® Enterprise Miner™ and SAS® Visual Analytics
Nishant Sharma, Charter Communications

Monte Carlo K-Means Clustering SAS Enterprise Miner
Donald K. Wedding, PhD Director of Data Science Sprint Corporation

Retail Product Bundling – A new approach
Bruno Nogueira Carlos, Youman Mind Over Data

Using Market Basket Analysis in SAS® Enterprise Miner™ to Make Student Course Enrollment Recommendations
Shawn Hall, Aaron Osei, and Jeremiah McKinley, The University of Oklahoma

Using SAS® Enterprise Miner for Categorization of Customer Comments to Improve Services at USPS
Olayemi Olatunji, United States Postal Service Office of Inspector General

Top 10 tips for SAS Enterprise Miner based on 20 years’ experience was published on SAS Users.

21
May

Technology that gets the most from the Cloud

SAS Viya is our latest extension of the SAS Platform and is interoperable with SAS® 9.4. Designed to bring analytics to the enterprise, it seamlessly scales for data of any size, type, speed and complexity. It was also a star at this year’s SAS Global Forum 2018. In this series of articles, we will review several of the most interesting SAS Viya talks from the event. Our first installment reviews Hadley Christoffels’ talk, A Need For Speed: Loading Data via the Cloud.

You can read all the articles in this series or check out the individual interviews by clicking on the titles below:
Part 1: Technology that gets the most from the Cloud.


Few would argue about the value the effective use of data can bring an organization. Advancements in analytics, particularly in areas like artificial intelligence and machine learning, allow organizations to analyze more complex data and deliver faster, more accurate results.

However, in his SAS Global Forum 2018 paper, A Need For Speed: Loading Data via the Cloud, Hadley Christoffels, CEO of Boemska, reminded the audience that 80% of an analyst’s time is still spent on the data. Getting insight from your data is where the magic happens, but the real value of powerful analytical methods like artificial intelligence and machine learning can only be realized once the data is loaded: the more “you shorten the load cycle the quicker you get to value.”

Data Management is critical and still the most common area of investment in analytical software, making data management a primary responsibility of today’s data scientist. “Before you can get to any value the data has to be collected, has to be transformed, has to be enriched, has to be cleansed and has to be loaded before it can be consumed.”

Benefits of cloud adoption

The cloud can help, to a degree. According to Christoffels, “cloud adoption has become a strategic imperative for enterprises.” The advantages of moving to a cloud architecture are many, but the two greatest are elasticity and scalability.

Elasticity, as defined by Christoffels, lets you dynamically provision or remove virtual machines (VMs), while scalability refers to increasing or decreasing capacity within existing infrastructure, either by scaling vertically (moving the workload to a bigger or smaller VM) or horizontally (provisioning additional VMs and distributing the application load between them).

“I can stand up VMs in a matter of seconds, I can add more servers when I need it, I can get a bigger one when I need it and a smaller one when I don’t, but, especially when it comes to horizontal scaling, you need technology that can make the most of it.” Cloud-readiness and multi-threaded processing make SAS® Viya® the perfect tool to take advantage of the benefits of “clouding up.”

SAS® Viya® can address complex analytical challenges and speed up data management processes. “If you have software that can only run on a single instance, then scaling horizontally means nothing to you because you can’t make use of that multi-threaded, parallel environment. SAS Viya is one of those technologies,” Christoffels said.

Challenges you need to consider

According to Christoffels, it’s important, when moving your processing to the cloud, that you understand and address existing performance challenges and determine whether the move will meet your business needs in an agile manner. Inefficiencies on-premises are annoying; inefficiencies in the cloud are annoying and costly, since you pay for that resource.

It’s not the best use of the architecture to take what you have on premise and just shift it. “Finding and improving and eliminating inefficiencies is a massive part in cutting down the time data takes to load.”

Boemska, Christoffels’ company, has tools to help businesses find inefficiencies and understand the impact users have on the environment, including:

  1. Real-time diagnostics looking at CPU Usage, Memory Usage, SAS Workload, etc.
  2. Insight and comparison provides a historic view in a certain timeframe, essential when trying to optimize and shave off costly time when working in cloud.
  3. Utilization reports to better understand how the platform is used.

Optimizing inefficiencies with SAS Viya

But scaling vertically and horizontally on cloud-based infrastructure to speed up loading and data management solves only part of the problem. Christoffels said SAS Viya capabilities complete the picture, offering a number of benefits in a cloud infrastructure. Code amendments that make use of the new techniques now available in SAS Viya, such as the multi-threaded DATA step or CAS action sets, can be extremely powerful.

One simple example of the benefits of SAS Viya, Christoffels said, is that with in-memory processing, PROC SORT is a procedure that’s no longer needed; SAS Viya does “grouping on the fly,” meaning you can remove sort routines from existing programs, which in itself can cut down processing time significantly.
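To make “grouping on the fly” concrete, here is a small sketch of BY-group processing in a DATA step that runs in CAS with no PROC SORT in front of it. It is my own illustration rather than code from the talk: the session name mysess, the libref mycas, and the use of sashelp.cars are assumptions, and the CAS statement presumes your site already has a CAS host and port configured.

```sas
/* Assumes a configured CAS server; session and libref names are placeholders */
cas mysess;
libname mycas cas caslib=casuser;

/* Load a sample table into CAS */
data mycas.cars;
   set sashelp.cars;
run;

/* BY-group processing without a preceding PROC SORT:
   CAS groups the rows by the BY variable when the step runs */
data mycas.make_counts;
   set mycas.cars;
   by make;
   if first.make then n_models = 0;
   n_models + 1;                 /* count rows within the current make */
   if last.make then output;     /* one output row per make */
   keep make n_models;
run;
```

Run against a Base SAS library, the second step would need the input sorted by make first; in CAS that sort step can simply be dropped.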

For a SAS programmer, the fact that SAS Viya can run multithreaded, the fact that you don’t have to do these sorts, the way it handles grouping on the fly, and the fact that multithreaded capability is built into how you deal with tables are all “significant,” according to Christoffels.

Conclusion

Data preparation and load processes have a direct impact on how applications can begin and subsequently complete. Many organizations are using the Cloud platform to speed up the process, but to take full advantage of the infrastructure you have to apply the right software technology. SAS Viya enables the full realization of Cloud benefits through performance improvements, such as the transposing of data and the transformation of data using the DATA step or CAS Action Sets.

Additional Resources

SAS Global Forum Video: A Need For Speed: Loading Data via the Cloud
SAS Global Forum 2018 Paper: A Need For Speed: Loading Data via the Cloud
SAS Viya
SAS Viya Products


Read all the posts in this series.

Part 1: Technology that gets the most from the Cloud

Technology that gets the most from the Cloud was published on SAS Users.

17
May

What makes a SAS User? Insight and Community: Josh Horstman

During SAS Global Forum 2018, I sat down with four SAS users to get their take on what makes a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS User? Insight and Community: Josh Horstman appeared first on SAS Learning Post.

9
May

Knowledge from the SAS family: SAS Global Forum 2018 papers and videos now available

For those of you who weren't able to attend SAS Global Forum 2018, you can still learn a lot from the content shared there. Gain knowledge from your SAS family: SAS Global Forum 2018 papers and videos are now available.

The post Knowledge from the SAS family: SAS Global Forum 2018 papers and videos now available appeared first on SAS Learning Post.

8
May

Top 10 bestselling titles at SAS Global Forum 2018

In this blog post you'll find out the top 10 bestselling titles at SAS Global Forum 2018.

The post Top 10 bestselling titles at SAS Global Forum 2018 appeared first on SAS Learning Post.

8
May

Which random number generator did Thanos use?

WARNING: This blog post references Avengers: Infinity War and contains story spoilers. But it also contains useful information about random number generators (RNGs) -- tempting! If you haven't yet seen the movie, you should make peace with this inner conflict before reading on.

Throughout the movie, Thanos makes it clear that his goal is to eliminate half of the population of every civilization in the universe. With the power of all six infinity stones imbued into his gauntlet, he'll be able to accomplish this with a "snap of his fingers." By the end of the film, Thanos has all of the stones, and then he literally snaps his fingers. (Really? I kept thinking that this was just a figure of speech he used to indicate how simple this will be -- but I guess it works more like the ruby slippers in The Wizard of Oz. Some clicking was required.)

So, Thanos snaps his huge fingers and -- POOF -- there goes half of us. Apparently the universe already had some sort of population-reduction subroutine just waiting for a hacker like Thanos to access it. Who put that there? Not a good plan, universe designer. (Check here to see if you survived the snap.)

But how did Thanos (or the universe) determine which of us was wiped from existence and which of us was spared? I have to assume that it was a seriously high-performing, massively parallel random number generator. And if Thanos had access to SAS 9.4 Maintenance 5 or later (part of the Power [to Know] stone?), then he would have his choice of algorithms.

(Tony Stark has been to SAS headquarters, but we haven't seen Thanos around here. Still, he's welcome to download SAS University Edition.)

Your own RNG gauntlet, built into SAS

I know a little bit about this topic because I talked with Rick Wicklin about RNGs. As Rick discusses in his blog post, a recent release of SAS added support for several new/updated RNG algorithms, including Mersenne twister, PCG, Threefry, and one that introduces hardware-based entropy for "extra randomness." If you want to save yourself some reading, watch our 10-minute discussion here.
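
If you want to see what "choice of algorithms" looks like in code, here is a minimal sketch, assuming SAS 9.4M5 or later. The macro name draw, the data set names, and the seed are my own illustrative choices; the RNG names MT64, PCG, and TF2 are the identifiers I understand CALL STREAMINIT accepts for the 64-bit Mersenne twister, PCG, and Threefry generators, so check the documentation for your release.

```sas
/* Hypothetical helper: draw 5 uniform values using the named RNG */
%macro draw(rng);
  data sample_&rng.;
    /* RNG name first, then the seed */
    call streaminit("&rng.", 2018);
    do i = 1 to 5;
      u = rand("Uniform");
      output;
    end;
  run;
%mend draw;

%draw(MT64)
%draw(PCG)
%draw(TF2)
```

Each data set holds a different stream of values for the same seed, which is a quick way to confirm that the algorithm name is actually taking effect.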

Implementing my own random Avengers terminator

I was going to write a SAS program to simulate Thanos' "snap," but I don't have a list of every single person in the universe (thanks GDPR!). However, courtesy of IMDB.com, I do have a list of the approximately 100 credited characters in the Infinity War movie. I wrote a DATA step to pull each name into a data set and "randomly decide" each fate by using the new PCG algorithm and the RAND function with a Bernoulli (binomial) distribution. I learned that trick from Rick's post about simulating coin flips. (I hope I did this correctly; Rick will tell me if I didn't.)

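%let algorithm = PCG;
data characters;
  /* RNG name comes first, then the seed (needs SAS 9.4M5 or later) */
  call streaminit("&algorithm.", 2018);
  infile datalines dsd;
  retain x 0 y 1;
  length Name $ 60 spared 8 x 8 y 8;
  input Name;
  /* 1 = spared, 0 = vanished, each with probability 0.5 */
  Spared = rand("Bernoulli", 0.5);
  /* lay the characters out on a grid, 10 per row, for the plot */
  x+1;
  if x > 10 then
    do; y+1; x = 1; end;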
datalines;
Tony Stark / Iron Man
Thor
Bruce Banner / Hulk
Steve Rogers / Captain America
/* about 96 more */
;
run;

After all of the outcomes were generated, I used PROC FREQ to check the distribution. In this run, only 48% were spared, not an even 50%. Well, that's randomness for you. On the universal scale, I'm not sure that anyone is keeping track.
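
The check itself is short. This is a minimal sketch rather than the exact step I ran, and it assumes the characters data set from the DATA step above. The BINOMIAL option is an addition for illustration: it tests whether the observed share of survivors is consistent with a true rate of 0.5.

```sas
/* Tabulate the simulated outcomes: 1 = spared, 0 = vanished */
proc freq data=characters;
  /* level='1' tests the proportion of spared characters against p=0.5 */
  tables Spared / nocum binomial(level='1' p=0.5);
run;
```

With only about 100 characters, a result a couple of points away from 50% is well within what independent coin flips produce.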

How many spared

Using a trick I learned from Sample 54315: Customize your symbols with the SYMBOLCHAR statement in PROC SGPLOT, I created a scatter plot of the outcomes. I included special Unicode characters to signify the result in terms that even Hulk can understand. Hearts represent survivors; frowny faces represent the vanished heroes. Here's the code:

data thanosmap;
  input id $ value $ markercolor $ markersymbol $;
  datalines;
status 0 red frowny
status 1 black heart
;
run;
 
title;
ods graphics / height=400 width=400 imagemap=on;
proc sgplot data=Characters noautolegend dattrmap=thanosmap;
  styleattrs wallcolor=white;
  scatter x=x y=y / markerattrs=(size=40) 
    group=spared tip=(Name Spared) attrid=status;
  symbolchar name=heart char='2665'x;
  symbolchar name=frowny char='2639'x;
  xaxis integer display=(novalues) label="Did Thanos Kill You? Red=Dead" 
    labelattrs=(family="Comic Sans MS" size=14pt);
    /* Comic Sans -- get it ???? */
  yaxis integer display=none;
run;

Scatter plot of spared

For those of you who can read, you might appreciate a table with the rundown. For this one, I used a trick that I saw on SAS Support Communities to add strike-through text to a report. It's a simple COMPUTE column with a style directive, in a PROC REPORT step.

proc report data=Characters nowd;
  column Name spared;
  define spared / 'Spared' display;
  compute Spared;
    if spared=1 then
      call define(_row_,"style",
        "style={color=green}");
    if spared=0 then
      call define(_row_,"style",
        "style={color=red textdecoration=line_through}");
  endcomp;
run;

Table of results

Remember, my results here were generated with SAS and don't match the results from the film. (I feel like I need to say that to preempt a few comments.) The complete code for this blog post is available on my public Gist.

Learn more about RNGs

Just as the end of Avengers: Infinity War has sent throngs of viewers to the Internet to find out What's Next, I expect that readers of this blog are eager to learn more about these modern random number generators. Here are the go-to articles from Rick that are worth your review:

Unanswered questions

Before Thanos completed his gauntlet, his main hobby was traveling around the cosmos reducing the population of each civilization "the hard way." With the gauntlet in hand when he snapped his fingers, did he eliminate one-half of the remaining population? Or did the universe's algorithm spare those civilizations that had already been culled? Was this a random sample with replacement or not? In the film, Thanos did not express concern about these details (typical upper management attitude), but the grunt-workers of the universe need to know the parameters for this project. Coders need exact specifications, or else you can expect less-than-heroic results from your infinity gauntlet. I'm pretty sure it says so in the owner's manual.
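
For the grunt-workers who want to see the difference, here is a minimal sketch using PROC SURVEYSELECT, which is not what I used in the simulation above. The data set names heroes, snap_srs, and snap_urs are made up for this illustration, and the four names are just the ones shown earlier. Note that my RAND("Bernoulli") approach is a third option: each character gets an independent coin flip, so the number eliminated is not guaranteed to be exactly half.

```sas
/* A tiny stand-in population (illustrative only) */
data heroes;
  infile datalines dsd;
  length Name $ 60;
  input Name;
datalines;
Tony Stark / Iron Man
Thor
Bruce Banner / Hulk
Steve Rogers / Captain America
;
run;

/* Without replacement: half of the rows are chosen, nobody twice.
   OUTALL keeps every row and adds a Selected flag (1 = chosen). */
proc surveyselect data=heroes method=srs samprate=0.5 seed=2018
                  out=snap_srs outall;
run;

/* With replacement: the same hero can be drawn more than once.
   OUTHITS writes one row per draw instead of one row per hero. */
proc surveyselect data=heroes method=urs samprate=0.5 seed=2018
                  out=snap_urs outhits;
run;
```

Whether a civilization that was already culled should be eligible for a second draw is exactly the with-replacement question, and the specification needs to answer it before anyone snaps.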

The post Which random number generator did Thanos use? appeared first on The SAS Dummy.

1
May

What makes a SAS user? Order, logic and magic: Louise Hadden

During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes them a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS user? Order, logic and magic: Louise Hadden appeared first on SAS Learning Post.
