Due to an unusual requirement, I need to make a connection between SAS Viya 4 and MS Power BI. I have found solutions for creating this type of connection with SAS 9.4. Unfortunately, neither on the forum nor elsewhere on the Internet do I see any useful material on this type of connection from the Viya system. An additional complication is that the connection must be made by reading data from in-memory (CAS) tables, but if that is impossible I would settle for any kind of connection. In summary: the in-memory tables in SAS Viya 4 are the data source, and Power BI analyzes them. I was wondering whether it is possible to use SAS OLAP Server for this, to integrate the data contained in SAS Viya with MS Power BI. However, OLAP Server is a desktop tool, and I will not hide that, ideally, there would be an alternative on-premises tool running on the browser side. I am aware of the differences between Viya and SAS 9.4, and of how this idea sounds. The question is: do you have any ideas on how to work around the associated inconveniences?
I am currently analyzing the impact of an intervention on medication counts using a difference-in-differences analysis, but I have encountered several challenges. Following the SAS support instructions, I conducted the difference-in-differences analysis. However, I noticed a discrepancy between my results and SAS's example (Usage Note 61830: Estimating the difference in differences of means). In the example, the 'Mean Estimate' in the 'Contrast Estimate Results' table is identical to the 'Estimate' in the 'Least Squares Means Estimate' table. In my case, however, these values differ. I suspect this is due to my use of a negative binomial distribution with a log link, resulting in exponentiated values. Consequently, I am unsure whether to rely on the 'Mean Estimate' in 'Contrast Estimate Results' or the 'Estimate' in 'Least Squares Means Estimate', and how to interpret the results.

Contrast Estimate Results
Label          Mean Estimate   Mean Confidence Limits   L'Beta Estimate   Standard Error
diff in diff   1.51            1.49                     0.41              0.0051

a*b Least Squares Means
a   b   Estimate   Standard Error   z Value    Pr > |z|
1   1   0.77       0.00434          178.19     <.0001
1   0   0.03       0.00508          6.50       <.0001
0   1   0.72       0.00408          177.71     <.0001
0   0   0.40       0.00426          93.11      <.0001

Least Squares Means Estimate
Effect             Label          Estimate   Standard Error   z Value   Pr > |z|
time*hospitalize   diff in diff   0.41       0.00509          81.01     <.0001
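For reference, note that exp(0.41) ≈ 1.51: the 'Mean Estimate' is the exponentiated contrast (the ratio scale implied by the log link), while the 'Estimate' in 'Least Squares Means Estimate' is on the log scale, which would explain the discrepancy. Below is a minimal sketch of requesting the contrast on both scales in PROC GENMOD; the variable names count, time, and hospitalize, and the availability of the EXP and ILINK options in your release, are assumptions:

/* Hedged sketch: 2x2 difference-in-differences contrast on both scales. */
proc genmod data=have;
   class time hospitalize / ref=first;
   model count = time|hospitalize / dist=negbin link=log;
   /* EXP adds the exponentiated contrast, i.e. the 'Mean Estimate' */
   estimate 'diff in diff' time*hospitalize 1 -1 -1 1 / exp;
   /* ILINK reports the inverse-linked value next to the log-scale estimate */
   lsmestimate time*hospitalize 'diff in diff' 1 -1 -1 1 / ilink;
run;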
Hello everyone, I am trying to select alternate columns from my dataset, starting from column 1 and including every other column after that; for example, columns 1, 3, 5, 7, and so on up to column n. Is there a way to make this selection without having to manually specify the column names? My column headers are: FINAL_STATUS_01MAY24, FINAL_STATUS_02MAY24, and so on...
I would appreciate any help or advice. Thank you!
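One possible approach, as a minimal sketch: query DICTIONARY.COLUMNS for the odd-numbered variable positions and build a KEEP list automatically (the WORK library and the dataset name HAVE are assumptions; adjust to your data):

/* Collect the names of columns 1, 3, 5, ... into a macro variable. */
proc sql noprint;
   select name into :oddcols separated by ' '
   from dictionary.columns
   where libname = 'WORK' and memname = 'HAVE'
     and mod(varnum, 2) = 1;   /* odd column positions only */
quit;

/* Keep only those columns. */
data want;
   set work.have (keep=&oddcols);
run;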
Hi everyone,
I am using PROC GENMOD to calculate the mean change in blood pressure from baseline to the first follow-up in association with tertiles of the Healthy Eating Index score. I am using the following code for this purpose:
proc genmod data=BP_long order=internal;
   class HEI_Tertile time var1 var2 var3 / ref=first;
   model ChangeBP = time HEI_Tertile var1 var2 var3 var4 var5 / type3 lrci;
   repeated subject=codea / type=exch corrw;
run;
However, I am getting unexpected results for the mean changes: the highest consumers of a healthy diet, in the top tertile, show a greater increase in BP. I am not sure whether this is related to the coding here or to the transposing of the data to a long format. I used the following for transposing, and then renamed the variable:
proc sort data=BP_wide;
   by id;
run;

proc transpose data=BP_wide out=BP_long;
   var BP1-BP2;
   by id;
run;
*Renaming the transposed variable;
data BP_long;
   set BP_long (rename=(col1=ChangeBP));
   if _Name_ = 'BP1' then Time = 1;
   else if _Name_ = 'BP2' then Time = 2;
   drop _Name_ _label_;
run;
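Before modeling, a minimal sanity check of the reshaped data can help rule out a transposing problem (a sketch using the variable names above):

/* Mean change in BP by HEI tertile and time point after the transpose.
   Assumes HEI_Tertile and the covariates were merged back onto BP_long,
   since PROC TRANSPOSE keeps only the BY variables, _NAME_, and COL1. */
proc means data=BP_long n mean;
   class HEI_Tertile time;
   var ChangeBP;
run;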
Thank you so much in advance.
Today, data quality is paramount. Every decision and every insight hinges on the reliability and accuracy of the underlying data.
Many of the tools and techniques for improving data quality are locked behind deep technical knowledge and programming skills in different programming languages. What if I told you it does not have to be that way?
Let me lead you into the world of Entity Resolution, where you will learn about techniques such as Data Identification, Parsing, Standardization, Matching, Clustering, and Surviving Records. Entity Resolution attempts to identify different representations of the same real-world entity and provides a normalized, standardized master record, which improves downstream analysis.
Entity Resolution has many different use cases, many of them within Fraud Detection and Anti-Money Laundering.
How do we identify the same entity within data and then enhance this entity with information from "contributor" rows, all without writing a single line of code, using only out-of-the-box functionality? Let us get started!
I will be using a table found in Databricks that has some obvious data quality issues, which I would not be able to solve with the tools available in Databricks unless I had very deep knowledge of SQL and Python programming, as well as of specific Python packages. You all know by now how easy it is to connect to Databricks, as explained in my good colleague Cecily Hoffritz's blog:
SAS and Databricks: Your Practical Guide to Data Access and Analysis - SAS Support Communities
The table I will be working with contains customers, both individuals and organizations. To keep it simple, I will focus on the individual customers and how I can find the entities, remove duplicates, standardize the data, and create my master records.
To help me on this journey I will be using the SAS Quality Knowledge Base (QKB) and SAS Studio for the engineering. The QKB contains a massive amount of predefined data quality rules and logic, and in SAS Studio I can create modern, visual data pipelines.
Identification
The first step is to identify whether a customer is an individual or an organization, for which I use the Clean Data Step and its Identification Analysis.
Using the Name column to identify whether it is an Individual or Organization
Once identified, I use the Branch Rows Step to split individuals and organizations into separate tables.
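For readers who prefer code, here is a minimal sketch of the same logic using the DQIDENTIFY function; the definition name 'Individual/Organization', the Swedish locale code SVESWE, and the CUSTOMERS table are assumptions for illustration, and the QKB locale must be loaded in the session:

/* Coded counterpart of the Identification Analysis + Branch Rows steps. */
data individuals organizations;
   set customers;
   length id_type $20;
   /* 'Individual/Organization' definition and SVESWE locale are assumed */
   id_type = dqIdentify(name, 'Individual/Organization', 'SVESWE');
   if id_type = 'INDIVIDUAL' then output individuals;
   else output organizations;
run;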
Parsing
To make any sense of this data, I need a way of Parsing (dividing) it into tokens (components). I can then apply different rules to these tokens, as the information in them varies and must be treated differently.
Fortunately, there is a SAS Studio component (step) called Parse Data that does just this for me!
As my data is in Swedish, I will be using the Swedish Locale from the QKB to Parse the columns Name and Address.
From the Name column, I keep the Given Name (Name_GIVENNAME) and Family Name (Name_FAMILYNAME) tokens. Extra tokens, such as title, additional info, prefix, and suffix, can also be added, but they are not needed for this Entity Resolution example.
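In code, the same parsing could be sketched with the DQPARSE and DQPARSETOKENGET functions; the 'Name' parse definition, the token names, and the SVESWE locale are assumptions based on the QKB:

/* Extract the Given Name and Family Name tokens from the Name column. */
data individuals_parsed;
   set individuals;
   length parsed $200 Name_GIVENNAME Name_FAMILYNAME $50;
   parsed = dqParse(name, 'Name', 'SVESWE');
   Name_GIVENNAME  = dqParseTokenGet(parsed, 'Given Name',  'Name', 'SVESWE');
   Name_FAMILYNAME = dqParseTokenGet(parsed, 'Family Name', 'Name', 'SVESWE');
run;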
From the Address column, I keep the Street Name (Address_STREETNAME), City (Address_CITY), Postal Number, and Street Number tokens. Notice that the Street Name has standardization issues, and there are missing values for City and Postal Number. Let us fix that!
Standardization
For the standardization issues I will use the Clean Data Step and its Standardization capabilities.
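The coded counterpart is the DQSTANDARDIZE function; the 'Address' standardization definition and the SVESWE locale are assumptions, so check the definitions shipped with your QKB:

/* Standardize the parsed street name, e.g. unifying abbreviations.
   Assumes the Address column was parsed analogously to Name above. */
data individuals_std;
   set individuals_parsed;
   Address_STREETNAME = dqStandardize(Address_STREETNAME, 'Address', 'SVESWE');
run;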
Now that I have identified, parsed, and standardized my data, I am halfway through my Entity Resolution process.
Match Code Creation
The next step is to create Match Codes. A match code is an encrypted string that represents portions of the original input string. During match code creation, techniques such as phonetic rules, noise-word removal, standardization, and normalization are applied.
You can use different sensitivity levels to determine how much information is stored in the match code. Use lower sensitivity levels to sort data into general categories, or higher sensitivity levels for a closer match.
Phonetic rules are applied in the matching process and will, for example, translate Patric, Patrick, and Patrikk to Patrik.
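In code, match codes can be generated with the DQMATCH function; the 'Name' match definition, the sensitivity of 85, and the SVESWE locale are assumptions for illustration:

/* Create a match code for the name; similar names get the same code. */
data individuals_mc;
   set individuals_std;
   name_matchcode = dqMatch(name, 'Name', 85, 'SVESWE');
run;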
Clustering
With the match codes generated, I move on to clustering the data. I create two different clusters: one based on the match codes from the Name column, and one based on the combination of match codes from Street Name and Street Number. Rows with the same match code are put into the same cluster, so the result is a Name cluster and an Address cluster.
For clustering I use a Custom Step called DQ – Clustering, which is publicly available in the SAS GitHub repository for Custom Steps.
Once the clustering process has executed, we can clearly see that we have two different persons (NAME_CLUSTER) living at the same address (ADDRESS_CLUSTER).
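A rough coded counterpart is PROC DQMATCH, which can assign cluster numbers from one or more match criteria; the sensitivity, the 'Name' definition, and the SVESWE locale are again assumptions:

/* Cluster rows whose names produce the same match code; an address
   cluster on Street Name + Street Number would be built analogously. */
proc dqmatch data=individuals_std out=name_clustered
             cluster=name_cluster locale='SVESWE';
   criteria matchdef='Name' var=name sensitivity=85;
run;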
Master Record Creation
Now it is time for me to create a master record for each person and enhance it with even better information from "contributor" rows. Looking closer at the Address column, I see that there is missing information: some rows lack a Postal Number, while others are missing the City. This means it is cherry-picking time, where I pick the cherries (values) that fit best!
To aid me in the cherry-picking process, I use a Custom Step from the SAS GitHub site called DQ – Surviving Record. This step allows me to apply rules when creating a surviving (master) record for each entity (person). As an example, I let the majority (highest occurrence) of "contributor" rows decide which Given Name is used for each cluster. I also do not allow any missing values in the columns Address_CITY and Address_POSTALNUMBER.
Based on the rules in the DQ – Surviving Record step, values are selected for each cluster, and a Master Record from each cluster will be created.
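For a feel of what such survivorship rules do, here is a deliberately simplified coded stand-in; it applies only the no-missing-values rule (MAX over a group skips blanks), omits the highest-occurrence rule for the Given Name, and assumes the clustered table and column names from above:

/* Pick one non-missing City and Postal Number per name cluster. */
proc sql;
   create table master_records as
   select name_cluster,
          max(Address_CITY)         as Address_CITY,
          max(Address_POSTALNUMBER) as Address_POSTALNUMBER
   from name_clustered
   group by name_cluster;
quit;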
And once the cherry-picking process is done, I can review the fruit of my labor in my two master records.
Summary
We have now traveled through the Entity Resolution process, all the way from our Databricks table with poor data quality to a refined result with master records, ready to be used for downstream analytics. And I did not have to write a single line of code!
Of course, in SAS Studio I can access all these capabilities with code as well, if that is what I prefer.
But in the era of democratization, why not democratize functionality as well as data?
Learn more about SAS and Databricks
Harness the analytical power of your Databricks platform with SAS
Data everywhere and anyhow! Gain insights from across the clouds with SAS
Elevated efficiency and reduced cost: SAS in the era of Cloud Adoption
SAS and Databricks: Your Practical Guide to Data Access and Analysis
Data to Databricks? No need to recode - get your existing SAS jobs to SAS Viya in the cloud
Maximize Coding and Data Freedom with SAS, Python and Databricks