Recently in the SAS Community Library: SAS' @Sundaresh1 highlights a sometimes overlooked task when applying document embeddings for similarity-based search: normalising the vectors helps return relevant matches.
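A quick aside (my summary, not part of the linked article): normalisation matters because cosine similarity compares only vector directions, so once embeddings are scaled to unit length, a plain dot product already gives the cosine score:

\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \hat{u} \cdot \hat{v}, \qquad \hat{u} = \frac{u}{\lVert u \rVert}, \; \hat{v} = \frac{v}{\lVert v \rVert}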
The data platform landscape has seen many new entrants in recent years. One of the most popular is Databricks, which promotes the lakehouse concept: storage that brings together the best qualities of a data lake and a data warehouse. The beauty of versatile storage is flexible loading of almost any kind of data. Within that ease also lies the risk of uncontrolled data hoarding, as we already witnessed in the heyday of Hadoop. With great data comes great responsibility, and that responsibility is best implemented with data governance. SAS has helped many of our customers connect these two powerful data and analytics platforms and embrace them with seamless data access, data governance, and data quality.
Let’s face it: data is only useful when you can properly access it. SAS Viya provides many powerful ways to access all of today’s commonly used data sources. While SAS Viya integrates with most data sources, in this blog we focus on Databricks. It can be accessed through a dedicated Databricks data connection, the Spark connection, or the JDBC connection. I’m calling them connections for simplicity, while in most cases there is a SAS/ACCESS interface under the hood.
The image above is from SAS Viya’s Data Explorer and shows how simple it is to define new data connections. My colleague Cecily explains this hands-on in her blog: SAS and Databricks: Your Practical Guide to Data Access and Analysis. For clarity, SAS Viya’s connection to Spark is delivered with a JDBC driver for Databricks and enables out-of-the-box connectivity. Using this embedded driver, a Databricks connection can be defined either with a Spark LIBNAME statement that specifies the individual connection options, or with a Spark LIBNAME statement that specifies a JDBC URL for the target data source in the URL= option.
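To make the two variants concrete, here is a minimal sketch of both LIBNAME styles. The workspace host, HTTP path, and access token are placeholders, and exact option names and requirements can vary by SAS/ACCESS Interface to Spark release, so treat this as an illustration rather than ready-to-use configuration:

/* Variant 1: Spark LIBNAME with individual connection options (placeholder values) */
libname dbx1 spark
   platform=databricks
   server="adb-0000000000000000.0.azuredatabricks.net"  /* placeholder workspace host */
   port=443
   schema="default"
   httpPath="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"       /* placeholder SQL warehouse path */
   user="token"
   password="dapiXXXXXXXX";                              /* placeholder personal access token */

/* Variant 2: Spark LIBNAME that passes a full JDBC URL in the URL= option */
libname dbx2 spark
   platform=databricks
   url="jdbc:databricks://adb-0000000000000000.0.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/xxxxxxxxxxxxxxxx;AuthMech=3;UID=token;PWD=dapiXXXXXXXX";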
SAS/ACCESS interfaces in general offer very good performance, but finding the optimal configuration may take some planning and testing, as a connection from SAS Viya to Databricks is possible through at least the ODBC, JDBC, and Spark interfaces. SAS continuously updates and improves the SAS/ACCESS interfaces to provide continued compatibility and optimized performance.
Once connected to the data, SAS Viya provides capabilities to track the lineage of data from various sources, such as Databricks. This helps organizations understand how data is used and where it comes from, supporting transparency and traceability. Lineage also helps you understand the effect of planned changes to the data process. A typical scenario would be a requested change to the data model of a source table; that change of course needs to be carried through the data process all the way to the end result. The image below shows a simple data process. By following the steps in the lineage flow we learn the following things:
Source tables are accessed in Databricks with the target table under the same schema
A 2-table join is executed by Databricks, and result data also remains in Databricks
The result table is then loaded into SAS Viya’s in-memory engine, CAS (Cloud Analytic Services)
A SAS Visual Analytics report has been created based on the CAS in-memory table
In addition to providing lineage, SAS Viya offers robust access control mechanisms to ensure that only authorized users have access to Databricks (and any other) data that has been introduced to SAS Viya. SAS Viya allows organizations to define and enforce their data governance policies, including policies related to data quality, security, and compliance. With role-based access control and tight integration with enterprise authentication systems, SAS Viya can smooth your access to Databricks. Single sign-on is available to authenticate connections from SAS Viya to Databricks in Azure by utilizing a Microsoft Entra ID token that SAS Viya’s credential services obtain and use to allow seamless access.
SAS Viya supports data quality monitoring and profiling of any data, including data in Databricks. This makes it simple for data engineers and data stewards to assess and monitor the quality of data, identify data anomalies, and take corrective actions with the right tools, for example with the Clean Data and Parse Data steps in SAS Studio flows.
My colleague Patric has explained the data quality process in detail in his blog: Data Brilliance Unleashed: SAS Data Quality against Databricks - Precision, Performance, Perfection. In it, Patric takes you through the whole quality improvement process, including identification, splitting, standardization, match code creation, clustering, and entity resolution, so it’s a wholeheartedly recommended read!
For those developers who prefer to do their data quality in code, SAS Studio includes a collection of data quality code snippets. They can be run as-is or embedded into SAS Studio flows as code steps. You can read more about efficient use of snippets here: Working with snippets
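As an illustration of what code-based data quality can look like, here is a minimal sketch (not one of the shipped snippets verbatim; it assumes the Quality Knowledge Base and the ENUSA locale are already configured in your environment, and WORK.CUSTOMERS with a NAME column is a made-up input table):

/* Standardize names and create match codes with SAS Data Quality functions */
data work.customers_dq;
   set work.customers;                                  /* hypothetical input table       */
   length name_std $60 name_mc $255;
   name_std = dqStandardize(name, 'Name', 'ENUSA');     /* standardize the raw name value */
   name_mc  = dqMatch(name_std, 'Name', 85, 'ENUSA');   /* match code for fuzzy grouping  */
run;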
A data catalog is a central metadata repository that helps users discover, understand, and manage all their data assets. SAS Viya includes SAS Information Catalog to discover catalogued data from any supported data source, for example Databricks, making it much easier for data users to find relevant data and understand its context. If you come from a SAS 9 background, you are most likely familiar with the concept of metadata. While SAS Viya does not have a Metadata Server like SAS 9’s to manage both technical and business metadata, rest assured, it’s still there. SAS Information Catalog is based on discovery agents, set up by the platform administrator, that work hard to gather metadata on the data assets connected to your environment.
As SAS Viya’s discovery agents gather the metadata, they also go through a data profiling process that extracts the descriptive data metrics and quality indicators on your data. SAS Information Catalog provides a centralized view of all your metadata, thus helping you to understand the characteristics of your data. A good understanding of the total data asset is key in building comprehensive data governance, and the asset dashboard in SAS Information Catalog does exactly that:
The above image is borrowed from the great blog post SAS Information Catalog: All your information assets under one roof by my colleague Rajeeve Narula. What is great about SAS Information Catalog is that once you find the data you’re looking for, you can instantly view the analyzed data metrics with a one-click drill-down; an example of a typical column-level analysis is below:
Much like the SAS Viya platform in general, SAS Information Catalog provides REST APIs accessible from SAS, Python, or shell scripts. The SAS Information Catalog REST APIs enable searching for and identifying files, tables, and other assets based on specific criteria. Developers and data engineers can leverage these APIs to incorporate files and tables into data management tasks, as well as to trigger actions or automate workflows. The APIs gather insightful metadata and provide a comprehensive view of your data landscape, giving data users a high-level overview of their assets and enabling informed decision making. Depending on your task, you can interact with several REST endpoints; examples are shown in the image below:
For further insight into how to utilize SAS Information Catalog’s REST APIs, have a look at my colleague Bogdan Teleuca’s informative blog post: Leveraging SAS® Information Catalog REST APIs: Programmatically Discovering Data.
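To give a flavor of a programmatic call, here is a minimal PROC HTTP sketch. The host name and access token are placeholders, and the /catalog/search endpoint and its q= parameter are assumptions on my part; check Bogdan's post or the SAS Viya REST API documentation for the exact endpoints and payloads:

/* Query SAS Information Catalog over REST (placeholder host, token, and endpoint) */
filename resp temp;

proc http
   url="https://viya.example.com/catalog/search?q=databricks"  /* assumed endpoint  */
   method="GET"
   out=resp;
   headers "Authorization"="Bearer &access_token."             /* placeholder token */
           "Accept"="application/json";
run;

/* Read the JSON response with the JSON libname engine */
libname catrsp json fileref=resp;
proc print data=catrsp.alldata(obs=10); run;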
Having firm control of your metadata is crucial, but it’s difficult to understand the big picture without a link to the real world. This is where a data glossary comes in with the ability to manage your business terms and link them with your data assets. SAS Information Catalog has a glossary component that enables you to manage business terms and more importantly build connections to the actual data assets. The glossary supports a collaborative approach to managing this information and allows you to:
Create and maintain term types and add attributes to them
Create new terms and import delimited lists of terms
Establish relationships between terms and term types
Review terms and term types in the Glossary window
Assign terms to SAS Information Catalog assets
Search for terms in the Search field
A typical hierarchy of business terms in the SAS Viya glossary is shown in the image below:
With solutions like the above, you can bring control and governance to your data. The key thing about having a lot of data is finding the right way to make it work for you. Unless you’re a data engineer like me who does this for the kicks, in the real world there is always a use case to implement and a business goal to accomplish. Data without access, quality, and governance is just idle ones and zeroes. No matter where you collect, store, and maintain your data, you will always need tools to manage it in a controlled and governed manner. With established control and monitoring procedures for your data lake, there is less chance of it gradually becoming a data dump. A solid and governed data foundation lets you both sleep better at night and get from data to value faster!
Learn more about SAS and Databricks
Harness the analytical power of your Databricks platform with SAS
Data everywhere and anyhow! Gain insights from across the clouds with SAS
Elevated efficiency and reduced cost: SAS in the era of Cloud Adoption
SAS and Databricks: Your Practical Guide to Data Access and Analysis
Data to Databricks? No need to recode - get your existing SAS jobs to SAS Viya in the cloud
Maximize Coding and Data Freedom with SAS, Python and Databricks
Data Brilliance Unleashed: SAS Data Quality against Databricks - Precision, Performance, Perfection
Unlock Seamless Efficiency: SAS Viya's No-Code/Low-Code Experience to Democratize Databricks
Seamless Power, Dual Brilliance: SAS Analytics and Data Management, Now Within Databricks
This text is very long. Per CDISC standards, all character variables in the datasets should have a maximum length of 200 characters. As a consequence, the text needs to be split into variables TERM1, TERM2, ..., TERMx, each 200 characters long.
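One possible approach (a minimal sketch; HAVE, LONGTEXT, and the fixed upper bound of 10 chunks are placeholder assumptions, and a production CDISC solution would usually also split at word boundaries):

/* Split a long character variable into 200-byte chunks TERM1-TERM10 */
data want;
   set have;                                        /* placeholder input data set        */
   array term{10} $200 term1-term10;                /* size the array to your maximum    */
   do _i = 1 to min(10, ceil(lengthn(longtext)/200));
      term{_i} = substrn(longtext, (_i-1)*200 + 1, 200);
   end;
   drop _i;
run;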
Hi folks, I hope you're doing well. I wanted to reach out to see if you could help me find the CAS memory utilization in SAS Viya LTS version 2022.09. I've been trying to locate information like "Memory Used", "Memory Free", "Memory Total", etc., as mentioned in the documentation. The documentation suggests that we can view this information in SAS Environment Manager under the Servers section, specifically under cas-shared-default and then CAS Memory Utilization. However, I couldn't find the CAS Memory Utilization under cas-shared-default. Do you happen to know of any other method or location where we can view the CAS memory utilization in SAS Viya LTS 2022.09? Any guidance or assistance you can provide on this matter would be greatly appreciated. Thank you in advance for your help! Thanks, Hemanth MG
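Not a direct answer to the Environment Manager question, but as a possible programmatic alternative: the builtins.serverStatus CAS action reports server- and node-level status, which can be a quick way to see what the CAS controller and workers are doing. Whether it surfaces the exact "Memory Used/Free/Total" figures depends on your release, so treat this as a sketch to experiment with:

/* Query CAS server and node status (assumes the default CAS session can be started) */
cas;                                  /* start or reuse the default CAS session */
proc cas;
   builtins.serverStatus result=r;    /* server- and node-level status tables   */
   print r;
run;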
Hi, I have a file uploaded to a big data platform showing the correct decimal places, but somehow when our clients view the same file via ODBC, the decimal places change from 4 to 2. How did that happen? Do I need to set up a schema for exposing the data via ODBC, or are there specific settings required to deal with this?
In linear regression, to find the best-fitting line y = mx + b, we look for the line that minimizes the difference between the actual values and the predicted values.
https://ko.wikipedia.org/wiki/%EC%84%A0%ED%98%95_%ED%9A%8C%EA%B7%80
To find the optimal line, we need goodness-of-fit measures for the regression model.
These measures are based on the decomposition SST = SSR + SSE.
https://www.researchgate.net/figure/Visualization-of-SSE-SSR-SST_fig17_322398615
■ SST (Total Sum of Squares): total variation
SST is the sum of the squared deviations of the individual y values: for each observation, subtract the mean of the observations and square the difference, then add up the results.
It represents the variability of the data as a whole, that is, the total amount of variation in the entire data set.
It tells us how far each data value lies from the mean.
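In symbols, with y_i the observed values and \bar{y} their mean:

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2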
■ SSR (Sum of Squares due to Regression): regression sum of squares
SSR is the regression sum of squares: the sum of the squared differences between the predicted values (y hat) and the mean of the observed values (y).
It represents the variability captured by the line, the part of the variation that the analysis can explain.
It tells us how well the regression line explains the variation in the data.
The predicted value (y hat) is the value predicted by the regression model.
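In symbols, with \hat{y}_i the predicted value for observation i:

SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2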
■ SSE (Sum of Squared Errors): error sum of squares
This quantity is also called the residual sum of squares, RSS (Residual Sum of Squares).
It is the sum of the squared residuals, where a residual is the difference between the actual observed value (y) and the predicted value.
Here y denotes the actual data value and y hat denotes the value predicted by the simple linear regression equation y = mx + b.
In other words, RSS measures the discrepancy between the values estimated by the regression equation and the actually observed values.
The predictions will usually differ somewhat from the observations, and SSE captures the part of that variation that the regression equation cannot explain.
It represents the variability due to error, and the smaller this value, the better the model.
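In symbols, together with the decomposition mentioned above:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SST = SSR + SSE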
■ R^2 (R Square)
R Square is the coefficient of determination, a value used in regression analysis as an evaluation measure of how well the regression model performs.
It shows how well the independent variable explains the dependent variable in the regression model.
The higher the coefficient of determination, the better the independent variable explains the dependent variable.
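In symbols:

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}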
Using the CLASS data set in the SASHELP library, we will look for the best-fitting line in a simple linear regression example.
We perform a simple linear regression using weight (Weight) and height (Height).
proc reg data=sashelp.class;
model Weight = Height;
run;
quit;
We estimate a regression model with the dependent variable Weight and the independent variable Height.
In other words, we fit a linear equation that predicts how weight changes with height.
The regression equation obtained from the Parameter Estimates table is approximately: Weight = -143.03 + 3.90 × Height.
This means that for each 1-unit increase in Height, Weight increases by about 3.90.
The coefficient of determination (R-Square) is 0.7705, meaning the model explains 77.05% of the variation in the data.
The closer this value is to 1, the better the model explains the data, so this model can be said to have high explanatory power.
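If you want these numbers as data sets rather than just printed output (a small optional addition, not part of the original example), PROC REG can capture the relevant ODS tables:

/* Capture the fit statistics (including R-Square) and parameter estimates as data sets */
proc reg data=sashelp.class outest=est;
   model Weight = Height;
   ods output FitStatistics=fit ParameterEstimates=pe;
run;
quit;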