Thoughts of a Crime Show Junkie: Inadmissible Evidence
Recent Library Articles
Recently in the SAS Community Library: SAS' @RhondaWilliams reveals how SAS Law Enforcement Intelligence helps law enforcement agencies expedite detailed data entry using the Evidence Creation feature.
Hi guys,
Suppose I have the following:
data DB1;
input ID $ Discharge :date9.;
format Discharge date9.;
cards;
0001 19JUN2017
0001 07SEP2020
0002 17MAR2016
0003 05MAY2016
0003 08FEB2017
0004 22MAR2017
0004 03MAY2017
0004 28MAR2021
;
data DB2;
input ID $ Discharge :date9. Flag NewDate :date9.;
format Discharge NewDate date9.;
cards;
0001 19JUN2017 0 .
0001 07SEP2020 1 08SEP2020
0002 17MAR2016 1 18MAR2016
0003 05MAY2016 0 .
0003 08FEB2017 1 09FEB2017
0004 22MAR2017 0 .
0004 03MAY2017 0 .
0004 28MAR2021 1 29MAR2021
;
Is there a way to add a flag to DB1 that marks, for each ID, the latest Discharge date (if there is only one date, as for ID 0002, that date is the one flagged), and then add a new date equal to that last date plus one day? Desired output: DB2.
Thank you in advance
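One possible approach (a hedged sketch, not from the original thread) uses BY-group processing, assuming DB1 can be sorted by ID and Discharge:
/* Hedged sketch: flag the latest Discharge per ID and set NewDate = Discharge + 1 */
proc sort data=DB1;
   by ID Discharge;
run;
data DB2;
   set DB1;
   by ID;
   Flag = (last.ID);                       /* 1 on the latest record for each ID */
   if Flag then NewDate = Discharge + 1;   /* SAS dates are day counts, so +1 adds one day */
   format NewDate date9.;
run;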
Hi all,
At https://developer.sas.com/apis/rest/v3.5/#filters there is supposed to be a link that explains filter expressions:
For a complete description of filter expressions, see the Filtering reference.
However, when clicking this link (on the word "Filtering") I do not see any such page. It looks like this "got lost" during the switch to the new developers.sas.com web site.
Does anyone know where I can find the documentation about API filter expressions?
Document embeddings (or vectors, as the fashionable like to say) have emerged as a popular area due to the focus on Generative AI. Visual Text Analytics, a SAS Viya offering that provides Natural Language Processing (NLP) capabilities, includes an option to train embeddings through the Topics node, backed by the Singular Value Decomposition algorithm. I encourage you to refer here for a detailed discussion of topics.
The purpose of this article is to highlight a sometimes overlooked task when applying document embeddings to similarity-based search: normalisation of the vectors, which helps you obtain relevant matches.
Why is this important?
First, let's consider vector embeddings. Simply put, these are numerical representations of the text contained within a document. Represented as a series of columns in a table, each column refers to some feature (also known as a dimension) of the source document, and together, these columns represent the document as a whole.
Why do we need to transform a document into embeddings in the first place? Text content can be represented in multiple styles and forms, making it hard to organise, classify and analyse. Motivations for embedding documents include the following:
data standardisation - similar terms are packed as close numbers within dimensions rather than being treated as distinct units
feature engineering - data is organised under different dimensions each of which may carry different meaning
transformation for downstream applications such as analytics and machine learning, for which numbers are more amenable
masking - data is no longer represented as readable text, but as numerical proxies
Now, let's consider the definition of a vector. In mathematics, a vector is a quantity that has both magnitude (length) and direction. Therefore, it isn't just one number (which would make it a scalar) but a set of numbers, one for each dimension.
This is an extremely useful property, since it allows for operations which measure how similar two documents are based on the distance between their vectors. Let's take a simple case involving two-dimensional vectors.
Yes, I know. Poor William's turning over somewhere in Stratford-upon-Avon, but that's the price you pay for fame.
The image above shows vectors for two documents depicted in two-dimensional space. Given their coordinate points, vectors enable calculation of the distance between the embeddings, a simple and common measure being the Euclidean distance. Here, it works out to 1.414 (approximately the square root of 2). As the graph also shows, the vector distance can be viewed as the deviation in direction between the two vectors. A low value indicates that the two documents are more or less similar, which seems to be the case here, albeit to the horror of purists.
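As a quick, hedged illustration (using the coordinate values (3, 9) and (4, 8) that also appear in the table further below), the same distance can be reproduced in a short DATA step:
/* Hedged illustration: Euclidean distance between the two example vectors */
data _null_;
   x1 = 3; y1 = 9;   /* Text 1 */
   x2 = 4; y2 = 8;   /* Text 2 */
   distance = sqrt((x2 - x1)**2 + (y2 - y1)**2);
   put distance=;    /* prints 1.4142... */
run;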
However, the utility of the above measure is limited! The reason is that this distance is highly vulnerable to scaling differences which may have been introduced during the embedding training process. Note that embeddings could originate from different sources, and we cannot take their representation as standard. This also affects how we interpret any derived distance measure. Is 1.414 small (indicating similarity) or large (indicating divergence)? I'll never know until I use a standard, which is achieved through a process known as normalisation.
So, what should I do?
The principle behind vector normalisation is intuitive. Let's consider the same example again.
Let's introduce the unit vector. The unit vector refers to the vector values within the small green box bounded by (1,1). A unit vector is defined as a vector with a magnitude of 1. A magnitude, simply expressed, refers to the length of a vector. Recalling Pythagoras, who used to haunt our geometry books, this can be calculated using the formula for the hypotenuse of a right-angled triangle, namely:
magnitude = square root(sum of the squares of the vector's components)
Another name for the magnitude is norm, hence the term normalising the vector. To arrive at a normalised value, you simply divide the individual vector values by the magnitude. The resultant vector is a unit vector, which acts as a standard for carrying out similarity search and other vector-based operations.
In our simple example, the unit vectors work out to:
Document   Dimension 1           Dimension 2
Text 1     3 / square root(90)   9 / square root(90)
Text 2     4 / square root(80)   8 / square root(80)
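If you'd like to verify the arithmetic, a minimal DATA step sketch (my own illustration, using the same example values) computes the magnitudes and unit vectors:
/* Hedged sketch: normalise the two example vectors to unit length */
data unit_vectors;
   input document $ dim1 dim2;
   magnitude = sqrt(dim1**2 + dim2**2);   /* the vector's norm */
   unit1 = dim1 / magnitude;
   unit2 = dim2 / magnitude;
   datalines;
Text1 3 9
Text2 4 8
;
run;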
Do it for me, please?
Gladly. Please refer here for a SAS program which takes in an input table (or dataset) with vectors, and normalises the columns to a magnitude of 1.
The business end of this program can be found between lines 197 and 329. Notice that this program can run on both CAS and SAS (i.e. SAS 9 / SAS Compute or SAS Programming Runtime Environment) engines and uses array logic to normalise the vectors. Also worth noting is the use of the dictionary.columns table, which helps us identify all "vector" columns in the input table that conform to a given name pattern. This is highly convenient when dealing with typical vector data, which tends to run into the hundreds of columns. Imagine writing an array statement for each one of those!
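For readers who just want the gist, here is a much simplified, hedged sketch of the dictionary.columns plus array pattern described above. It is not the linked program, and it assumes an input table WORK.EMBEDDINGS whose vector columns follow a COL1, COL2, ... naming pattern:
/* Hedged sketch: find the vector columns, then normalise each row to unit length */
proc sql noprint;
   select name into :veccols separated by ' '
   from dictionary.columns
   where libname = 'WORK'
     and memname = 'EMBEDDINGS'
     and upcase(name) like 'COL%';       /* assumed name pattern */
quit;
data work.embeddings_norm;
   set work.embeddings;
   array vec{*} &veccols.;
   magnitude = 0;
   do i = 1 to dim(vec);
      magnitude + vec{i}**2;             /* sum of squares across dimensions */
   end;
   magnitude = sqrt(magnitude);
   if magnitude > 0 then
      do i = 1 to dim(vec);
         vec{i} = vec{i} / magnitude;    /* divide each dimension by the norm */
      end;
   drop i magnitude;
run;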
Give the code a whirl and let me know your feedback. You might also notice that the code has a fair amount of other program logic wrapped around it, a strong hint of my intention to also make it available as a SAS Studio Custom Step. Soon.
I want to meet you, shake your hand, and shower praise upon you.
Cash would be better. Actually, thank you, but no sweat. I'm happy to answer further questions, though. You can email me by clicking here. Glad you enjoyed it.
I am currently analyzing the impact of an intervention on medication numbers using difference-in-difference analysis, but I have encountered several problems. Following the SAS support instructions, I conducted the difference-in-difference analysis. However, I noticed a discrepancy between my results and SAS's example 61830. In the example, the value of 'Mean Estimate' in 'Contrast Estimate Results' is identical to the 'Estimate' in 'Least Squares Means Estimate'. However, in my case, these values were different. I suspect this could be due to my use of the negative binomial distribution with a log link, resulting in exponentiated values. Consequently, I am unsure whether to rely on the 'Mean Estimate' in 'Contrast Estimate Results' or the 'Estimate' in 'Least Squares Means Estimate', and how to interpret these results. Thank you very much.
Here is my result:
Contrast Estimate Results
Label          Mean Estimate   Mean Confidence Limits   L'Beta Estimate   Standard Error   Pr > ChiSq
diff in diff   1.51            1.49-1.52                0.41              0.0051           <.0001
time*intervention Least Squares Means
time   intervention   Estimate   SE        z value   Pr > |z|
1      1              0.77       0.00434   178.19    <.0001
1      0              0.03       0.00508   6.5       <.0001
0      1              0.72       0.00408   177.71    <.0001
0      0              0.40       0.00426   93.11     <.0001
Least Squares Means Estimate
Effect              Label          Estimate   SE        Z value   Pr > |z|
time*intervention   diff in diff   0.41       0.00509   81.01     <.0001
SAS code:
proc genmod data = ALL_DID;
class _MatchID intervention time /ref = first;
model drug_use = time intervention time*intervention/dist = negbin link=log;
repeated subject =_MatchID/type = exch;
estimate "diff in diff" time*intervention 1 -1 -1 1;
lsmeans time*intervention;
lsmestimate time*intervention "diff in diff" 1 -1 -1 1;
run;
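As a quick sanity check of the log-link suspicion, a minimal sketch relating the two estimates (using only the numbers reported above):
/* Hedged check: with a log link, the mean-scale estimate is the exponentiated linear-scale estimate */
data _null_;
   lbeta = 0.41;                  /* 'Estimate' from the lsmestimate output */
   mean_estimate = exp(lbeta);    /* exp(0.41) is approximately 1.5068      */
   put mean_estimate= 8.4;        /* matches the 1.51 'Mean Estimate' after rounding */
run;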
Hi, I'm currently merging multiple datasets and consistently using the following SQL code to verify there are no duplicate IDs at each stage of the process:
proc sql;
select count(distinct ID) as UniqueIDs, count(*) as NObs
from dataset;
quit;
This approach effectively identifies duplicates when datasets have multiple timepoints, and I make sure to select only one timepoint per ID. Throughout the merging process, my checks confirm that all IDs remain unique (e.g., 1500 observations and 1500 unique IDs). However, after the final merge, the same SQL query unexpectedly shows 1500 observations but only 1300 unique IDs. Manual verification confirms that duplicates are present, despite prior checks showing no such issues. I'm looking for insights into why these duplicates weren't detected sooner by the SQL query, or whether there's a specific merging condition I might have overlooked.
Edited to add: The SQL code above is the one that I use to check each dataset for duplicates (which it seems to find pretty well) and after each merge. Merge code (throughout and also the last one):
data merged_dataset;
merge dataset1 dataset2;
by ID;
run;
Attaching the LOG for the last merge + SQL check. The result for this was 1181 UniqueIDs and 1229 NObs, while previously I was getting 1229 UniqueIDs and 1229 NObs.
LOG for the last merge + SQL check
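For reference, a minimal diagnostic sketch (assuming the input datasets are named dataset1 and dataset2 as in the merge code above) that lists any IDs occurring more than once in either dataset before the MERGE:
/* Hedged diagnostic: duplicate BY values in either input are the usual cause */
/* of extra rows appearing only after a MERGE.                                */
proc sql;
   select 'dataset1' as source, ID, count(*) as NObs
   from dataset1
   group by ID
   having count(*) > 1;
   select 'dataset2' as source, ID, count(*) as NObs
   from dataset2
   group by ID
   having count(*) > 1;
quit;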