T-SQL MERGE Statement: How to Subset the Target

Good day everybody,

This post won’t cover how a T-SQL MERGE statement works; instead, we are going to talk about a specific real-life scenario: how do we subset the TARGET in a T-SQL MERGE statement?

A MERGE statement is very commonly used to apply a CDC (Change Data Capture) process. As part of this process we may find tables with millions of records that, for obvious reasons, we want to load incrementally into our destination table using some kind of data processing tool (I use ETL-SSIS) and a stored procedure that MERGEs these records (the CDC process).

So far so good; it looks like a normal scenario. Assuming we have a last-updated-date audit column and a created-date column in our source table, we can apply filters to our source query and incrementally load records. Now, I’d like to add that there was a specific requirement that changed the way we do the MERGE statement: for various reasons (e.g. audit columns not getting updated when the attributes change) we need to incrementally load the data for a window of 40 days, something like this:

 
SELECT Column1, Column2, ...
FROM SourceTable
WHERE PrimaryKeyDate > GETDATE() - 40

Then, we load these records into a staging table. We use the staging table to apply a CDC process against our destination table. In this case, a MERGE statement with the following structure will handle the inserts, updates, and deletes.

 
MERGE INTO TargetTable AS TARGET
USING (SELECT Column1, Column2...
       FROM SourceTable) AS SOURCE
   ON TARGET.PrimaryKey = SOURCE.PrimaryKey
  AND TARGET.PrimaryKeyDate = SOURCE.PrimaryKeyDate --In this case, one of the date fields is part of the Primary Key

WHEN NOT MATCHED BY TARGET
  THEN
    INSERT (Column1,
            Column2...)
    VALUES (SOURCE.Column1,
            SOURCE.Column2,
            ....
           )

WHEN MATCHED AND (TARGET.Hash_Column <> SOURCE.Hash_Column)
  THEN
    UPDATE SET
         TARGET.Column1 = SOURCE.Column1
        ,TARGET.Column2 = SOURCE.Column2
        ,....

WHEN NOT MATCHED BY SOURCE
  THEN
    UPDATE SET TARGET.Flag = 'Deleted'; --a MERGE statement must be terminated with a semicolon

The issue appears when the MERGE statement compares the SOURCE (with 40 days of data) against the TARGET (all previously loaded records). Because of the way the WHEN NOT MATCHED BY SOURCE clause works, the MERGE statement will delete (in my example, soft delete) all the records in the target that are not part of the 40-day window.
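To see the problem in isolation, here is a minimal repro sketch using hypothetical table variables and columns (Id, KeyDate, Flag), purely for illustration:

DECLARE @Target TABLE (Id INT, KeyDate DATE, Flag VARCHAR(10));
DECLARE @Source TABLE (Id INT, KeyDate DATE);

--The target already holds a 60-day-old row and a 10-day-old row
INSERT INTO @Target VALUES (1, DATEADD(DAY, -60, GETDATE()), 'Active'),
                           (2, DATEADD(DAY, -10, GETDATE()), 'Active');

--The staged source only covers the last 40 days, so row 1 is missing from it
INSERT INTO @Source VALUES (2, DATEADD(DAY, -10, GETDATE()));

MERGE INTO @Target AS TARGET
USING @Source AS SOURCE
   ON TARGET.Id = SOURCE.Id
WHEN NOT MATCHED BY SOURCE
  THEN UPDATE SET TARGET.Flag = 'Deleted';

--Row 1 is now flagged 'Deleted', even though it was only outside the 40-day window
SELECT * FROM @Target;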

So, a quick fix was to compare apples to apples: if I’m comparing 40 days of data from the source, I need the target to be subset to the same 40 days. To subset the TARGET in a MERGE statement you can use a CTE (Common Table Expression); this is how I did it…

 

;WITH CTEtargetSubset AS
(
    SELECT * FROM TargetTable WHERE PrimaryKeyDate > GETDATE() - 40
)
MERGE INTO CTEtargetSubset AS TARGET
USING .... --exactly the same code explained above
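For completeness, here is a sketch of the full pattern with the CTE in place (same placeholder table and column names as above; adjust to your own schema):

;WITH CTEtargetSubset AS
(
    SELECT * FROM TargetTable WHERE PrimaryKeyDate > GETDATE() - 40
)
MERGE INTO CTEtargetSubset AS TARGET
USING (SELECT PrimaryKey, PrimaryKeyDate, Column1, Column2, Hash_Column
       FROM SourceTable
       WHERE PrimaryKeyDate > GETDATE() - 40) AS SOURCE
   ON TARGET.PrimaryKey = SOURCE.PrimaryKey
  AND TARGET.PrimaryKeyDate = SOURCE.PrimaryKeyDate
WHEN NOT MATCHED BY TARGET
  THEN INSERT (PrimaryKey, PrimaryKeyDate, Column1, Column2, Hash_Column)
       VALUES (SOURCE.PrimaryKey, SOURCE.PrimaryKeyDate, SOURCE.Column1, SOURCE.Column2, SOURCE.Hash_Column)
WHEN MATCHED AND TARGET.Hash_Column <> SOURCE.Hash_Column
  THEN UPDATE SET TARGET.Column1 = SOURCE.Column1,
                  TARGET.Column2 = SOURCE.Column2,
                  TARGET.Hash_Column = SOURCE.Hash_Column
WHEN NOT MATCHED BY SOURCE --now "not matched" only applies to target rows inside the 40-day window
  THEN UPDATE SET TARGET.Flag = 'Deleted';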

And it worked perfectly. What do you think? I’m pretty sure there are many workarounds and alternatives; if you have better suggestions, please comment below.

Cheers.

Notes:
Use Caution with SQL Server Merge Statement

Post-Deployment scripts: Conditional Logic using SQLCMD and :r command (NOEXEC approach)

Hey everybody,

The other day we were trying to customize one of our post-deployment scripts by adding conditional statements and boolean/logical variables. The purpose of this customization was to give the stakeholders/DBAs the flexibility to turn a specific piece of functionality (an additional trigger) ON or OFF. This additional functionality is added by parsing extra T-SQL statements from a specified file into the statement cache (the :r command).

The easiest way to do this was to implement an IF statement and call the additionalFunctionality.sql file if the functionality variable was ON. It looked something like this:

:setvar functionalityOn true

--Post Deployment Code is here

IF ('$(functionalityOn)' = 'true')
  begin
   print 'Installing Additional Functionality'
   :r additionalFunctionality.sql
  end

Unfortunately, this code throws an error:

Incorrect syntax near 'GO'

In our case, the additionalFunctionality.sql file contains multiple GO statements, batch breaks, semicolons, and DDL statements that we require. The error occurs because :r simply imports the script inline, so the GO statements end up nested inside the IF statement. As you may already know, GO is a batch separator, not T-SQL; it terminates the batch in the middle of the BEGIN...END block, which breaks the syntax and causes the error.
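To make that concrete, suppose additionalFunctionality.sql contains something like this (hypothetical contents, purely for illustration):

--additionalFunctionality.sql (hypothetical): several GO-separated batches
CREATE TABLE dbo.AuditLog
(
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    LoggedAt DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
GO
--CREATE TRIGGER must be the first statement in its batch, hence the GO above
CREATE TRIGGER dbo.trAuditLog ON dbo.AuditLog AFTER INSERT
AS
    PRINT 'Row inserted';
GO

Once :r pastes those lines between BEGIN and END, the GO separators split the IF block across batches, which is exactly what breaks the syntax.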

I found a few alternatives/workarounds; however, I did not want to modify the additionalFunctionality.sql file, so the best option was to use the SET NOEXEC statement.

It was very simple and worked like a charm:

:setvar functionalityOn true

--Post Deployment Code is here

IF ('$(functionalityOn)' <> 'true')
  SET NOEXEC ON
   
:r additionalFunctionality.sql --this will not be executed if SET NOEXEC is ON
SET NOEXEC OFF                 --setting back the default value to OFF
 

Let me explain: as you can see, I just reversed the logic a little bit. When NOEXEC is ON, SQL Server validates the syntax of the statements that follow but does not execute them. So if the variable is true, NOEXEC is never switched ON and the :r script runs normally; if it is not true, NOEXEC is switched ON and the imported script is compiled but skipped. It is very important to turn this setting OFF again at the end; if you don’t, none of the SQL statements that follow in the script will be executed.
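If you want to see the NOEXEC behavior on its own, here is a minimal sketch (run it as a script in SSMS or SQLCMD):

SET NOEXEC ON
GO
PRINT 'Compiled but not executed while NOEXEC is ON'  --produces no output
GO
SET NOEXEC OFF  --SET NOEXEC OFF is still honored, so execution resumes
GO
PRINT 'This one runs'  --this is the only output you see
GO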

This is a simple workaround for adding conditional logic to post-deployment scripts that need to call SQL files via the :r command.

Let me know if there is a better alternative. I’m using SQL Server Data Tools in Visual Studio 2013 and SQL Server 2012 Developer Edition.

Cheers,
Saul

Cannot open PySpark Shell in CentOS

Hey,

Last week I installed Cloudera Distribution Hadoop (Express) on a CentOS 7.0 Azure VM (Standard D11 v2); I will post the details and steps of how I did this in the future. After I turned on all the services (including Spark) using Cloudera Manager, I executed the following command to open the PySpark shell:

 PYSPARK_DRIVER_PYTHON=ipython pyspark 

However, I got the following ClassNotFoundException:

 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/launcher/Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.launcher.Main.  Program will exit.

This is telling us that our Java version is not the version PySpark expects, so I ran the following:

echo $JAVA_HOME 

According to this, I have ../jdk1.6.0_45, but PySpark expects at least Java 7, which I already have installed on this host. To quickly fix this we just need to point the JAVA_HOME environment variable at the newer JDK (you may want to add the export to your shell profile, since JAVA_HOME will be reset once you log off). For now I’ll temporarily update the JAVA_HOME variable:

  export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera 

If you run PySpark Shell again,

 PYSPARK_DRIVER_PYTHON=ipython pyspark 

you’ll get a different message; this time the shell does try to start, however I was still struggling with a permissions issue and got the following:

 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=myuser, access=WRITE, inode="/user":admin:admin:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265)

Now, this can easily be fixed by granting the user the required HDFS permissions or by using sudo to run with elevated privileges; I did the latter:

sudo PYSPARK_DRIVER_PYTHON=ipython pyspark 

(Screenshot: the PySpark welcome screen.)

Finally, I was ready to play with PySpark, RDDs, and temp tables. 🙂 Let me know what you think or if you ran into a different problem.

References:
https://spark.apache.org/docs/0.9.0/python-programming-guide.html
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

SQL Error: Arithmetic overflow error converting nvarchar to data type numeric

You’ve probably seen the following error: “Arithmetic overflow error converting nvarchar to data type numeric.”

There are a few causes for this error, but basically it is telling us that the value we are trying to convert is bigger than (overflows) the data type we defined for that attribute.

Let’s ignore the requirements for now; for some reason we’re storing numbers in an nvarchar attribute.

Next, let’s explain the problem. We have a source column/attribute and a target column/attribute:

Source: an nvarchar(40) that stores numbers with long decimals, e.g.
‘699.934239999999999999999999999999999999’

Target: an nvarchar(40) that stores numbers with only 4 decimals, e.g. ‘699.9342’ (i.e. formatted as numeric(18,4))

So, in my mind, the solution looked like this:

1.- Convert the source value into numeric(18,4), as this was the requirement (we can store other data in this column, but if we store numbers, the client would like them formatted as numeric(18,4)).

2.- Convert it back to nvarchar(40), already cleaned up and showing only 4 decimals.

Easy right? WRONG!

Now, let’s replicate the error. The problem I was facing was that my code looked like this:

DECLARE @source nvarchar(40) = '699.934239999999999999999999999999999999'
SELECT @source
SELECT CONVERT(nvarchar(40), CONVERT(numeric(22,4), @source))

The error comes up: “Arithmetic overflow error converting nvarchar to data type numeric.”

Basically, SQL Server is trying to squeeze 699.9342…… into the target data type. As we can see, this is not done automatically, so we need to work around it.

Googling a little bit, I found the SQL function STR, which converts a float to character data, letting you define the total length of the result and the number of decimals you’d like to keep.
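For reference, here is roughly how STR behaves (which also explains the LTRIM in the final query below):

--STR returns character data right-justified to the requested total length,
--so shorter numbers come back padded with leading spaces
SELECT STR(699.93424, 22, 4)          --'699.9342' padded with leading spaces to 22 characters
SELECT LTRIM(STR(699.93424, 22, 4))   --'699.9342'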

Finally, I came up with the following:

DECLARE @source nvarchar(40) = '699.934239999999999999999999999999999999'
SELECT @source
SELECT LTRIM(CONVERT(nvarchar(40), STR(CAST(@source AS float), 22, 4)))

It worked; however, I had to add the LTRIM function to remove the extra spaces on the left, given that STR returns a right-justified character result.
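One possible alternative (just a sketch for this sample value, not tested against every edge case) is to convert to float first and then to numeric, which avoids STR and the LTRIM cleanup:

DECLARE @source nvarchar(40) = '699.934239999999999999999999999999999999'
--float accepts the long literal, and numeric(18,4) then rounds it to 4 decimals
SELECT CONVERT(nvarchar(40), CONVERT(numeric(18,4), CONVERT(float, @source)))  --returns '699.9342'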

So, what do you think? Are there any other alternatives to this? Let me know your thoughts…

Cheers

References:

STR (Transact-SQL)