T-SQL Merge Statement- How to subset the target

Good day everybody,

This post won’t cover how a T-SQL MERGE statement works; instead, we are going to talk about a specific real-life scenario. The question I’ll try to answer is: how do you subset the TARGET in a T-SQL MERGE statement?

A MERGE statement is very commonly used to apply a CDC (Change Data Capture) process. As part of this process we may find tables with millions of records that, for obvious reasons, we want to load incrementally. To do an incremental load I do the following: use a data processing tool (I use ETL-SSIS), dynamically filter the source query, and use a stored procedure to MERGE the records (the CDC process).

In a perfect world, my audit columns (created_date, last_updated_date) exist in my source table and work as expected. Real life is tougher: as you may know already, the audit columns are not getting updated when the attributes change, and that changes the way we write the MERGE statement. We now need to incrementally load the data for a 40-day window, something like this:

SELECT Column1, Column2, ..
FROM SourceTable
WHERE PrimaryKeyDate > GETDATE() - 40

Then we load these records into a staging table and use the staging table to apply a CDC process against our destination table. In this case a MERGE statement with the following structure will handle the inserts, updates, and deletes:

MERGE TargetTable AS TARGET
USING (SELECT Column1, Column2... FROM StagingTable) AS SOURCE
    ON  TARGET.PrimaryKey = SOURCE.PrimaryKey
    AND TARGET.PrimaryKeyDate = SOURCE.PrimaryKeyDate --In this case, one of the date fields is part of the Primary Key
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Column1, Column2...)
    VALUES (SOURCE.Column1, SOURCE.Column2...)
WHEN MATCHED THEN
    UPDATE SET TARGET.Column2 = SOURCE.Column2...
WHEN NOT MATCHED BY SOURCE THEN
    UPDATE SET TARGET.Flag = 'Deleted' --soft delete

The issue appears when the MERGE statement compares the SOURCE (with 40 days of data) against the TARGET (all previously loaded records). Because of the way the WHEN NOT MATCHED BY SOURCE branch works, the MERGE statement will delete (in my example, soft delete) every record in the target that is not part of the 40-day window.

So, a quick fix for this was to compare apples to apples: if I’m comparing 40 days of data, I need the target subset to cover the same 40 days of data. To subset the TARGET in a MERGE statement you can use a CTE (Common Table Expression); this is how I did it…


;WITH CTEtargetSubset AS (SELECT * FROM TargetTable WHERE PrimaryKeyDate > GETDATE() - 40)

MERGE CTEtargetSubset AS TARGET
USING .... --exactly the same code explained above
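
Putting the whole pattern together, here is a minimal end-to-end sketch; the table and column names (TargetTable, StagingTable, PrimaryKey, PrimaryKeyDate, Column1, Flag) are placeholders for illustration, not the real schema:

WITH CTEtargetSubset AS
(
    SELECT PrimaryKey, PrimaryKeyDate, Column1, Flag
    FROM   TargetTable
    WHERE  PrimaryKeyDate > GETDATE() - 40            --same 40-day window as the source query
)
MERGE CTEtargetSubset AS TARGET
USING (
    SELECT PrimaryKey, PrimaryKeyDate, Column1
    FROM   StagingTable                               --already filtered to the 40-day window by the ETL
) AS SOURCE
    ON  TARGET.PrimaryKey     = SOURCE.PrimaryKey
    AND TARGET.PrimaryKeyDate = SOURCE.PrimaryKeyDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PrimaryKey, PrimaryKeyDate, Column1, Flag)
    VALUES (SOURCE.PrimaryKey, SOURCE.PrimaryKeyDate, SOURCE.Column1, 'Active')
WHEN MATCHED AND TARGET.Column1 <> SOURCE.Column1 THEN
    UPDATE SET TARGET.Column1 = SOURCE.Column1
WHEN NOT MATCHED BY SOURCE THEN
    UPDATE SET TARGET.Flag = 'Deleted';               --soft delete, now limited to the windowed rows

Because the MERGE writes through the CTE, the soft-delete branch only ever sees (and flags) rows inside the same 40-day window.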

And it worked perfectly. What do you think? I’m pretty sure there are many workarounds and alternatives; if you have better suggestions please comment below.



Post-Deployment scripts: Conditional Logic using SQLCMD and :r command (NOEXEC approach)

Hey everybody,

The other day we were trying to customize one of our post-deployment scripts by adding conditional statements and boolean/logical variables. The purpose of this customization was to give the stakeholders/DBAs the flexibility to turn a specific piece of functionality (an additional trigger) ON or OFF. The additional functionality is added by calling/parsing extra T-SQL statements from a specified file into the statement cache (the :r command).

The easiest way to do this was to implement an IF statement and call the additionalFunctionality.sql file if the functionality variable was ON.

It looked something like this:

:setvar functionalityOn true

--Post Deployment Code is here

IF ('$(functionalityOn)' = 'true')
BEGIN
   PRINT 'Installing Additional Functionality'
   :r additionalFunctionality.sql
END

Unfortunately, this code throws an error:

Incorrect syntax near 'GO'

In our case, the additionalFunctionality.sql file contains multiple GO statements, line breaks, semicolons, and the DDL statements that we require. The error occurs because :r simply imports the script inline, so the GO statements end up nested inside the IF statement. As you may know already, GO is a batch separator, not a T-SQL statement, so it breaks the syntax of the BEGIN and END block and causes the error.

I found a few alternatives/workarounds; however, I did not want to modify the additionalFunctionality.sql file, so the best alternative was to use the SET NOEXEC statement.

It was very simple and worked like a charm

:setvar functionalityOn true

--Post Deployment Code is here

IF ('$(functionalityOn)' <> 'true')
    SET NOEXEC ON              --skip the statements that follow when the functionality is turned off

:r additionalFunctionality.sql --this will not be executed if SET NOEXEC is ON
SET NOEXEC OFF                 --setting back the default value to OFF

Let me explain. As you can see, I just reversed the logic a little bit. Setting NOEXEC to ON makes SQL Server validate (compile) the SQL statements that follow but not execute them. If the variable is true, the NOEXEC setting is never switched to ON, so the :r command runs as usual. It is very important to turn this setting OFF again at the end; if you don’t, none of the SQL statements that follow in the script will be executed.
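
If you want to see the NOEXEC behavior in isolation, here is a minimal sketch you can paste into SSMS (the PRINT messages are just for illustration); only the last PRINT produces output:

SET NOEXEC ON
GO
PRINT 'Compiled, but not executed while NOEXEC is ON'
GO
SET NOEXEC OFF
GO
PRINT 'Executed normally again'
GO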

This is a simple workaround to add conditional statements to your post-deployment scripts when they need to call SQL files via the :r command.

Let me know if there is a better alternative. I’m using Visual Studio Data Tools 2013 and SQL Server 2012 Developer Edition.


Cannot open PySpark Shell in CentOS


Last week I installed a Cloudera Distribution of Hadoop (Express) on a CentOS 7.0-based Azure VM (Standard D11 v2); I will post the details and steps of how I did this in the future. After I turned on all the services (including Spark) using Cloudera Manager, I executed the following command to open the PySpark shell:

 PYSPARK_DRIVER_PYTHON=ipython pyspark 

However, I got the following ClassNotFound Exception

 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/launcher/Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.launcher.Main.  Program will exit.

This is clearly telling us that our Java version is not the version PySpark expects, so I ran the following:

echo $JAVA_HOME 

According to this, I have ../jdk1.6.0_45, but PySpark expects at least Java 7, which I already have installed on this host. To quickly fix this we just need to modify the JAVA_HOME environment variable (you may want to add the export to your shell profile, because JAVA_HOME will be reset once we log off). For now I’ll temporarily update the JAVA_HOME variable:

  export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera 

If you run PySpark Shell again,

 PYSPARK_DRIVER_PYTHON=ipython pyspark 

you’ll get a different message. This time the shell actually tries to start; however, I was still struggling with a permissions issue, getting the following:

 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=myuser, access=WRITE, inode="/user":admin:admin:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242)
        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265)

This can easily be fixed by granting the user the required permissions on the HDFS /user directory, or by running the command with elevated privileges via sudo; I did the latter:

sudo PYSPARK_DRIVER_PYTHON=ipython pyspark 


Finally, I was ready to play with PySpark, RDDs, and temp tables. 🙂 Let me know what you think or if you ran into a different problem.


SQL Error: Arithmetic overflow error converting nvarchar to data type numeric

You’ve probably seen the error from the title before: Arithmetic overflow error converting nvarchar to data type numeric.

There are a few causes for this error, but basically it is telling us that the value we are trying to convert is bigger than (overflows) the data type we defined for that attribute.

Let’s ignore the requirements for now; for some reason we’re storing numbers in an nvarchar attribute.

Next, let’s explain the problem. We have a source column/attribute and a target column/attribute:

Source: an nvarchar(40) that stores numbers with long decimals, e.g. ‘699.934239999999999999999999999999999999’

Target: an nvarchar(40) that stores numbers with only 4 decimals, e.g. ‘699.9342’, i.e. formatted as numeric(18,4)

So, in my mind, the solution looked like this:

1.- Convert the source value into numeric(18,4), as this was the requirement (we can store other data in this column, but if we store numbers, the client would like them formatted as numeric(18,4)).

2.- Convert it back to nvarchar(40), already cleaned up and showing only 4 decimals.
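
For a short, well-behaved value the two-step plan works just fine (the value below is only an example):

declare @short nvarchar(40) = '699.9342'
select CONVERT(nvarchar(40), CONVERT(numeric(18,4), @short)) --returns '699.9342'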

Easy right? WRONG!

Now, let’s replicate the error. The problem I was facing was that my code looked like this:

declare @source nvarchar(40) = '699.934239999999999999999999999999999999'
select @source
select CONVERT(nvarchar(40), CONVERT(numeric(22,4), @source))

The error comes up:

Arithmetic overflow error converting nvarchar to data type numeric

Basically SQL Server is trying to squeeze the 699.9342…… into the target data type, and as we can see it does not do it automatically, so we need to work around this and make it work.

Googling a little bit, I found the SQL function STR, which converts a float to a character string, letting you define the total length of the result and the number of decimals you’d like to keep.
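
A quick illustration of what STR does (the values are just examples): the result is right-justified to the requested length and rounded to the requested number of decimals.

select STR(123.456789, 10, 2) --returns '    123.46'
select STR(123.456789, 10, 4) --returns '  123.4568'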

Finally, I came up with the following:

declare @source nvarchar(40) = '699.934239999999999999999999999999999999'
select @source
select ltrim(CONVERT(nvarchar(40), STR(cast(@source as float), 22, 4)))

It worked; however, I had to add the LTRIM function to strip the extra spaces on the left, since STR returns a right-justified CHAR result.

So, what do you think? Are there any other alternatives to this? Let me know your thoughts…



STR (Transact-SQL)