Brief Background
When working on ETL design, it is quite common for data to come from various sources, such as SFTP, email attachments, an online web portal, etc. Usually in this case I design one package per data source, and use a parent package to orchestrate the overall project. However, there are cases where the client does this task themselves; that is, the client extracts data from the various sources and combines everything into one single file. It sounds good, but sometimes it leads to trouble: for example, there is likely no guarantee about the data quality.
In one of my projects, I was asked to create a customer pool for a marketing campaign. The client extracted data from various sources (the billing system, the sales & marketing department, a 3rd-party data broker) and generated one big flat file on a weekly basis.
The nightmare is data quality: some rows could have 4 columns, some rows could have 6 columns... So I wrote the package below to solve this problem.
Actual Design
First, let's look at what the problem is. I have an SSIS package open, and within the package I have a Flat File Connection Manager that points to a sample file.
Now I set up the columns as normal.
And now I add a Data Flow Task, and within the Data Flow Task I add some components, as shown below.
Quite straightforward, isn't it? Now I run it and get an error:
Opening the sample file in Excel, we can see that the inconsistent number of columns is the actual reason for the error.
Well, there are plenty of ways to handle this type of issue:
- Ask the client to fix the data problem and control the data quality. As everyone knows, the best place to fix data quality is at the data source; unfortunately, in my case, that was too good to be true.
- Use some .NET code in a Script Task to handle it. This works and should not be a problem.
First, let's change the settings in the Flat File Connection Manager. Because we want the file to be treated as a single-column file, I select "$$$" as the column delimiter, a string that never appears in the data, so each row is read as one column.
Now, to avoid truncation errors, I change the column length to 8000.
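To make the idea concrete outside of SSIS, here is a minimal Python sketch of what this connection manager setup achieves: because the chosen delimiter never occurs in the data, every line arrives as one single column. The file path and the 8000-character limit mirror the settings above; the function name is my own for illustration.

```python
def read_single_column(path, max_len=8000):
    """Read a flat file so that each line becomes one single-column row,
    mimicking a Flat File Connection Manager whose delimiter ("$$$")
    never appears in the data."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            value = line.rstrip("\n")
            # SSIS would raise a truncation error past the column width;
            # here we enforce the same 8000-character limit.
            assert len(value) <= max_len, "row exceeds column width"
            rows.append(value)
    return rows
```

With this in place, the ragged rows no longer break the reader; the column count is sorted out downstream instead.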
Then the Data Flow Task layout looks like below. The Derived Column component will read the number of delimiters in each row, and the Conditional Split component will direct the stream based on this newly created derived column.
In the Derived Column component, I write two expressions. The first one reads the number of tokens using the TOKENCOUNT function, which is new in SSIS 2012. If you are using SSIS 2008, the second expression will work for you; it reads the number of delimiters instead.
SSIS 2012: TOKENCOUNT([Name|LogDate|Phone|OptIn],"|")
SSIS 2008: LEN([Name|LogDate|Phone|OptIn]) - LEN(REPLACE([Name|LogDate|Phone|OptIn],"|",""))
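For readers who want to verify the relationship between the two expressions, here is a rough Python equivalent (the function names are mine). TOKENCOUNT returns the number of tokens, while the LEN/REPLACE trick returns the number of delimiters, i.e. tokens minus one for rows with no empty fields. One caveat: SSIS's TOKENCOUNT skips empty tokens (consecutive delimiters), so the two measures can diverge on rows with empty fields; the sketch below uses a plain split.

```python
def token_count(row, delimiter="|"):
    # Rough equivalent of TOKENCOUNT([Name|LogDate|Phone|OptIn], "|").
    # Note: real SSIS TOKENCOUNT ignores empty tokens; a plain split
    # counts them, so results differ on rows with empty fields.
    return len(row.split(delimiter))

def delimiter_count(row, delimiter="|"):
    # Equivalent of LEN(col) - LEN(REPLACE(col, "|", "")):
    # the drop in length after removing the delimiter is the
    # number of delimiter characters in the row.
    return len(row) - len(row.replace(delimiter, ""))
```

For a well-formed 4-column row such as `a|b|c|d`, `token_count` returns 4 and `delimiter_count` returns 3, which is why either expression can drive the split.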
In the Conditional Split component, I define three outputs: a correct output that can be loaded directly, a dynamic output that can be loaded by a dynamic reader, and an output that needs to be reviewed manually.
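The routing logic of the Conditional Split can be sketched in Python as follows. The expected delimiter count of 3 (a 4-column row has 3 pipes) and the bucket names are my own assumptions for illustration; adjust them to match your file layout.

```python
def split_rows(rows, delimiter="|", expected=3):
    """Route each single-column row into one of three outputs by its
    delimiter count, mimicking the Conditional Split component."""
    outputs = {"correct": [], "dynamic": [], "manual_review": []}
    for row in rows:
        # Same count as LEN(col) - LEN(REPLACE(col, "|", "")).
        n = len(row) - len(row.replace(delimiter, ""))
        if n == expected:
            outputs["correct"].append(row)        # load directly
        elif n > expected:
            outputs["dynamic"].append(row)        # hand to a dynamic reader
        else:
            outputs["manual_review"].append(row)  # too few columns: review by hand
    return outputs
```

Rows with exactly the expected column count flow straight through, wider rows go to the dynamic path, and short rows are parked for manual review.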
Now run the package again; it does exactly what I need :D