2016-04-19

SSIS Case Study: Optimise the Data Process

When designing a data procedure to deal with a large amount of data, the white paper “We Loaded 1TB in 30 Minutes with SSIS, and So Can You” is always a good starting point. However, I am afraid that in a real-world scenario it can be difficult to get an environment set up like that: gigabit fibre channel, dedicated physical servers, flat file data sources without data quality issues, and stable control flows/data flows without checkpoints or error handling. Well, it is simply too good to be true.

Recently we migrated all our client activities to a new dialler platform. After that, I received a request to archive/back up all business-critical data from the old dialler platform to our local database server. Because of some business decisions, we could not ask for backup files directly, so we decided to use SSIS to pull all the data we need from the remote site.

The old dialler system sits on a MySQL database hosted in a remote data centre. The connection between the remote data centre and our site is a 30MB fibre connection. The data needs to be downloaded into an archive database hosted on a virtual server. We need to avoid certain time windows, as the full backup tasks could freeze IO. Additionally, we need to download recording files as well.

The packages themselves are simple; here I just want to share some lessons we learnt from the design:


Allow Errors & Propagate = False


As in a common ETL framework, we let the job skip the error, log the error, and handle the error at a later stage. In our case we created tables like the ones below: DownloadHelper to store parameters for each small package, DownloadLog to store pre-execution and post-execution information, and ErrorLog to store the actual exceptions. When an error occurs, we can then adopt different approaches to handle it based on its severity level.
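The real definitions are wider than this, but a minimal sketch of the three tables (the column names here are illustrative, not the exact implementation) looks roughly like:

 -- Minimal sketch only; the real tables carry more columns.
 CREATE TABLE dbo.DownloadHelper
 (
     HelperId         int IDENTITY(1,1) PRIMARY KEY,
     PackageName      nvarchar(128) NOT NULL,   -- which child package uses this row
     SourceTable      nvarchar(128) NOT NULL,   -- MySQL table to pull
     DestinationTable nvarchar(128) NOT NULL,   -- archive table on our side
     IsEnabled        bit NOT NULL DEFAULT (1)
 );

 CREATE TABLE dbo.DownloadLog
 (
     LogId           int IDENTITY(1,1) PRIMARY KEY,
     PackageName     nvarchar(128) NOT NULL,
     PreExecuteTime  datetime2 NULL,            -- pre-execution information
     PostExecuteTime datetime2 NULL,            -- post-execution information
     RowsTransferred bigint NULL
 );

 CREATE TABLE dbo.ErrorLog
 (
     ErrorLogId       int IDENTITY(1,1) PRIMARY KEY,
     PackageName      nvarchar(128) NOT NULL,
     TaskName         nvarchar(128) NULL,
     ErrorCode        int NULL,
     ErrorDescription nvarchar(max) NULL,
     LoggedAt         datetime2 NOT NULL DEFAULT (SYSUTCDATETIME())
 );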






To allow the package to bypass the error, first bring up the Variable Grid Options window and check the option “Show system variables”.




Then go to the Event Handlers tab and add an OnError event handler for the selected task. In addition to logging the error, the most important setting is to change the system variable “Propagate” from True to False. After this change, the job will continue to the next step even if the task raises an error.
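Inside the OnError handler, the logging itself can be a simple Execute SQL Task writing into ErrorLog. A sketch of the statement, assuming an OLE DB connection and the illustrative ErrorLog columns above, with the SSIS system variables mapped as parameters:

 -- Parameter mapping (illustrative):
 --   0 -> System::PackageName
 --   1 -> System::SourceName        (the task that raised the error)
 --   2 -> System::ErrorCode
 --   3 -> System::ErrorDescription
 INSERT INTO dbo.ErrorLog (PackageName, TaskName, ErrorCode, ErrorDescription)
 VALUES (?, ?, ?, ?);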




Simplify Data Flow


To improve the data flow, we removed nearly all transformation components within the data flow, especially the Data Conversion component.

However, there are differences in data types between MySQL and SQL Server. To handle them, we built the destination table schema using openquery against a linked server pointing at the remote MySQL database:

 select * into [destination] from openquery([LinkServerName], 'select * from [MySQL table] limit 1')  
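The same pattern can be driven from DownloadHelper so that every archive table is created this way. The sketch below is illustrative only: the column names follow the hypothetical DownloadHelper definition above, and [LinkServerName] is the linked server for the remote MySQL box.

 DECLARE @src sysname, @dest sysname, @sql nvarchar(max);

 DECLARE table_cursor CURSOR FAST_FORWARD FOR
     SELECT SourceTable, DestinationTable
     FROM dbo.DownloadHelper
     WHERE IsEnabled = 1;

 OPEN table_cursor;
 FETCH NEXT FROM table_cursor INTO @src, @dest;

 WHILE @@FETCH_STATUS = 0
 BEGIN
     -- SELECT ... INTO derives the column types from the MySQL metadata,
     -- so no Data Conversion component is needed in the data flow.
     SET @sql = N'SELECT * INTO ' + QUOTENAME(@dest)
              + N' FROM OPENQUERY([LinkServerName], ''SELECT * FROM ' + @src + N' LIMIT 1'');'
              + N' TRUNCATE TABLE ' + QUOTENAME(@dest) + N';'; -- drop the seed row, keep the empty shell
     EXEC sys.sp_executesql @sql;

     FETCH NEXT FROM table_cursor INTO @src, @dest;
 END

 CLOSE table_cursor;
 DEALLOCATE table_cursor;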

Eventually most of our data flow tasks looked like the screenshot below:





Optimise Data Flow


Loading the data into partitions or using a balanced loading approach can be a great help in many cases, but after evaluating our situation we did not adopt it, as we found the bandwidth was the major factor affecting our data flow. And obviously, we won’t increase the bandwidth just for an archive/backup job.

So within our environment we just made the following configuration changes:

1. For the data flow task, increase DefaultBufferMaxRows and DefaultBufferSize (the defaults are 10,000 rows and 10 MB respectively); a rough sizing sketch follows this list;



2. For the data destination component, select Table or View – Fast Load, uncheck Keep Identity, Keep Nulls, and Check Constraints, and check the Table Lock option. Leave SSIS to determine the rows per batch and maximum insert commit size.
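For item 1, a rough way to pick DefaultBufferMaxRows is to check the average row size of the destination table and keep rows-per-buffer times row size within DefaultBufferSize. A sketch, assuming a hypothetical archive table name:

 -- Rough sizing aid: average row size in bytes for dbo.YourArchiveTable
 -- (placeholder name), to help keep DefaultBufferMaxRows x row size
 -- within DefaultBufferSize.
 SELECT SUM(ps.used_page_count) * 8.0 * 1024
        / NULLIF(SUM(ps.row_count), 0) AS avg_row_bytes
 FROM   sys.dm_db_partition_stats AS ps
 WHERE  ps.object_id = OBJECT_ID(N'dbo.YourArchiveTable')
   AND  ps.index_id IN (0, 1);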




After applying all these settings, the comparison results are below. As you can see, the total time spent dropped a lot. This result cannot compare to the case mentioned at the beginning of this post, but within our environment it is a very acceptable outcome.
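One straightforward way to produce such a comparison is to query DownloadLog. Against the illustrative schema sketched earlier, something like this gives the elapsed time per package run:

 -- Elapsed seconds per package run, using the illustrative DownloadLog columns.
 SELECT  PackageName,
         PreExecuteTime,
         PostExecuteTime,
         DATEDIFF(second, PreExecuteTime, PostExecuteTime) AS ElapsedSeconds,
         RowsTransferred
 FROM    dbo.DownloadLog
 ORDER BY PreExecuteTime;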