What I did to improve the data pipeline efficiency by 100x in MS SQL
I was going to build a training data pipeline fetching fresh labelled data from a MS SQL DB continuously. Soon after I noticed scalability would be a bottleneck here since this MS SQL DB is very OLTP focused and not indexed for massive analytical querying.
I’m going to share some specific science / engineering optimization strategies that I applied to overcome the scalability challenge and resulted in 100x efficiency boost.
Science
Move the under-sampling from the model pipeline to the data pipeline
The problem faced with is a severely imbalanced binary classification problem where minority class samples only takes up less than 10%. Therefore it’s not uncommon to apply under-sampling pr over-sampling when training the classifier. Particularly in our case, what’s got implemented in the model pipeline was under-sampling the majority class samples. The first trick that turned out to be very successful is to move the under-sampling from the model pipeline to the data pipeline. So it saved tons of time and resource as we don’t need to fetch majority class samples that would be thrown away.
Remove the time range condition as much as you can afford
Previously the data pipeline was designed to fetch daily records for both two classes. The advantage is to make parallel computing possible also make the data files more readable and manageable. However, adding time range condition such as BETWEEN
in the query was not efficient. I found removing it from the minority class data fetching query did speed it up a lot. The reason why I afforded to fetch ALL minority class without specifying the time range is because the count of minority class, unlike the majority class, is relatively small.
Engineering
ReadOnly
Make sure you add the readonly
option in the connection string so that we have less waits as we are contending with any inserts in the read only node.
Control how many cores at use
Set rate limit with maxdop=1
which will slow it down a bit as it will use only one core at a time. But having the setting in-place will help the server become more stable.