Hello everybody, welcome back to This Not That. This is a vlog where we talk about best practices within the data world and compare them to a common mistake that we often see in the industry.
Today we’re going to be talking about where it is best to do certain kinds of calculations within your data pipeline, and I’m going to walk you through my advice on that topic.
Welcome back everybody. So, one of the problems, or conundrums if you will, that I often run into is where precisely to do some of the ETL (extract, transform, and load) work that needs to be done for any given data project. There’s a push-and-pull mechanic going on there. If you do it further upstream in your data pipeline, you get some benefits there. If you do it further downstream, that’s usually more accessible to your citizen analysts and civilians, if you will, who are able to use those lower-code environments.
My advice is when you can, do things upstream not downstream. Why is that?
Talking about data projects, and I often work in the reporting world: in your reports, you want the report itself to do as little of the heavy lifting as possible, because performance comes at a real premium at that point in time. There’s been research done on the optimal wait between a click of the mouse, or an interaction with something, and the result popping up. In a perfect world you really want less than a second for your action to take effect and produce a result. Compare that, if you will, to what you expect in your back-end, upstream processes. When you’ve written a SQL query of some kind, you don’t need a sub-second response time; you can have a 10-second or 30-second or 1-minute or much longer response time, and it doesn’t matter. There’s a different expectation for what it is you’re interacting with. Let me give you a real-world example.
One of the tools I commonly work in is called Power BI. In Power BI you have the option to create a calculated column, where it will go to your source table, or the imported copy that it has, row by row, and add a value to it. This works really well; it’s very simple to write and to execute. But the problem is that this occurs post-compression. Earlier in the process, when you’re working in Power Query, which is where you usually do your ETL steps, the column is produced as part of the query itself, and the result of that query is then compressed. A calculated column, by contrast, is a value you’re adding onto the end of those results that doesn’t get the benefit of that compression.
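To make the compression point concrete, here is a toy model in plain Python. This is not VertiPaq or Power BI’s actual storage engine, and the column values are hypothetical; it just sketches dictionary encoding, the kind of compression that a column present before load can benefit from and a column added afterwards cannot.

```python
# Toy model (NOT VertiPaq) of why column placement matters.
# A column that exists before compression gets dictionary-encoded:
# each distinct value is stored once, plus a tiny id per row.

def dictionary_encode(column):
    """Return (dictionary, indexes): distinct values plus a small id per row."""
    dictionary = []
    lookup = {}
    indexes = []
    for value in column:
        if value not in lookup:
            lookup[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(lookup[value])
    return dictionary, indexes

# 10,000 rows but only 3 distinct values -- very compressible.
column = ["Low", "Medium", "High"] * 3333 + ["Low"]

dictionary, indexes = dictionary_encode(column)

# Encoded before load: 3 strings stored, plus 10,000 small integer ids.
# Added after compression: 10,000 full string values, none of them encoded.
assert len(dictionary) == 3
assert len(indexes) == 10_000
```

The same derived value costs far less to store when it participates in encoding, which is the storage-size gap between a Power Query column and a calculated column.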
If you move it just that one step upstream, you’re going to shrink your total file size. The file gets smaller AND your report gets faster as a result. You can move it up even further into SQL, and then your Power Query will run faster too.
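The upstream-versus-downstream trade can be sketched in plain Python, with ordinary functions standing in for the pipeline stages. The rows and the margin calculation here are hypothetical, purely for illustration.

```python
# Minimal sketch: compute a derived column downstream on every
# interaction, vs. once upstream during ETL. Hypothetical data.

rows = [
    {"sales": 120.0, "cost": 80.0},
    {"sales": 95.0, "cost": 60.0},
]

def render_downstream(rows):
    # Downstream: the report re-derives margin on EVERY click/interaction,
    # exactly when response time is at a premium.
    return [r["sales"] - r["cost"] for r in rows]

def etl_upstream(rows):
    # Upstream: margin is derived once, during ETL, where a slow step
    # is acceptable.
    return [{**r, "margin": r["sales"] - r["cost"]} for r in rows]

loaded = etl_upstream(rows)

def render_upstream(loaded):
    # The report just reads a stored value -- no per-interaction math.
    return [r["margin"] for r in loaded]

# Same numbers either way; only where the work happens changes.
assert render_downstream(rows) == render_upstream(loaded)
```

The results are identical; what moves is where the cost is paid, and paying it upstream keeps the interactive layer fast.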
There’s this chain reaction that occurs as you move things upstream.
Now, as I said, any time I give these guidelines there’s always an exception of some kind. There are places where the calculation is difficult to do upstream, or you don’t have the admin privileges that you need, or there’s a particular calculation you want in a reporting tool that’s hard to replicate precisely in your upstream database. In those instances, sure, do it all the way downstream wherever you would like. But where you can, push those changes upstream, and you will see positive results in your reporting and its response time.
Thank you guys for watching today. I hope you enjoy these tips and tricks, things that I’ve found useful. If you liked this, feel free to follow us, head over to the website, or come to social media and leave me a comment about something you’d like me to talk about. Additionally, the company that I work for, Valorem, would love to partner with you. If you’ve got a data project and you want some advice or need some help, back end, front end, anywhere in between, we would love to come alongside you and make you successful in the data work that you do.
Hope you all have a great week, and I will see you next time.