We have seen in our PowerBI Best Practice article that it is best to work with data at the maximum level of detail. Well, the problem is that sometimes that can be a lot of data… like, really a lot…
One day I wanted to extract a lot of information from an ERP system. Not only was the data way above Excel’s 1M-row capacity, but I was also limited in how much information I could extract from this ERP at a time.
Crazy, isn’t it? You know the information you need is out there but it’s LOCKED… What is the point of owning a Ferrari if you’re only going to drive it in first gear?
So what can you do? If you have read this article, you know that PowerBI is the right tool for you. We will show you how to use it depending on how many resources you have and how much support you can get from your IT department.
Regarding your data, you essentially have 3 options:
- Use PowerBI to connect directly to the database
- Make it fit
- Work around it
Use PowerBI to connect directly to the database
Good news #1: we know a very cool tool, PowerBI, which is capable of handling large amounts of data. This is the way to go!
Good news #2: it is wired with tons of connectors to big databases (SAP, BW, Hadoop…)
If you get sufficient support from your IT department, you should be able to activate these connectors and work without any issue with your large dataset.
However, if you are reading this blog, we do not expect you to have this kind of resources or much support from your IT department. You are left with the last 2 options:
Make it fit
At the other end of the spectrum, you can obviously try to make your data fit by reducing the amount of information you bring into PowerBI. This can essentially be done by:
- filtering: by filtering out some of your data, based on attributes or combinations of attributes that you are less likely to need later, you can make your dataset smaller and eventually fit within your capacity constraints. You can also limit your time horizon
- aggregating: by strategically giving up some levels of detail that are less likely to be useful in your analysis later on. For example, instead of bringing data in at a daily or weekly level of detail, you might consider monthly
In our opinion, the Make It Fit option should only be activated as a last resort. If you go for it, since you will obviously lose some detail in your information, be strategic about it. Think wisely about which levels of detail are less likely to add value to your analysis, and decide which data to leave out accordingly.
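To make these two levers concrete, here is a minimal Python sketch using pandas. The column names, dates and amounts are made up for illustration, not taken from any real extract:

```python
import pandas as pd

# Toy daily extract; columns and values are hypothetical
df = pd.DataFrame({
    'Date':   pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-10', '2019-03-01']),
    'Region': ['EMEA', 'EMEA', 'AMER', 'APAC'],
    'Amount': [100, 50, 70, 999],
})

# Filtering: limit the time horizon and drop attributes you are
# less likely to need later
df = df[df['Date'] >= pd.Timestamp('2022-01-01')]
df = df[df['Region'].isin(['EMEA', 'AMER'])]

# Aggregating: give up daily detail and roll up to monthly totals
monthly = (df.groupby([df['Date'].dt.to_period('M'), 'Region'])['Amount']
             .sum()
             .reset_index())
```

The same logic applies whatever the tool you use: filter first, then aggregate, and only down to the level of detail you are willing to lose.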
Thankfully, even with limited IT resources, we do not take NO for an answer, and there is a last option:
Work around it
This sounds absurd: we don’t have the tools to manage (extract or store) our data at the maximum level of detail, but we don’t give up hope of importing it into PowerBI anyway. How? By slicing it and re-aggregating it later!
- Slicing: first you need to cut your data into slices that fit within your extraction capacities. For this you can refer to the filtering strategy of the “Make it Fit” section.
Define some filters so that each extraction is manageable and run them one after the other until you have the full dataset.
Obviously, the fewer slices the better, so don’t necessarily limit yourself to the storage capacities of Excel or Access. Just focus on your extraction capacities and go with text (.TXT) files for storage if necessary.
- Re-aggregating: this can be managed within PowerBI using the Append Queries function in the Query Editor.
This works seamlessly if you have a few slices to re-aggregate. However, if you have many, you might want to run a batch script to manage this aggregation outside PowerBI.
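The slicing step above can be sketched in Python: here we generate one (start, end) date pair per month, each pair becoming one extraction filter to run in the source system. The year is an assumption for illustration; adapt it to your own horizon and extraction limits:

```python
import calendar

# Build one extraction window per month of a hypothetical year
year = 2022
slices = []
for month in range(1, 13):
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    slices.append((f'{year}-{month:02d}-01',
                   f'{year}-{month:02d}-{last_day:02d}'))

# Each (start, end) pair is one manageable extraction; save each result
# as its own .txt file in a single folder, ready to be appended back together
```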
Back to my ERP extraction problem: unfortunately there was no direct PowerBI connector that I could use, so I got the IT department to extract the historical information for me into a giant text file, and to run a script every day to extract the new daily data and drop it into another TXT file. All the TXT files were in a folder that I could access.
I was left with a bunch of TXT files in the same format; the only thing I had to do was aggregate them together. A quick Python script later (which I’m giving to you below) and I was done. I managed to get all my data at maximum detail out of this ERP and play with it in PowerBI!
Never give up!
Appendix: how to aggregate text files using Python
Text files are not limited in row capacity and can be an efficient, low-cost way to store large datasets.
One of the best tools to work with text files is Python.
If you do not have Python, you can download it for free from the official Python website.
Here we suppose that:
- you have a file called Startfile.txt at the path Path/Startfile.txt which contains the headers (and only the headers) of your text files (the headers are the same and in the same order in every file)
- all the text files that you want to append together are in a folder called Path/Files, and they are the only text files in this folder
- you have a file called Finalfile.txt that will contain the result of appending all the text files containing your data. This file is at the path Path/Finalfile.txt
What you need to do:
- open your notepad and copy-paste the script below
- replace Path with the actual path where your files will be stored (and only these files), create the Startfile and drop your other text files in the Path/Files folder
- save this script with a .py extension
- run it, that’s it! You can now open your Finalfile.txt or connect it to PowerBI to read your data
import glob

g = open('Path/Finalfile.txt', 'w')         # final aggregated file
g.write(open('Path/Startfile.txt').read())  # write the headers first
# append every text file in Path/Files, skipping the header row
# that each of them repeats (as per the assumptions above)
for fle in glob.glob('Path/Files/*.txt'):
    with open(fle) as f:
        text = f.read().splitlines(True)
        g.writelines(text[1:])
g.close()