Combine data files with Power Query
Posted by:Catalin Bombea | March 24, 2022
Data variety is a challenge when you need to combine it into a single table. If data is consistent over time, the usual tools within power Query will not fail, but this scenario rarely happens in the real life.
Most common issues when combining data:
- there is a change in headers, refreshing the query fails to identify new headers;
- data comes in a different format than expected: xlsx, csv.
Download the demo files to try the solution:
Make sure you unzip the files on a location from your computer, then update the path from Settings sheet (in the file named GetCSV and XLSX from folder.xlsx) to point to your Data folder new location.
Why Power Query fails to refresh when there is a change in source data?
When you use the default tool “Combine Files” when you Get Data from Folder with Power Query, the application will analyze the first data file from that folder, or you can select the file that will serve as reference.
Here is how an automatic set of queries looks like:
Power Query will automatically build a custom function that reads the reference file and applies it to each file from that folder. This custom function created based on the reference file will fail if:
- the current file type is different than the reference file type, Power Query will fail to apply the same steps applied to the reference file.
- if the file does not contain the exact headers found in the reference file,
- the file selected as a reference is no longer in the source folder.
Because of these frequent errors, a more flexible approach is needed.
How to create a flexible solution to combine data from a folder:
If there are different file types, you have to process each file type separately.
Processing csv Files:
Here is a very simple query that will process only csv files from a folder, filtering out any other file type:
let Path=Excel.CurrentWorkbook(){[Name="PathTable"]}[Content]{0}[Source Data Folder Path], Source = Folder.Files(Path), #"Filtered Rows" = Table.SelectRows(Source, each [Extension]=".csv"), #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each Table.PromoteHeaders(Csv.Document([Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]))), #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}), #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", Table.ColumnNames(Table.Combine(#"Removed Other Columns"[Custom])),Table.ColumnNames(Table.Combine(#"Removed Other Columns"[Custom]))) in #"Expanded Custom"
Processing xlsx Files:
In the same manner, you can process only xlsx file with the following query:
let Path=Excel.CurrentWorkbook(){[Name="PathTable"]}[Content]{0}[Source Data Folder Path], Source = Folder.Files(Path), #"Filtered Rows" = Table.SelectRows(Source, each Text.Contains(Text.Lower([Extension]),".xl")), #"Filtered Rows3" = Table.SelectRows(#"Filtered Rows", each not Text.StartsWith([Name], "~")), #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows3",{"Content"}), #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each Excel.Workbook([Content])), #"Removed Other Columns1" = Table.SelectColumns(#"Added Custom",{"Custom"}), #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns1", "Custom", {"Name", "Data", "Item", "Kind", "Hidden"}, {"Name", "Data", "Item", "Kind", "Hidden"}), #"Filtered Rows1" = Table.SelectRows(#"Expanded Custom", each ([Kind] = "Sheet")), #"Removed Other Columns2" = Table.SelectColumns(#"Filtered Rows1",{"Data"}), Result = Table.Combine(Table.TransformColumns(#"Removed Other Columns2",{"Data",each Table.PromoteHeaders(_)})[Data]), Tbl = if Table.IsEmpty(Result) then #table({},{}) else Result in Tbl
Both csv and xlsx queries will always combine all headers found in the source files, therefore any new headers added in the source files will be identified and combined in the output table.
The main query that combines the specific queries is very simple:
let Source = Table.Combine({CSVData,XLSXData}) in Source
Using this flexible structure, you’ll be able to easily combine any csv and xlsx files with any headers, the refresh will never fail if headers change.
Of course, you can add different file types, if csv and xlsx are not enough, simply create a similar query that should process only text files. However, queries from text files may vary a lot, so a generic universal query from .txt files cannot be easily created.
Have fun,
Catalin