The simple script below shows how to combine csv files without duplicating headings using C#. This technique assumes that all of the files have the same structure, and that all of them contain a header row.
The timings in this post came from combining 8 csv files with 13 columns and a combined total of 9.2 million rows.
I first tried combining the files with the PowerShell technique described here. It was painfully slow and took an hour and a half! This is likely because it is deserializing and then serializing every bit of data in the files, which adds a lot of unnecessary overhead.
Next I tried the C# script below using LINQPad. When reading from and writing to a network share, it took 3 minutes and 56 seconds. Much better! Next I tried it on a local SSD drive and it took just 44 seconds. To recap:
- PowerShell with deserialization/serialization: 90 minutes
- C# with source and destination on network drive: 3 minutes and 56 seconds
- C# with source and destination on a local SSD: 44 seconds
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
/* C# script combines multiple csv files without duplicating headers. Combining 8 files with a combined total of about 9.2 million rows took about 3.5 minutes on a network share and 44 seconds on an SSD. */ string sourceFolder = @"C:\CSV_Files"; string destinationFile = @"C:\CSV_Files\CSV_Files_Combined.csv"; // Specify wildcard search to match CSV files that will be combined string[] filePaths = Directory.GetFiles(sourceFolder, "CSV_File_Number?.csv"); StreamWriter fileDest = new StreamWriter(destinationFile, true); int i; for (i = 0; i < filePaths.Length; i++) { string file = filePaths[i]; string[] lines = File.ReadAllLines(file); if (i > 0) { lines = lines.Skip(1).ToArray(); // Skip header row for all but first file } foreach (string line in lines) { fileDest.WriteLine(line); } } fileDest.Close(); |
Let me know what you think in the comments. And if you know of a faster way to accomplish this, I’d love to hear it!
Not a faster way but if the columns of the different csv files are partially mismatched then the below will solve the problem
Also included are the helper classes containing the methods to move the data from column to row-based format then to csv format.
On the line that reads, “File.ReadLines(filePaths[i]).Skip(1).Select(line => line.Split(splitter)).ForEach(spline =>” I’m getting the error below. Does anyone know what is wrong? Unfortunately my Linq is rusty.
Error CS1061 ‘IEnumerable<string[]>’ does not contain a definition for ‘ForEach’ and no accessible extension method ‘ForEach’ accepting a first argument of type ‘IEnumerable<string[]>’ could be found (are you missing a using directive or an assembly reference?)
Thanks in advance
That sounds very useful, Declan. Thanks for sharing the code!
very cool, i was just writing this exact thing myself, and your idea of merging mismatched header is genius
UwU i wonder right now, did we get other way to do this? or have library for it? But in anway this post have helped me much :3 Thanks guys