Merging CSV rows in a huge file

I have a CSV that looks like this:

    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

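The real file has 5 billion records, though. If you look at the first column and at the date part of the second column, you can see that the records are "grouped" in threes: each group is the same ID on the same day, broken down into 15-minute intervals covering the first 30 minutes of that day.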

I want the output to look like this:

    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

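Here the first 4 columns of each repeated row are omitted, and the remaining columns are appended to the first record of the group. Basically, I am converting the file from one line per 15 minutes to one line per day.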

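Since I will be processing 5 billion records, I think the best approach is to use regular expressions (with EmEditor) or some tool built for this kind of job (multithreaded, optimized), rather than a custom programmed solution. That said, I am open to ideas in nodeJS or C# as long as they are relatively simple and very fast.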

How can this be done?

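If there is always a fixed number of records per group and they are in order, it is fairly easy to read a few lines at a time, parse them, and write them out. Running a regular expression over billions of records would take forever. StreamReader and StreamWriter should cope with files this large, because they read and write one line at a time.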

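    // requires: using System.Collections.Generic; using System.IO; using System.Linq;
    using (StreamReader sr = new StreamReader("inputFile.txt"))
    using (StreamWriter sw = new StreamWriter("outputFile.txt"))
    {
        string line1;
        var lineCountToGroup = 3; // change to 96
        while ((line1 = sr.ReadLine()) != null)
        {
            var lines = new List<string>();
            lines.Add(line1);
            for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
            {
                var next = sr.ReadLine();
                if (next == null) break; // file ended in the middle of a group
                lines.Add(next);
            }
            // whatever your grouping logic is; one possibility: keep the first line
            // whole and append every later line without its first 4 columns
            var groupedLine = string.Join(",", lines.Take(1).Concat(
                lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4)))));
            sw.WriteLine(groupedLine);
        }
    }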

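Disclaimer: untested code with no error handling, and it assumes that every group really does have the expected number of lines, and so on. You will obviously need to adjust it for your exact scenario.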

You could do something like this (untested code without any error handling, but it should give you the general gist):

    // requires: using System.IO; using System.Linq; using System.Text;
    using (var sin = new StreamReader("yourfile.csv"))
    using (var sout = new StreamWriter("outfile.csv"))
    {
        var line = sin.ReadLine();    // note: should add error handling for empty files
        var cells = line.Split(',');  // note: you should probably check the length too!
        var key = cells[0];           // use this to match other rows
        StringBuilder output = new StringBuilder(line); // this is the output line we build
        while ((line = sin.ReadLine()) != null) // while we have more lines
        {
            cells = line.Split(','); // split so we can get the first column
            if (cells[0] == key)     // the first column matches the current key
            {
                // add this row, minus its first 4 columns, to our output line
                output.Append(',').Append(String.Join(",", cells.Skip(4)));
                continue;
            }
            // once the key changes
            sout.WriteLine(output.ToString()); // write out the line we've built up
            output.Clear();
            output.Append(line);               // start building the new line
            key = cells[0];                    // and update the key
        }
        // once all lines have been processed
        sout.WriteLine(output.ToString()); // only the last built line is left to write out
    }

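The idea is to loop through every line and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update the key. This way you do not have to worry about exactly how many matching rows there are, or whether a few points are missing.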
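The question groups by ID and by day, so the key may need to include the date part of the second column as well as the first column. A minimal sketch of the same key-change loop under that assumption follows. The names `CsvDayMerger`, `Merge`, and `skipColumns` are hypothetical and chosen only for illustration; the sketch assumes the rows of a group are contiguous and that the timestamp looks like `2014-01-01 00:15`.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class CsvDayMerger
{
    // Hypothetical helper: merges consecutive rows that share the same ID
    // (column 0) and the same date (the part of column 1 before the space).
    public static void Merge(TextReader input, TextWriter output, int skipColumns = 4)
    {
        string currentKey = null;
        var merged = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.Length == 0) continue; // ignore blank lines

            string[] cells = line.Split(',');
            // Assumption: column 1 is "yyyy-MM-dd HH:mm", so the date is
            // everything before the first space.
            string day = cells.Length > 1 ? cells[1].Split(' ')[0] : "";
            string key = cells[0] + "," + day;

            if (key == currentKey)
            {
                // Same ID and day: append the row without its leading columns.
                merged.Append(',').Append(string.Join(",", cells.Skip(skipColumns)));
            }
            else
            {
                // Key changed: flush the finished group and start a new one.
                if (currentKey != null) output.WriteLine(merged.ToString());
                merged.Clear().Append(line);
                currentKey = key;
            }
        }
        // Flush the last group, if any input was read at all.
        if (currentKey != null) output.WriteLine(merged.ToString());
    }
}
```

Because it takes a `TextReader` and a `TextWriter`, it can be tried on a small `StringReader` first and then pointed at a `StreamReader`/`StreamWriter` pair for the real file.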

One note: if you are going to concatenate 96 rows per group, building the output with a StringBuilder rather than a String is likely to be more efficient.

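Define ProcessOutputLine to store the merged lines. Call ProcessInputLine after each ReadLine, and at the end of the file call ProcessOutputLine once more for the line that is still pending.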

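    string curKey = "";
    int keyLength = ...;    // set to the total length of the first 4 columns
                            // (the prefix compared here must be identical for every row of a group)
    string outputLine = "";

    private void ProcessInputLine(string line)
    {
        string newKey = line.Substring(0, keyLength);
        if (newKey == curKey)
        {
            // same group: append everything after the key prefix
            outputLine += line.Substring(keyLength);
        }
        else
        {
            // key changed: hand over the finished line and start a new one
            if (outputLine != "") ProcessOutputLine(outputLine);
            curKey = newKey;
            outputLine = line;
        }
    }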

Edit: this solution is very similar to Matt Burland's; the only difference is that I do not use the Split function.
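A split-free approach like this one depends on a character offset, and in the sample data the interval index in the third column grows from one digit to two, so a fixed offset may not always land on the fifth column. One possible way around that, sketched here under the assumption that no field contains a quoted comma, is to locate the offset by counting commas with `IndexOf`. The names `CsvOffsets` and `IndexAfterComma` are hypothetical.

```csharp
using System;

public static class CsvOffsets
{
    // Hypothetical helper: returns the index just past the n-th comma in the
    // line, or line.Length when the line has fewer than n commas.
    public static int IndexAfterComma(string line, int n)
    {
        int pos = -1;
        for (int i = 0; i < n; i++)
        {
            pos = line.IndexOf(',', pos + 1);
            if (pos < 0) return line.Length; // not enough columns
        }
        return pos + 1;
    }
}
```

With it, the append step could become `outputLine += "," + line.Substring(CsvOffsets.IndexAfterComma(line, 4));`, which still avoids allocating an array per row.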