Merging CSV rows in a huge file
I have a CSV like this:
    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
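The real file has about 5 billion records, though. If you look at the first column and at part of the second column (the day), you will see that three of the records are "grouped" together and are just the 15-minute intervals for the first 30 minutes of that day.
I want the output to look like this: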
    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
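Here the first 4 columns of the repeated rows are omitted and the remaining columns are appended to the first record of the group. Basically, I am converting from one line per 15 minutes to one line per day.
Since I will be processing 5 billion records, I think the best approach is a regular expression (with EmEditor) or some tool built for this kind of job (multithreaded, optimized), rather than a custom programmed solution. That said, I am open to ideas in nodeJS or C# that are relatively simple and very fast.
How can this be done?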
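If there is always a fixed number of records per group, and they are in order, it would be fairly easy to read just a few lines at a time, parse them, and write them out. Trying to run a regular expression over billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write files this large, because they read and write one line at a time.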
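    using System.IO;
    using System.Linq;
    using System.Text;

    using (StreamReader sr = new StreamReader("inputFile.txt"))
    using (StreamWriter sw = new StreamWriter("outputFile.txt"))
    {
        string line1;
        var lineCountToGroup = 3; // change to 96
        while ((line1 = sr.ReadLine()) != null)
        {
            // Start the merged line with the first row of the group, kept whole.
            var groupedLine = new StringBuilder(line1);
            for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because line1 is already in
            {
                var next = sr.ReadLine();
                if (next == null) break; // file ended in the middle of a group
                // Grouping logic (adjust as needed): drop the first 4 columns, append the rest.
                groupedLine.Append(',').Append(string.Join(",", next.Split(',').Skip(4)));
            }
            sw.WriteLine(groupedLine.ToString());
        }
    }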
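Disclaimer: this is untested code with no error handling, and it assumes the repeated rows really do come in the right count, and so on. You will obviously need to adapt it to your exact scenario.
You could do something like this (untested code without any error handling, but it should give you the general gist):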
    using System.IO;
    using System.Linq;
    using System.Text;

    using (var sin = new StreamReader("yourfile.csv"))
    using (var sout = new StreamWriter("outfile.csv"))
    {
        var line = sin.ReadLine();             // note: should add error handling for empty files
        var cells = line.Split(',');           // note: you should probably check the length too!
        var key = cells[0];                    // use this to match other rows
        var output = new StringBuilder(line);  // this is the output line we build
        while ((line = sin.ReadLine()) != null) // while we have more lines
        {
            cells = line.Split(',');           // split so we can get the first column
            if (cells[0] == key)               // the first column matches the current key
            {
                // add this row to our output line, minus its first 4 columns
                output.Append(',').Append(string.Join(",", cells.Skip(4)));
            }
            else                               // the key has changed
            {
                sout.WriteLine(output.ToString()); // write out the line we've built up
                output.Clear();
                output.Append(line);           // start building the new line
                key = cells[0];                // and update the key
            }
        }
        // once all lines have been processed, only the last built line is left to write
        sout.WriteLine(output.ToString());
    }
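The idea is to loop over every line and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update the key. That way you do not have to worry about exactly how many matches you have, or about a few data points being missing.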
It is worth noting that if you are joining 96 rows together, using a StringBuilder for output rather than a String is likely to be more efficient.
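The loop above keys on the first column alone, while the question groups by ID and day. Here is a minimal sketch of the same key-change idea with the key widened to the ID plus the date part of the second column. The class and method names (DailyMerger, GroupKey, Merge) are made up for illustration, and it assumes the second column always starts with a 10-character yyyy-MM-dd date and that rows belonging to one group are adjacent in the file.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class DailyMerger
{
    // Hypothetical helper: key = column 1 (ID) plus the first 10 characters
    // of column 2, assumed to be the date in yyyy-MM-dd form.
    public static string GroupKey(string[] cells)
    {
        return cells[0] + "," + cells[1].Substring(0, 10);
    }

    // Streams lines from input to output, merging consecutive lines that share a key.
    // The first line of a group is kept whole; later lines lose their first
    // skipColumns columns before being appended.
    public static void Merge(TextReader input, TextWriter output, int skipColumns = 4)
    {
        string currentKey = null;
        var merged = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.Length == 0) continue; // ignore blank lines

            var cells = line.Split(',');
            if (cells.Length < 2 || cells[1].Length < 10)
                throw new FormatException("Unexpected row: " + line);

            var key = GroupKey(cells);
            if (key == currentKey)
            {
                merged.Append(',').Append(string.Join(",", cells.Skip(skipColumns)));
            }
            else
            {
                // Key changed: flush the finished group (if any) and start a new one.
                if (currentKey != null) output.WriteLine(merged.ToString());
                merged.Clear().Append(line);
                currentKey = key;
            }
        }
        // Flush the last group.
        if (currentKey != null) output.WriteLine(merged.ToString());
    }
}
```

Because Merge takes a TextReader and a TextWriter, it can be tried on a few lines with StringReader and StringWriter first, then pointed at a StreamReader for yourfile.csv and a StreamWriter for outfile.csv. Only one group is held in memory at a time, so the size of the file should not matter much.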
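Define ProcessOutputLine to store the merged line. Call ProcessInputLine after each ReadLine, and at the end of the file pass the last outputLine to ProcessOutputLine.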
    string curKey = "";
    int keyLength = ...; // set to the total length of the first 4 columns
    string outputLine = "";

    private void ProcessInputLine(string line)
    {
        string newKey = line.Substring(0, keyLength);
        if (newKey == curKey)
        {
            // same group: append everything after the key
            outputLine += line.Substring(keyLength);
        }
        else
        {
            // key changed: hand over the finished line and start a new one
            if (outputLine != "") ProcessOutputLine(outputLine);
            curKey = newKey;
            outputLine = line;
        }
    }
Edit: this solution is very similar to Matt Burland's; the only difference is that I do not use the Split function.
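One thing to watch with a fixed keyLength: in the sample data the second column contains the time, which differs on every row, and the third column may not always be the same width. A possible way to keep the no-Split approach is to find the cut points by scanning for commas. This is only a sketch; the names ColumnOffsets, StartOfColumn, KeyOf and TailOf are hypothetical, and it assumes the date and the time in column 2 are separated by a single space.

```csharp
using System;

public static class ColumnOffsets
{
    // Index at which the given zero-based column starts, found by
    // walking past columnIndex commas. Throws if the line is too short.
    public static int StartOfColumn(string line, int columnIndex)
    {
        int pos = 0;
        for (int i = 0; i < columnIndex; i++)
        {
            int comma = line.IndexOf(',', pos);
            if (comma < 0)
                throw new FormatException("Too few columns: " + line);
            pos = comma + 1;
        }
        return pos;
    }

    // Grouping key: the ID plus the date, i.e. everything before the
    // space that separates date and time in column 2 (an assumption).
    public static string KeyOf(string line)
    {
        int space = line.IndexOf(' ', StartOfColumn(line, 1));
        if (space < 0)
            throw new FormatException("No time part in column 2: " + line);
        return line.Substring(0, space);
    }

    // The part of the line to append to a merged row: everything after
    // the first skipColumns columns, including the leading comma.
    public static string TailOf(string line, int skipColumns)
    {
        if (skipColumns < 1)
            throw new ArgumentOutOfRangeException(nameof(skipColumns));
        return line.Substring(StartOfColumn(line, skipColumns) - 1);
    }
}
```

With these, ProcessInputLine could compare KeyOf(line) against curKey and append TailOf(line, 4) to outputLine, instead of cutting both at the same fixed offset.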