Merging CSV rows in a huge file

I have a CSV that looks like this:

    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

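The real file has 5 billion records, though. If you look at the first column and the date part of the second column, you can see that each set of three records is "grouped" together: they are just the 15-minute intervals covering the first 30 minutes of that day.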

I want the output to look like this:

    783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
    ...
    783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

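Here the first 4 columns of the repeated rows are omitted, and the remaining columns are appended to the first record of the group. Basically, I am converting from one line per 15 minutes to one line per day.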

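Since I will be processing 5 billion records, I think the best approach is to use regular expressions (with EmEditor) or some tool built for this kind of job (multithreaded, optimized) rather than a custom-programmed solution. That said, I am open to ideas in nodeJS or C# that are relatively simple and very fast.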

How can this be done?

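If there is always a fixed number of records per group and they are in order, it is fairly easy to read just a few lines at a time, parse them, and write them out. Trying to run a regular expression over billions of records would take forever. StreamReader and StreamWriter should be able to read and write files this large, because they process one line at a time.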

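    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    using (StreamReader sr = new StreamReader("inputFile.txt"))
    using (StreamWriter sw = new StreamWriter("outputFile.txt"))
    {
        string line1;
        var lineCountToGroup = 3; // change to 96
        while ((line1 = sr.ReadLine()) != null)
        {
            var lines = new List<string>();
            lines.Add(line1);
            for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
            {
                var next = sr.ReadLine();
                if (next == null) break; // file ended in the middle of a group
                lines.Add(next);
            }
            // Grouping logic from the question: keep the first line whole and
            // append every later line without its first 4 columns.
            var groupedLine = lines[0] + string.Concat(
                lines.Skip(1).Select(l => "," + string.Join(",", l.Split(',').Skip(4))));
            sw.WriteLine(groupedLine);
        }
    }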

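Disclaimer: this is untested code with no error handling, and it assumes that every group really does have the expected number of rows. You will obviously need to adjust it for your exact scenario.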

You could do something like this (untested code without any error handling, but it should give you the general gist):

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;

    using (var sin = new StreamReader("yourfile.csv"))
    using (var sout = new StreamWriter("outfile.csv"))
    {
        var line = sin.ReadLine();    // note: should add error handling for empty files
        var cells = line.Split(',');  // note: you should probably check the length too!
        var key = cells[0];           // use this to match other rows
        StringBuilder output = new StringBuilder(line); // this is the output line we build
        while ((line = sin.ReadLine()) != null) // if we have more lines
        {
            cells = line.Split(',');  // split so we can get the first column
            if (cells[0] == key)      // if the first column matches the current key
            {
                // add this row, minus its first 4 columns, to our output line
                output.Append(',').Append(String.Join(",", cells.Skip(4)));
            }
            else                      // once the key changes
            {
                sout.WriteLine(output.ToString()); // write out the line we've built up
                output.Clear();
                output.Append(line);  // start building the new line
                key = cells[0];       // and update the key
            }
        }
        // once all lines have been processed, the last group is still unwritten
        sout.WriteLine(output.ToString());
    }

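The idea is to loop through every line and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update the key. That way you don't have to worry about how many matching rows there are, or about a group that is missing a few data points.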

It is worth noting that if you are going to join 96 rows, using a StringBuilder for output rather than a String will probably be more efficient.
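The key-tracking idea above can be pulled into a small method that works on any sequence of lines, which makes it easier to try on a handful of rows before running it on the full file. The following is only a sketch under assumptions: the names `CsvDayMerger`, `MergeByDay`, and `skipColumns` are hypothetical, and the key is taken to be the first column plus the date part of the second column, as the question describes. Note that the sample output in the question appears to drop five leading fields (it resumes at `40.0`) although the text says four, so the count is left as a parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CsvDayMerger
{
    // Assumed key: ID column plus the date part of the timestamp column.
    private static string GroupKey(string[] cells)
    {
        var timestamp = cells[1];
        var space = timestamp.IndexOf(' ');
        var day = space < 0 ? timestamp : timestamp.Substring(0, space);
        return cells[0] + "," + day;
    }

    // Lazily merges consecutive rows that share a key into one output row.
    // Rows after the first in a group lose their first skipColumns fields.
    public static IEnumerable<string> MergeByDay(IEnumerable<string> lines, int skipColumns)
    {
        var hasGroup = false;
        var key = "";
        var output = new StringBuilder();

        foreach (var line in lines)
        {
            if (line.Length == 0) continue; // ignore blank lines

            var cells = line.Split(',');
            if (cells.Length < 2)
                throw new FormatException("Expected at least two columns: " + line);

            var newKey = GroupKey(cells);
            if (hasGroup && newKey == key)
            {
                // Same group: append the remaining columns to the current row.
                output.Append(',').Append(string.Join(",", cells.Skip(skipColumns)));
            }
            else
            {
                // Key changed: emit the finished row and start a new one.
                if (hasGroup) yield return output.ToString();
                output.Clear().Append(line);
                key = newKey;
                hasGroup = true;
            }
        }

        if (hasGroup) yield return output.ToString(); // last group
    }
}
```

Because the method is an iterator, it should be possible to stream a large file through it with something like `File.WriteAllLines("outfile.csv", CsvDayMerger.MergeByDay(File.ReadLines("yourfile.csv"), 4))`, since `File.ReadLines` reads lazily. Like the code above, this only merges rows that are adjacent in the input.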

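Define ProcessOutputLine to store each merged line. Call ProcessInputLine after every ReadLine, and at the end of the file pass the last outputLine to ProcessOutputLine so the final group is not lost.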

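    string curKey = "";
    int keyLength = ...; // set to the total length of the first 4 columns
    string outputLine = "";

    private void ProcessInputLine(string line)
    {
        // Assumes the first keyLength characters are identical for every row of a group.
        string newKey = line.Substring(0, keyLength);
        if (newKey == curKey)
        {
            outputLine += line.Substring(keyLength); // same group: append the rest of the row
        }
        else
        {
            if (outputLine != "")
                ProcessOutputLine(outputLine); // key changed: emit the finished line
            curKey = newKey;
            outputLine = line;
        }
    }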

Edit: this solution is very similar to Matt Burland's; the only difference is that I don't use the Split function.
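If the leading columns are not fixed-width (the third column in the sample looks like a counter, which would grow to two digits with 96 rows per day), a fixed keyLength may not line up with the column boundary. One way to keep avoiding Split is to locate the n-th comma with IndexOf. This is a minimal sketch; `CsvPrefix` and `IndexAfterNthComma` are hypothetical names, and it assumes fields never contain quoted commas.

```csharp
using System;

public static class CsvPrefix
{
    // Returns the index just past the n-th comma in line,
    // or -1 if the line contains fewer than n commas.
    public static int IndexAfterNthComma(string line, int n)
    {
        var pos = -1;
        for (var i = 0; i < n; i++)
        {
            pos = line.IndexOf(',', pos + 1);
            if (pos < 0) return -1;
        }
        return pos + 1;
    }
}
```

With this, `line.Substring(CsvPrefix.IndexAfterNthComma(line, 4))` would give the part of a row after its first four columns, provided the result is checked for -1 first.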
