Decoding a combination of Windows-1252 and quoted-printable HTML

I have been given a piece of text representing HTML, for example:

<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n

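From the HTML <meta> tag, I can see that this piece of HTML is supposed to be encoded as Windows-1252.

I am using node.js to parse this piece of text with cheerio. However, decoding it with https://github.com/mathiasbynens/windows-1252 does not help: windows1252.decode(myString); returns the same input string.

I think the reason is that the input string is already encoded in the standard node.js charset, but it actually represents a Windows-1252 encoded piece of HTML (if that makes sense).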
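One way to test this hunch is to check whether the string contains anything other than ASCII characters. The sketch below uses a hypothetical helper, isPlainAscii, and a shortened sample string; neither is part of any library. Since Windows-1252 agrees with ASCII for every value below 0x80, a Windows-1252 decoder would have nothing to change in a string that passes this check, which would explain why the output equals the input.

```javascript
// Shortened sample: the escapes "=A3" and "=96" are still literal ASCII text.
const sample = 'a pound sign: =A3 and a long dash: =96';

// Hypothetical helper: true when every UTF-16 code unit is 7-bit ASCII.
function isPlainAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

console.log(isPlainAscii(sample)); // true
```

In other words, the escapes have to be turned back into byte values before any Windows-1252 decoding can have an effect.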

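Checking those odd hex numbers prefixed by =, I can see valid Windows-1252 codes, for example:

this =\r\n and this \r\n should somehow represent a carriage return in the Windows world,
=3D: HEX 3D is DEC 61, which is an equals sign: =,
=96: HEX 96 is DEC 150, which is an 'en dash' sign: – (some sort of "long minus sign"),
=A3: HEX A3 is DEC 163, which is a pound sign: £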
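The hex-to-decimal arithmetic above can be checked with a few lines of node.js. escapeToByte is a hypothetical helper introduced only for this illustration.

```javascript
// Hypothetical helper: turn one "=XX" escape into its numeric byte value.
function escapeToByte(escape) {
  // Drop the leading "=" and read the two remaining characters as hex.
  return parseInt(escape.slice(1), 16);
}

console.log(escapeToByte('=3D')); // 61
console.log(escapeToByte('=96')); // 150
console.log(escapeToByte('=A3')); // 163

// 0x3D is in the ASCII range, so the byte value is also the character code.
console.log(String.fromCharCode(escapeToByte('=3D'))); // =
```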

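I have no control over the code that generates this HTML, but I am supposed to parse it and clean it up, giving back £ (instead of =A3), etc.

Now, I know I could keep an in-memory map of the conversions, but I was wondering whether there is already a programmatic solution that covers the whole Windows-1252 charset?

Cf. this for the whole conversion table: https://www.w3schools.com/charsets/ref_html_ansi.asp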

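Edit:

The input HTML comes from an IMAP session, so it seems there is a 7bit/8bit "quoted-printable encoding" going on upstream that I can't control (cf. https://en.wikipedia.org/wiki/Quoted-printable).

In the meantime, having become aware of this extra encoding, I tried the quoted-printable library (cf. https://github.com/mathiasbynens/quoted-printable), but with no luck.

Here is an MCV (as per request):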

    var cheerio = require('cheerio');
    var windows1252 = require('windows-1252');
    var quotedPrintable = require('quoted-printable');

    const inputString = '<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n';

    const $ = cheerio.load(inputString, {decodeEntities: true});
    const bodyContent = $('html body').text().trim();

    const decodedBodyContent = windows1252.decode(bodyContent);

    console.log(`The input string: "${bodyContent}"`);
    console.log(`The output string: "${decodedBodyContent}"`);

    if (bodyContent === decodedBodyContent) {
      console.log('The windows1252 output seems the same of as the input');
    }

    const decodedQp = quotedPrintable.decode(bodyContent);
    console.log(`The decoded QP string: "${decodedQp}"`);

The previous script produces the following output:

    The input string: "This should be a pound sign: =A3 and this should be a long dash: =96"
    The output string: "This should be a pound sign: =A3 and this should be a long dash: =96"
    The windows1252 output seems the same of as the input
    The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "

On my command line I can't see the long dash, and I'm not sure how to correctly decode all of these =<something> encoded characters.
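A likely explanation for the missing dash, sketched under the assumption that quotedPrintable.decode returns one character per decoded byte (a "byte string"): the pound sign only appears to work because byte 0xA3 happens to have the same value as the Unicode code point for £, whereas byte 0x96 turns into U+0096, an invisible control character. decodeEscapes below is a hypothetical stand-in, not the library's code.

```javascript
// Stand-in for a quoted-printable decoder (hypothetical helper): every
// "=XX" escape becomes the character whose code is 0xXX.
function decodeEscapes(text) {
  return text.replace(/=([0-9A-Fa-f]{2})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

const decoded = decodeEscapes('=A3 and =96');

// Hex code of the first and of the last character of the decoded string.
const first = decoded.charCodeAt(0).toString(16);
const last = decoded.charCodeAt(decoded.length - 1).toString(16);

console.log(first); // a3: U+00A3 is the pound sign in Unicode as well
console.log(last);  // 96: U+0096 is a C1 control character, not U+2013
```

If that assumption holds, the decoded string still needs a Windows-1252 step to map 0x96 to the en dash at U+2013.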

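It seems that the message received via IMAP comes with a combination of two different encodings:

the actual string is encoded according to the "quoted-printable" encoding (https://en.wikipedia.org/wiki/Quoted-printable), because I think there is a 7bit/8bit mapping issue when transferring that information over the IMAP channel (a TCP socket connection)
the logical representation of the content (an email body), which is HTML with a <meta> tag declaring the Windows-1252 charset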
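The two layers can be sketched without any npm package, which may help to see what each step contributes. This is a minimal illustration and not the code of the libraries mentioned here: qpToBytes, cp1252ToString and CP1252_HIGH are hypothetical names, and the table follows the common WHATWG mapping in which the five unassigned bytes keep their own value.

```javascript
// Code points for bytes 0x80..0x9F in Windows-1252. All other bytes map to
// the code point with the same value (ASCII below 0x80, Latin-1 from 0xA0).
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Layer 1: quoted-printable text -> raw bytes.
function qpToBytes(text) {
  // A "=" directly before a line break is a soft line break: drop both.
  const unfolded = text.replace(/=\r?\n/g, '');
  const bytes = [];
  for (let i = 0; i < unfolded.length; i++) {
    const hex = unfolded.slice(i + 1, i + 3);
    if (unfolded[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=XX" escape -> one byte
      i += 2;
    } else {
      // Quoted-printable text is expected to be ASCII; keep the low byte.
      bytes.push(unfolded.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Layer 2: Windows-1252 bytes -> JavaScript string.
function cp1252ToString(bytes) {
  let out = '';
  for (const b of bytes) {
    const codePoint = b >= 0x80 && b <= 0x9f ? CP1252_HIGH[b - 0x80] : b;
    out += String.fromCharCode(codePoint);
  }
  return out;
}

console.log(cp1252ToString(qpToBytes('pound: =A3, dash: =96')));
console.log(cp1252ToString(qpToBytes('charset=3DWindows-1=\r\n252')));
```

Keeping the intermediate result as bytes rather than as a string makes the order of the two steps explicit: the charset can only be applied once the transfer encoding has been undone.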

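There is also an "issue" with these HTML chunks containing lots of Windows-style carriage returns (\r\n). I had to pre-process the string to deal with that; in my case: removing those carriage returns.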
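A caveat on this pre-processing step, shown as a small sketch: in quoted-printable, a = directly before \r\n is a soft line break, so stripping every \r\n leaves the = behind, and it can then pair up with the next two characters. unescapeQp is a hypothetical stand-in for the escape-decoding step, and the sample is the folded charset attribute from the question.

```javascript
// Hypothetical stand-in for the "=XX" decoding step of a QP decoder.
function unescapeQp(text) {
  return text.replace(/=([0-9A-Fa-f]{2})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// The attribute value from the question, folded by a soft line break.
const folded = 'charset=3DWindows-1=\r\n252';

// Removing every CRLF leaves "=252" behind, and "=25" decodes to "%".
const stripAll = unescapeQp(folded.replace(/\r\n/g, ''));

// Removing only "=" + CRLF (the soft line break) keeps the value intact.
const stripSoft = unescapeQp(folded.replace(/=\r\n/g, ''));

console.log(stripAll);  // charset=Windows-1%2
console.log(stripSoft); // charset=Windows-1252
```

The short sample string used in this answer contains no soft line breaks, so removing all \r\n happens to be harmless there; for complete messages it is probably safer to leave line handling to the quoted-printable decoder.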

The following MCV example should show the process of cleaning and validating the content of a string that represents an email body:

    var quotedPrintable = require('quoted-printable');
    var windows1252 = require('windows-1252');

    const inputStr = 'This should be a pound sign: =A3 \r\nand this should be a long dash: =96\r\n';
    console.log(`The original string: "${inputStr}"`);

    // 1. clean the "Windows carriage returns" (\r\n)
    const cleandStr = inputStr.replace(/\r\n/g, '');
    console.log(`The string without carriage returns: "${cleandStr}"`);

    // 2. decode using the "quoted printable protocol"
    const decodedQp = quotedPrintable.decode(cleandStr);
    console.log(`The decoded QP string: "${decodedQp}"`);

    // 3. decode using the "windows-1252"
    const windows1252DecodedQp = windows1252.decode(decodedQp);
    console.log(`The windows1252 decoded QP string: "${windows1252DecodedQp}"`);

Which gives this output:

    The original string: "This should be a pound sign: =A3 
    and this should be a long dash: =96
    "
    The string without carriage returns: "This should be a pound sign: =A3 and this should be a long dash: =96"
    The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "
    The windows1252 decoded QP string: "This should be a pound sign: £ and this should be a long dash: –"

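Notice the "long dash character" as rendered before/after the Windows-1252 decoding phase.

Afaik, this has nothing to do with UTF-8 encoding/decoding. I was able to work out the "decoding order" of the procedure from this: https://github.com/mathiasbynens/quoted-printable/issues/5

One thing I am not sure about is whether the OS I'm running this piece of code on has some sort of influence on the charsets/encodings of files or string streams.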

The npm packages I used are quoted-printable and windows-1252.
