Removing Unicode character codes from Cheerio.js content

I am using cheeriojs to scrape content off a web page, working with the following HTML.

<p> Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday. <br> <br> “The outcome will be made public in due course,” John said in an SMS yesterday. <br> <br> </p>

I can get to the content of interest via class and id selectors, as follows:

 $('.top-stories .line.more').each(function (i, el) {
   // Do something…
   let content = $(this).next().html();
 });

Once I have captured the content of interest, I "clean" it using a regular expression, as follows:

 let cleanedContent = content.split(/<br>/).join(' \n ');

This inserts a newline wherever a line-break tag (<br>) is matched. So far so good, until I took a closer look at the resulting content:

 Although the PM&apos;s office could neither confirm nor deny this, the spokesperson, Saima Shaanika said the meeting took place on Friday. “The outcome will be made public in due course,”

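It looks as though punctuation marks, and perhaps some other characters, are stored as their Unicode codes. I may be wrong about this, and would welcome any correction to this line of thinking.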

Assuming they are stored as Unicode codes, is there a module I can pass the cleanedContent variable through that converts the codes into human-readable punctuation and characters?

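If that is not possible, is there a better way of using cheeriojs that avoids this? I am entirely open to the idea that I am not using cheeriojs correctly, and would appreciate some direction so that I can try a different approach.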

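One approach I can think of is to write a module containing a table of Unicode codes and their corresponding characters, then look for matches and replace each matched code with its human-readable character. I have a hunch that someone has already done this or something similar, and I would rather not reinvent the wheel.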

Thanks in advance.

Cheerio uses htmlparser2 internally.

You can therefore pass htmlparser2's decodeEntities option when loading the HTML string, which lets you configure how HTML entities are handled.

Example:

 $ = cheerio.load('<ul id="fruits">...</ul>', { decodeEntities: false });
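For a string that has already been captured with entities in it, the lookup idea from the question can be sketched in plain JavaScript. This is only a minimal illustration, not part of cheerio or htmlparser2: the helper name decodeHtmlEntities and the small NAMED_ENTITIES table are hypothetical, and the table covers just a handful of named entities. Numeric references such as &#x201C; or &#8220; are converted directly from their code point.

```javascript
// Hypothetical helper for illustration; not part of cheerio or htmlparser2.
// A deliberately small table of named entities; extend it as needed.
const NAMED_ENTITIES = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  apos: "'",
  nbsp: '\u00A0',
};

function decodeHtmlEntities(text) {
  // Matches hex (&#x201C;), decimal (&#8220;) and named (&apos;) references.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Named entity: keep the original text when it is not in the table.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Sample input modelled on the output shown in the question.
const raw = 'the PM&apos;s office said &#x201C;made public in due course,&#x201D;';
console.log(decodeHtmlEntities(raw));
```

Because the replacement is a single pass, already-escaped text such as &amp;#39; is decoded only one level. For real use, an established entity-decoding package is likely a better choice than a hand-maintained table.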

Related documentation:

Cheerio
htmlparser2
