Using XPath in Node.js

I am building a small documentation parser in Node.js. For testing, I have a raw HTML file, which is normally downloaded from the real website when the application runs.

From each section of Console.WriteLine, I want to extract the first code example that meets my constraint: it must be written in C#. To do that, I have this XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which are in this Gist.

In my Node.js application, I use xmldom and xpath to try to parse exactly the same information:

    var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
    var doc = new dom().parseFromString(rawHtmlString, 'text/html');
    var sampleNodes = xpath.select(exampleLookup, doc);

However, this returns nothing.

What could be going on here?

This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).

Looking at the xpath documentation, you should be able to use useNamespaces to bind the namespace to a prefix and then use that prefix in your XPath (untested):

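    // Assumes the same modules as in the question.
    var xpath = require('xpath');
    var dom = require('xmldom').DOMParser;

    var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
    var doc = new dom().parseFromString(rawHtmlString, 'text/html');
    // useNamespaces returns a select function that knows the "x" prefix.
    var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    // Call the returned function; plain xpath.select would not see the prefix binding.
    var sampleNodes = select(exampleLookup, doc);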

Instead of binding the namespace to a prefix, you can use local-name() in your XPath, but I don't recommend it. This is also covered in the documentation.

Example:

 //*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
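Writing those local-name() predicates by hand is easy to get wrong, so here is a minimal sketch of how the expression above could be assembled from plain element names. The helper name `localStep` is hypothetical (it is not part of the xpath package), and the sketch only builds the string; you would still pass the result to your select call.

```javascript
// Hypothetical helper (not from the xpath package): turns an element name
// into a step that matches the element regardless of its namespace.
function localStep(name) {
  return `*[local-name()='${name}']`;
}

// The id from the question, kept in one place for readability.
const id = 'System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_';

// Same path as the hand-written expression, built step by step.
const exampleLookup =
  `//*[@id='${id}']` +
  `/parent::${localStep('div')}` +
  `/following-sibling::${localStep('div')}` +
  `/${localStep('pre')}[position()>1]` +
  `/${localStep('code')}[contains(@class,'lang-csharp')]`;
```

This keeps the XPath readable, but it does not address why I prefer the prefix approach: local-name() matches an element of that name in any namespace, so it is less precise than a bound prefix.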
