Parsing HTML while preserving the original content

I have a lot of HTML files. I want to replace some elements and keep all the other content unchanged. For example, I would like to run this jQuery expression (or some equivalent of it):

    $('.header .title').text('my new content')

on the following HTML document:

    <div class=header><span class=title>Foo</span></div>
    <p>1<p>2
    <table><tr><td>1</td></tr></table>

and get the following result:

    <div class=header><span class=title>my new content</span></div>
    <p>1<p>2
    <table><tr><td>1</td></tr></table>

The problem is that every parser I have tried (Nokogiri, BeautifulSoup, html5lib) serializes it to something like this:

    <html>
    <head></head>
    <body>
    <div class=header><span class=title>my new content</span></div>
    <p>1</p><p>2</p>
    <table><tbody><tr><td>1</td></tr></tbody></table>
    </body>
    </html>

That is, they add:

html, head and body elements

closing p tags

tbody

Is there a parser that fits my needs? It should work in Node.js, Ruby or Python.

I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the very reliable lxml package (a Python binding to libxml2).

I believe it does exactly what you want, with a very familiar interface.

    from pyquery import PyQuery as pq

    html = '''
    <div class=header><span class=title>Foo</span></div>
    <p>1<p>2
    <table><tr><td>1</td></tr></table>
    '''

    doc = pq(html)
    doc('.header .title').text('my new content')
    print(doc)

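Output:

    <div><div class="header"><span class="title">my new content</span></div>
    <p>1</p><p>2
    </p><table><tr><td>1</td></tr></table></div>

The closing p tags can't be helped. lxml keeps only the values from the original document, not the quirks of how it was written. A paragraph can be written two ways, and lxml picks the more standard one when serializing. I don't believe you will find a (bug-free) parser that does better.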

Note: I'm on Python 3.

This only handles a subset of CSS selectors, but it may be enough for your purposes.

    from html.parser import HTMLParser

    debug = False  # module-level flag; set to True to trace the tag stack


    class AttrQuery():
        """A list of conditions that must all be met by some open element."""

        def __init__(self):
            self.repl_text = ""
            self.selectors = []

        def add_css_sel(self, seltext):
            # Supports "#id", ".class", "tag.class" and "tag", separated by spaces.
            sels = seltext.split(" ")
            for selector in sels:
                if selector[:1] == "#":
                    self.add_selector({"id": selector[1:]})
                elif selector[:1] == ".":
                    self.add_selector({"class": selector[1:]})
                elif "." in selector:
                    html_tag, html_class = selector.split(".")
                    self.add_selector({"html_tag": html_tag, "class": html_class})
                else:
                    self.add_selector({"html_tag": selector})

        def add_selector(self, selector_dict):
            self.selectors.append(selector_dict)

        def match_test(self, tagwithattrs_list):
            for selector in self.selectors:
                for condition in selector:
                    condition_value = selector[condition]
                    if not self._condition_test(tagwithattrs_list, condition, condition_value):
                        return False
            return True

        def _condition_test(self, tagwithattrs_list, condition, condition_value):
            for tagwithattrs in tagwithattrs_list:
                try:
                    if condition_value == tagwithattrs[condition]:
                        return True
                except KeyError:
                    pass
            return False


    class HTMLAttrParser(HTMLParser):
        """Records where matching text sits in the source, then splices in replacements."""

        def __init__(self, html, **kwargs):
            # Do not pass self explicitly: super() already binds it.
            super().__init__(**kwargs)
            self.tagwithattrs_list = []  # stack of currently open elements
            self.queries = []
            self.matchrepl_list = []
            self.html = html

        def handle_starttag(self, tag, attrs):
            tagwithattrs = dict(attrs)
            tagwithattrs["html_tag"] = tag
            self.tagwithattrs_list.append(tagwithattrs)
            if debug:
                print("push\t", end="")
                for attrname in tagwithattrs:
                    print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                print("")

        def handle_endtag(self, tag):
            # Pop until the matching open tag is found; this also discards
            # elements that were never closed explicitly, such as <p>.
            try:
                while True:
                    tagwithattrs = self.tagwithattrs_list.pop()
                    if debug:
                        print("pop \t", end="")
                        for attrname in tagwithattrs:
                            print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                        print("")
                    if tag == tagwithattrs["html_tag"]:
                        break
            except IndexError:
                raise IndexError("Found a close-tag for a non-existent element.")

        def handle_data(self, data):
            if self.tagwithattrs_list:
                for query in self.queries:
                    if query.match_test(self.tagwithattrs_list):
                        line, position = self.getpos()
                        length = len(data)
                        match_replace = (line-1, position, length, query.repl_text)
                        self.matchrepl_list.append(match_replace)

        def addquery(self, query):
            self.queries.append(query)

        def transform(self):
            split_html = self.html.split("\n")
            # Replace from the end backwards so earlier positions stay valid.
            self.matchrepl_list.reverse()
            if debug:
                print("\nreversed list of matches (line, position, len, repl_text):\n{}\n".format(self.matchrepl_list))
            for line, position, length, repl_text in self.matchrepl_list:
                oldline = split_html[line]
                newline = oldline[:position] + repl_text + oldline[position+length:]
                split_html = split_html[:line] + [newline] + split_html[line+1:]
            return "\n".join(split_html)

See the example usage below.

    html_test = """<div class=header><span class=title>Foo</span></div>
    <p>1<p>2
    <table><tr><td class=hi><div id=there>1</div></td></tr></table>"""

    debug = False
    parser = HTMLAttrParser(html_test)

    query = AttrQuery()
    query.repl_text = "Bar"
    query.add_selector({"html_tag": "div", "class": "header"})
    query.add_selector({"class": "title"})
    parser.addquery(query)

    query = AttrQuery()
    query.repl_text = "InTable"
    query.add_css_sel("table tr td.hi #there")
    parser.addquery(query)

    parser.feed(html_test)

    transformed_html = parser.transform()
    print("transformed html:\n{}".format(transformed_html))

Output:

    transformed html:
    <div class=header><span class=title>Bar</span></div>
    <p>1<p>2
    <table><tr><td class=hi><div id=there>InTable</div></td></tr></table>
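
One way to confirm that a transformation touched only the targeted text is to diff the original markup against the result. The following is a minimal sketch using Python's standard `difflib`. The helper name `changed_spans` is made up for this illustration, and the two strings are copied from the example above rather than produced by running the parser.

```python
import difflib

# Input and output copied from the example above (assumed, not re-generated).
original = (
    "<div class=header><span class=title>Foo</span></div>\n"
    "<p>1<p>2\n"
    "<table><tr><td class=hi><div id=there>1</div></td></tr></table>"
)
transformed = (
    "<div class=header><span class=title>Bar</span></div>\n"
    "<p>1<p>2\n"
    "<table><tr><td class=hi><div id=there>InTable</div></td></tr></table>"
)


def changed_spans(before, after):
    """Return (old_text, new_text) pairs for every region that differs."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


changes = changed_spans(original, transformed)
print(changes)  # [('Foo', 'Bar'), ('1', 'InTable')]
```

Any extra entry in that list, such as added `</p>` or `<tbody>` tags or re-quoted attributes, would suggest that the tool rewrote more than was asked.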

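OK, I have done this in a few languages, and I have to say the best parser I have seen that preserves whitespace and even HTML comments is:

Jericho, which is unfortunately Java.

That is, Jericho knows how to parse and preserve fragments.

Yes, I know it's Java, but you could easily build a RESTful service with a tiny bit of Java that takes the payload and converts it. Inside the Java REST service you could use JRuby, Jython, Rhino JavaScript and so on to coordinate with Jericho.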

You can use Nokogiri's HTML fragment for this:

    require 'nokogiri'

    fragment = Nokogiri::HTML.fragment('<div class=header><span class=title>Foo</span></div>
    <p>1<p>2
    <table><tr><td>1</td></tr></table>')

    fragment.css('.title').children.first.replace(Nokogiri::XML::Text.new('HEY', fragment))

    fragment.to_s #=> "<div class=\"header\"><span class=\"title\">HEY</span></div>\n<p>1</p><p>2\n</p><table><tr><td>1</td></tr></table>"

The problem with the p tags persists, because it is invalid HTML, but this should return the document without html, head, body or tbody tags.

With Python, using lxml.html is fairly straightforward (it meets points 1 and 3, but I don't think much can be done about point 2, and it handles the unquoted class= attributes):

    import lxml.html

    fragment = """<div class=header><span class=title>Foo</span></div>
    <p>1<p>2
    <table><tr><td>1</td></tr></table>
    """

    page = lxml.html.fromstring(fragment)
    # cssselect() relies on the separate cssselect package being installed.
    for span in page.cssselect('.header .title'):
        span.text = 'my new content'
    print(lxml.html.tostring(page, pretty_print=True, encoding='unicode'))

Result:

    <div>
    <div class="header"><span class="title">my new content</span></div>
    <p>1</p>
    <p>2
    </p>
    <table><tr><td>1</td></tr></table>
    </div>

This is a slightly separate solution, but if this is only for a few simple instances then perhaps CSS is the answer.

Generated Content

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
    <html>
      <head>
        <style type="text/css">
        #header.title1:first-child:before {
          content: "This is your title!";
          display: block;
          width: 100%;
        }
        #header.title2:first-child:before {
          content: "This is your other title!";
          display: block;
          width: 100%;
        }
        </style>
      </head>
      <body>
        <div id="header" class="title1">
          <span class="non-title">Blah Blah Blah Blah</span>
        </div>
      </body>
    </html>

In this case you can have jQuery swap the classes and you get the text change for free from the CSS. I haven't tested this particular usage, but it should work.

We use this for things like outage messages.

If you are running a Node.js app, this module will do what you want. It is a jQuery-style DOM manipulator: https://github.com/cheeriojs/cheerio

An example from their wiki:

    var cheerio = require('cheerio'),
        $ = cheerio.load('<h2 class="title">Hello world</h2>');

    $('h2.title').text('Hello there!');
    $('h2').addClass('welcome');

    $.html();
    //=> <h2 class="title welcome">Hello there!</h2>
