Tag: cheerio

Searching unstructured HTML with Node.js: I need to crawl/scrape a static, unstructured HTML page and extract its content in Node.js code. I tried cheerio and XPath without success. http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html The XPath of the first element I need is /html/body/center/center/table/tbody/tr[3], and then I need the text of each TD inside that TR. When I try to get the tbody node with var parser = new parse5.Parser(); var document = parser.parse(response.toString()); var xhtml = xmlser.serializeToString(document); var doc = new dom().parseFromString(xhtml); var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"}); var nodes = select("//x:tbody", doc); I always get back an empty [] node list. With cheerio I tried to iterate over the TR elements, but as mentioned above that failed too. var $ = cheerio.load(response); $('tr').each(function(i, e) { […]

Retrieving text from inside a div with Node.js: I'm currently trying to write a scraper in Node.js that gets all the 'p' tags inside a div. Every post on the page is in a div with this class: .text_exposed_root Sometimes there are multiple 'p' tags inside each post, so if possible I need to grab all the HTML text inside that div. I'm using the cheerio and request modules, and my code so far is as follows: request(BTTS, function(error, response, body){ if (!error){ var $ = cheerio.load(body), post = $(".text_exposed_root p").text(); console.log(post); } else { console.log("We've encountered an error: " + error); } }) I have tried .text, .val and .html, but they all just return a blank response. I'm guessing I need to grab all the 'p' tags in that section and convert them to a string, maybe? Thanks in advance. Edit: var url = ('https://www.facebook.com/BothTeamsToScore'); request({url:url, headers: headers}, function(error, response, body){ if (!error){ var strippedBody = body.replace(/<!--[\s\S]*?-->/g, "") console.log(strippedBody); var $ […]

Scraping an HTML table with Cheerio: I have a problem with a web scraping project. Here is a sample of the page I need to scrape: <table style="position…"> <thead>..</thead> <tbody id="leaderboard_body"> <tr bgcolor="#155555">..</tr> <tr bgcolor="#155555">..</tr> <tr bgcolor="#155555">..</tr> … </tbody> </table> For more details, see the following page: World Leaderboards. I want to access the information inside the tr tags, but I can't manage it. I can't even find the tbody tag with simple code like this, and I don't know why: var cheerio = require("cheerio"); var url = "http://www.dota2.com/leaderboards/?l=french#europe"; var http = require("http"); // Utility function that downloads a URL and invokes // callback with the data. function download(url, callback) { http.get(url, function(res) { var data […]

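Get all child nodes of a div with cheerio?: <div class="hello"> Text1 <li>Text2</li> <div class="bye">Text3</div> Text4 Block <div class="bye">Text5</div> Last Text5 </div> So I have the above, which I grab in cheerio with $('div.hello'). I want to iterate over it. How do I iterate over everything, including the text nodes? I tried $('div.hello').contents(), but it isn't picking up the text nodes ("Text1", "Text4 Block" and "Last Text5"). My end goal is basically to split the HTML block at the first element that has the class "bye". So I want an array holding the following HTML strings: final_array = ['Text1 <li>Text2</li>', '<div class="bye">Text3</div> Text4 Block <div class="bye">Text5</div> Last Text5']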

Removing Unicode characters from Cheerio.js content: I'm using cheeriojs to scrape content from a web page that has the following HTML. <p> Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday. <br> <br> “The outcome will be made public in due course,” John said in an SMS yesterday. <br> <br> </p> I can get the content of interest via class and id selectors as follows: $('.top-stories .line.more').each(function(i, el){ //Do something… let content = $(this).next().html(); }); Once I've captured the content of interest, I use a regular expression to "clean" it, as follows: […]

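Selecting an option by its text with cheerio (jQuery): I'm trying to use cheerio to select an option element from a select dropdown by its text(). What I'm doing so far is trying to set the selected attr on the element whose text matches a specific string. This is what I have: $('#ddl_city option[text="testing"]').attr('selected', 'selected'); When I do this, the HTML doesn't change. After running it I printed the HTML document to the console, but the option I was trying to select had not been marked as selected. Any idea why this doesn't work, or is there another way to do it?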

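Cheerio's each loop doesn't break early: The Cheerio / jQuery documentation states that returning false should break out of an each loop early. I have the following code: "use strict"; let cheerio = require('cheerio'); let $ = cheerio.load('<a href="www.test.com"></a><a href="www.test.com"></a>'); $('a').each(function(index,value){ console.log(index); return false; }); In my mind it should print 0 to the console, but it prints 0 and 1. What am I missing?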

Can't get an iframe with Cheerio: Hello, I'm trying to get an iframe from watchfree.to. This is the code I'm using for that purpose: function getSecondBody(url2) { url2 = 'http://www.watchfree.to/watch-366-Forrest-Gump-movie-online-free-putlocker.html'; request(url2, function(err, resp, body) { var $ = cheerio.load(body); var embedcode = $('.links_left_container'); embedcodetext2 = embedcode.html(); console.log(embedcodetext2); return embedcodetext2; }); } But the returned response does not contain the iframe I need. This is the response I receive: and the actual page looks like this: Only the iframe part is missing from my response.

jsdom / cheerio drastically change the HTML: I want to scrape a website, and I'm having a problem with jsdom and cheerio dramatically changing the HTML they are given. Most notably, they remove some tags, such as table / tr / td tags. To reproduce, just take a local file, say 1.html, and do: // with cheerio -> or equivalent with jsdom var $ = require('cheerio').load(fs.readFileSync(path)); fs.writeFileSync('2.html', $.html()); # bash $> diff 1.html 2.html ….. < <tr><td colspan="5"><a id="stats" name="stats"></a><div class="titlebar1" style="margin-top: 12px;margin-bottom: 4px;"><h2>Stats</h2><div class="element"><img src="img/element/10.png" /></div><div class="elementborder"><img src="img/elementborder.png" /></div></div></td></tr></table></td></div> --- > <tr><td colspan="5"><a id="stats" name="stats"></a><div class="titlebar1" style="margin-top: 12px;margin-bottom: 4px;"><h2>Stats</h2><div class="element"><img src="img/element/10.png"></div><div […]

Scraping nested XML with cheerio: I'm trying to use cheerio to scrape some PubMed data. The following script works fine, but when a certain XML tag is absent it produces wrong output. var request = require('request'), cheerio = require('cheerio'); request('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23545583,23103438', function(error, response, body) { var $ = cheerio.load(body); for (var i = 0; i < $('PubmedArticle').length; i++) { console.log($('PubmedArticle PMID').slice(0).eq(i).text()); console.log($('PubmedArticle DateCreated Year').slice(0).eq(i).text()); console.log($('PubmedArticle ArticleTitle').slice(0).eq(i).text()); console.log($('PubmedArticle Abstract AbstractText').slice(0).eq(i).text()); }; }); In this example, the second article's abstract is printed under the first title instead of the second, because the first article has no abstract.

