Incremental and non-incremental URLs in Node.js with cheerio and request

I am trying to scrape data from pages using cheerio and request in the following way (a rough sketch of this flow in code follows the list):

  • 1) Go to URL 1a (http://example.com/0)
  • 2) Extract URL 1b (http://example2.com/52)
  • 3) Go to URL 1b
  • 4) Extract some data and save it
  • 5) Go to URL 1a + 1 (http://example.com/1, let's call it 2a)
  • 6) Extract URL 2b (http://example2.com/693)
  • 7) Go to URL 2b
  • 8) Extract some data and save it, and so on…
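
Conceptually, the flow I have in mind looks something like the nested-callback sketch below. The example.com URLs are the placeholders from the list above, and the 'a.detail-link' selector and 'h1' data are made up purely for illustration, so this is only a sketch of the shape, not working code:

    var request = require('request');
    var cheerio = require('cheerio');

    // Sketch of the flow: fetch page i of site 1, pull out the linked URL to
    // site 2, fetch that page, save some data, then move on to page i + 1.
    function scrapePage(i) {
        request('http://example.com/' + i, function(error, response, html) {
            if (error || response.statusCode != 200) return;
            var $ = cheerio.load(html);
            var detailUrl = $('a.detail-link').attr('href'); // step 2: extract URL "1b"

            request(detailUrl, function(error2, response2, html2) {
                if (error2 || response2.statusCode != 200) return;
                var $$ = cheerio.load(html2);
                console.log(i, $$('h1').text());             // step 4: extract and save some data

                scrapePage(i + 1);                           // step 5: go to the next page
            });
        });
    }

    scrapePage(0);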

I am struggling to work out how to actually do this (note: I am only familiar with Node.js and cheerio/request for this task, even if it may not be elegant, so I am not looking for alternative libraries or languages, sorry). I think I am missing something, because I cannot even see how this could be made to work.


Edit

Let me try it another way. Here is the first part of the code:

    var request = require('request'),
        cheerio = require('cheerio');

    request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function(error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html, { xmlMode: true });
            var id = $('work').attr('id');           // id of the single <work> on this page
            var total = $('records').attr('total');  // total number of records to page through
        }
    });

The first returned page looks like this:

    <response>
      <query>date:[2000 TO 2014]</query>
      <zone name="book">
        <records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
          <work id="189231549" url="/work/189231549">
            <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
            <title>Design of physiological control and magnetic levitation systems for a total artificial heart</title>
            <contributor>Greatrex, Nicholas Anthony</contributor>
            <issued>2014</issued>
            <type>Thesis</type>
            <holdingsCount>1</holdingsCount>
            <versionCount>1</versionCount>
            <relevance score="0.001961126">vaguely relevant</relevance>
            <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
          </work>
        </records>
      </zone>
    </response>

The URL above needs to be incremented (s=0, s=1, and so on) 'total' number of times. The 'id' then needs to be fed into the URL below in a second request:

    request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function(error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html, { xmlMode: true });
            // extract data here etc.
        }
    });

For example, when using the id="189231549" returned by the first request, the second returned page looks like this:

 <work id="189231549" url="/work/189231549"> <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl> <title> Design of physiological control and magnetic levitation systems for a total artificial heart </title> <contributor>Greatrex, Nicholas Anthony</contributor> <issued>2014</issued> <type>Thesis</type> <subject>Total Artificial Heart</subject> <subject>Magnetic Levitation</subject> <subject>Physiological Control</subject> <abstract> Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation. </abstract> <language>English</language> <holdingsCount>1</holdingsCount> <versionCount>1</versionCount> <tagCount>0</tagCount> <commentCount>0</commentCount> <listCount>0</listCount> <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier> </work> 

So my question now is: how do I tie these two parts (loops) together so that I can download and parse the roughly 70,000 pages?

I am not sure how to code this in JavaScript for Node.js. I am new to JavaScript.

You can learn how to do this by studying existing well-known website copiers (closed source or open source).

For example, scrape your web pages with the trial version of http://www.tenmax.com/teleport/pro/home.htm and then try the same with http://www.httrack.com; you should get a clear picture of how they do it (and how you could do it yourself).

The key programming concepts are a lookup cache and a task queue.

Recursion is not a concept that will succeed if your solution needs to scale to multiple Node.js worker processes and to many, many pages.
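
To make those two concepts concrete, here is a minimal generic sketch (my own illustration, not tied to the Trove API): a task queue of URLs still to fetch combined with a lookup cache of URLs already seen, so the same page is never queued twice.

    var request = require('request');

    var queue = [];   // task queue: URLs waiting to be processed
    var seen = {};    // lookup cache: URLs that have already been queued

    function enqueue(url) {
        if (!seen[url]) {
            seen[url] = true;
            queue.push(url);
        }
    }

    function processNext() {
        if (queue.length === 0) return;   // nothing left to do
        var url = queue.shift();
        request(url, function(error, response, html) {
            if (!error && response.statusCode == 200) {
                // ...parse html here and enqueue() any newly discovered URLs...
            }
            processNext();                // take the next task from the queue
        });
    }

    enqueue('http://example.com/0');
    processNext();

Because the queue and the cache are plain data, they can later be moved into a database or shared between worker processes, which is what lets this approach scale where recursion does not.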

Edit: after clarification in the comments

Before you start reworking your scraping engine into a larger-scale architecture, as a new Node.js developer you can simply avoid the Node.js callback hell by using the wait.for package created by @lucio-m-tato.

The code below worked for me with the links you provided:

    var request = require('request');
    var cheerio = require('cheerio');
    var wait = require("wait.for");

    // Wrap request() so it matches the callback(err, result) convention that wait.for expects.
    function requestWaitForWrapper(url, callback) {
        request(url, function(error, response, html) {
            if (error)
                callback(error, response);
            else if (response.statusCode == 200)
                callback(null, html);
            else
                callback(new Error("Status not 200 OK"), response);
        });
    }

    // Fetch one result page (offset s) and return its work id plus the total record count.
    function readBookInfo(baseUrl, s) {
        var html = wait.for(requestWaitForWrapper, baseUrl + '&s=' + s.toString());
        var $ = cheerio.load(html, { xmlMode: true });

        return {
            s: s,
            id: $('work').attr('id'),
            total: parseInt($('records').attr('total'))
        };
    }

    // Fetch the full record for one work id and return the fields of interest.
    function readWorkInfo(id) {
        var html = wait.for(requestWaitForWrapper, 'http://api.trove.nla.gov.au/work/' + id.toString() + '?key=6k6oagt6ott4ohno&reclevel=full');
        var $ = cheerio.load(html, { xmlMode: true });

        return {
            title: $('title').text(),
            contributor: $('contributor').text()
        };
    }

    function main() {
        var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';
        var baseInfo = readBookInfo(baseBookUrl, 0);

        // Walk through every result page, then fetch and print the work behind it.
        for (var s = 0; s < baseInfo.total; s++) {
            var bookInfo = readBookInfo(baseBookUrl, s);
            var workInfo = readWorkInfo(bookInfo.id);
            console.log(bookInfo.id + ";" + workInfo.contributor + ";" + workInfo.title);
        }
    }

    wait.launchFiber(main);

On top of this you could use the async module to handle multiple requests and iterate over several pages concurrently. Read more about async here: https://github.com/caolan/async.
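
As a rough illustration of that idea (the fetchPage helper, the first-100-pages range and the concurrency limit of 5 are my own assumptions; the base URL is copied from the question), fetching a batch of result pages with async.eachLimit might look like this:

    var async = require('async');
    var request = require('request');
    var cheerio = require('cheerio');

    // Base URL copied from the question; '&s=' is appended per result page.
    var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';

    // Hypothetical helper: fetch one result page and log the work id it contains.
    function fetchPage(s, done) {
        request(baseBookUrl + '&s=' + s, function(error, response, html) {
            if (error || response.statusCode != 200) {
                return done(error || new Error('Status not 200 OK'));
            }
            var $ = cheerio.load(html, { xmlMode: true });
            console.log(s + ';' + $('work').attr('id'));
            done();
        });
    }

    // Process the first 100 pages as an example, with at most 5 requests in flight at once.
    var pages = [];
    for (var s = 0; s < 100; s++) pages.push(s);

    async.eachLimit(pages, 5, fetchPage, function(err) {
        if (err) console.error(err);
        else console.log('All pages processed');
    });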