Unable to crawl ally.com

I am able to crawl websites like nature.com and flipkart.com, and that works fine. But when I try to crawl ally.com or nike.com, it returns status code 403 and then says undefined. Here is my code:

    // crawlerqueue.js
    var request = require('request');
    var cheerio = require('cheerio');
    var URL = require('url-parse');
    var pa11y = require('pa11y');

    var START_URL = "http://www.nature.com/";
    //var SEARCH_WORD = "stemming";
    var MAX_PAGES_TO_VISIT = 100;

    var pagesVisited = {};
    var numPagesVisited = 0;
    var pagesToVisit = [];
    var url = new URL(START_URL);
    var baseUrl = url.protocol + "//" + url.hostname;

    pagesToVisit.push(START_URL);
    crawl();

    function crawl() {
      if(numPagesVisited >= MAX_PAGES_TO_VISIT) {
        console.log("Reached max limit of number of pages to visit.");
        return;
      }
      var nextPage = pagesToVisit.pop();
      if (nextPage in pagesVisited) {
        // We've already visited this page, so repeat the crawl
        crawl();
      } else {
        // New page we haven't visited
        visitPage(nextPage, crawl);
      }
    }

    function visitPage(url, callback) {
      // Add page to our set
      pagesVisited[url] = true;
      numPagesVisited++;

      // Make the request
      console.log("Visiting page " + url);
      request(url, function(error, response, body) {
        // Check status code (200 is HTTP OK)
        console.log("Status code: " + response.statusCode);
        if(response.statusCode !== 200) {
          callback();
          return;
        }
        // Parse the document body
        var $ = cheerio.load(body);
        /*var isWordFound = searchForWord($, SEARCH_WORD);
        if(isWordFound) {
          console.log('Word ' + SEARCH_WORD + ' found at page ' + url);
        } else*/ {
          collectInternalLinks($);
          // In this short program, our callback is just calling crawl()
          callback();
        }
      });
    }

    function searchForWord($, word) {
      var bodyText = $('html > body').text().toLowerCase();
      return(bodyText.indexOf(word.toLowerCase()) !== -1);
    }

    function collectInternalLinks($) {
      var relativeLinks = $("a[href^='/']");
      console.log("Found " + relativeLinks.length + " relative links on page");
      relativeLinks.each(function() {
        pagesToVisit.push(baseUrl + $(this).attr('href'));
      });
    }

I run this code from the command line. Here is the output for nature.com:

    Visiting page http://www.nature.com/
    Status code: 200
    Found 23 relative links on page
    Visiting page http://www.nature.com/scitable/sponsors
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/pressnews
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/contact
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/about
    Status code: 200
    Found 25 relative links on page
    Visiting page http://www.nature.com/scitable/my-profile/social-settings
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/photocredit
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/presscontact
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/presskit
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/pressroom
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/sponsorship
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/topicpage/copy-number-
    Status code: 200
    Found 89 relative links on page
    Reached max limit of number of pages to visit.

But when I try to crawl nike.com or ally.com, I get the following error:

    Visiting page http://www.ally.com
    Status code: 403
    Visiting page undefined
    C:\Users\dashboard-master\node_modules\request\index.js:45
        throw new Error('undefined is not a valid uri or options object.')
        ^
    Error: undefined is not a valid uri or options object.
        at request (C:\Users\dashboard-master\node_modules\request\index.js:45:11)
        at visitPage (C:\Users\dashboard-master\config\crawlqueue.js:41:3)
        at crawl (C:\Users\dashboard-master\config\crawlqueue.js:30:5)
        at Request._callback (C:\Users\dashboard-master\config\crawlqueue.js:45:8)
        at Request.self.callback (C:\Users\dashboard-master\node_modules\request\request.js:188:22)
        at emitTwo (events.js:106:13)
        at Request.emit (events.js:191:7)
        at Request.<anonymous> (C:\Users\dashboard-master\node_modules\request\request.js:1171:10)
        at emitOne (events.js:96:13)
        at Request.emit (events.js:188:7)

"It returns status code 403"

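ally.com sits behind an Akamai Ghost server, and Akamai is blocking the crawl in some way. It also gives out an error reference. You can see it in the response body or in the X-Reference-Error header; the one returned for me looks like 18.5fcxx917.148981xxxx.dacxsd6. If you want to dig deeper, you can look into their API for translating the error reference.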
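To see what the server actually sent back with the 403, one option is to log a summary of the response. The following is only a sketch: `describeBlockedResponse` is a hypothetical helper (not part of the request module), and it assumes the error reference arrives in the X-Reference-Error header mentioned above. Node.js lowercases incoming header names, hence the lowercase lookup.

```javascript
// Hypothetical helper: summarizes a blocked response for logging.
// `response` is expected to look like the object passed to the request
// callback (a statusCode plus a headers map with lowercased names).
function describeBlockedResponse(response, body) {
  var headers = (response && response.headers) || {};
  return {
    statusCode: response ? response.statusCode : undefined,
    server: headers['server'],
    // Header name taken from the observation above; it may be absent.
    errorReference: headers['x-reference-error'],
    // Only the first 200 characters, to keep the log readable.
    bodyPreview: String(body || '').slice(0, 200)
  };
}

// Possible usage inside the request callback:
// if (response.statusCode === 403) {
//   console.log(describeBlockedResponse(response, body));
// }
```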

"and says undefined"

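First, check for an error when making the request call. You read response.statusCode directly without knowing whether you actually have a response or an undefined value.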
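A minimal sketch of that check, assuming the usual `(error, response, body)` callback signature used in the question. `handleResponse`, `onPage` and `done` are hypothetical names introduced here for illustration; they are not part of the request module.

```javascript
// Hypothetical wrapper around the response-handling part of visitPage.
// onPage(body) runs only for a 200 response; done() always runs exactly once.
function handleResponse(error, response, body, onPage, done) {
  if (error || !response) {
    // On a network or DNS failure, response is undefined, so reading
    // response.statusCode here would throw.
    console.log("Request failed: " + (error ? error.message : "no response"));
    done();
    return;
  }
  console.log("Status code: " + response.statusCode);
  if (response.statusCode === 200) {
    onPage(body);
  }
  done();
}

// Possible usage in visitPage:
// request(url, function (error, response, body) {
//   handleResponse(error, response, body, function (html) {
//     collectInternalLinks(cheerio.load(html));
//   }, callback);
// });
```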

In your case, when you do not get a 200 you call the crawl function and return, which means no links were collected and there are no next pages to crawl.

 var nextPage = pagesToVisit.pop();

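Here you pop from an empty array (pagesToVisit is empty because you did not collect any links), so nextPage will be undefined. You then pass that as the uri to the request module, which makes the request module throw an error.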

What you can do is pop only when the array has length > 0, or check the value of nextPage like this:

 if(nextPage){
   if (nextPage in pagesVisited) {
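Put together, a guarded crawl function might look like the sketch below. It is not the original author's code: `nextUnvisited` is a hypothetical helper, the crawler state is passed in as an object, and `visitPage` is passed as a parameter so the example runs without the request module. Unlike the question's code, the page is marked as visited inside crawl rather than inside visitPage.

```javascript
var MAX_PAGES_TO_VISIT = 100;

// Hypothetical helper: returns the next unvisited URL, or undefined
// when the queue is exhausted. A loop is used instead of recursion.
function nextUnvisited(pagesToVisit, pagesVisited) {
  while (pagesToVisit.length > 0) {
    var candidate = pagesToVisit.pop();
    if (!(candidate in pagesVisited)) {
      return candidate;
    }
  }
  return undefined;
}

// crawl() with a guard: stops cleanly when there is nothing left to visit,
// so undefined is never handed to visitPage (and on to request).
function crawl(state, visitPage) {
  if (state.numPagesVisited >= MAX_PAGES_TO_VISIT) {
    console.log("Reached max limit of number of pages to visit.");
    return;
  }
  var nextPage = nextUnvisited(state.pagesToVisit, state.pagesVisited);
  if (!nextPage) {
    console.log("No more pages to visit.");
    return;
  }
  state.pagesVisited[nextPage] = true;
  state.numPagesVisited++;
  visitPage(nextPage, function () {
    crawl(state, visitPage);
  });
}
```

With this guard, a 403 on the start page ends the run with "No more pages to visit." instead of an exception from the request module.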
