Unable to crawl ally.com

I can crawl websites such as nature.com and flipkart.com, and that works fine. But when I try to crawl ally.com or nike.com, it returns status code 403 and then says undefined. Here is my code:

    // crawlerqueue.js
    var request = require('request');
    var cheerio = require('cheerio');
    var URL = require('url-parse');
    var pa11y = require('pa11y');

    var START_URL = "http://www.nature.com/";
    //var SEARCH_WORD = "stemming";
    var MAX_PAGES_TO_VISIT = 100;

    var pagesVisited = {};
    var numPagesVisited = 0;
    var pagesToVisit = [];
    var url = new URL(START_URL);
    var baseUrl = url.protocol + "//" + url.hostname;

    pagesToVisit.push(START_URL);
    crawl();

    function crawl() {
      if(numPagesVisited >= MAX_PAGES_TO_VISIT) {
        console.log("Reached max limit of number of pages to visit.");
        return;
      }
      var nextPage = pagesToVisit.pop();
      if (nextPage in pagesVisited) {
        // We've already visited this page, so repeat the crawl
        crawl();
      } else {
        // New page we haven't visited
        visitPage(nextPage, crawl);
      }
    }

    function visitPage(url, callback) {
      // Add page to our set
      pagesVisited[url] = true;
      numPagesVisited++;

      // Make the request
      console.log("Visiting page " + url);
      request(url, function(error, response, body) {
        // Check status code (200 is HTTP OK)
        console.log("Status code: " + response.statusCode);
        if(response.statusCode !== 200) {
          callback();
          return;
        }
        // Parse the document body
        var $ = cheerio.load(body);
        /*var isWordFound = searchForWord($, SEARCH_WORD);
        if(isWordFound) {
          console.log('Word ' + SEARCH_WORD + ' found at page ' + url);
        } else*/ {
          collectInternalLinks($);
          // In this short program, our callback is just calling crawl()
          callback();
        }
      });
    }

    function searchForWord($, word) {
      var bodyText = $('html > body').text().toLowerCase();
      return(bodyText.indexOf(word.toLowerCase()) !== -1);
    }

    function collectInternalLinks($) {
      var relativeLinks = $("a[href^='/']");
      console.log("Found " + relativeLinks.length + " relative links on page");
      relativeLinks.each(function() {
        pagesToVisit.push(baseUrl + $(this).attr('href'));
      });
    }

I run this code from the command line. Here is the output for nature.com:

    Visiting page http://www.nature.com/
    Status code: 200
    Found 23 relative links on page
    Visiting page http://www.nature.com/scitable/sponsors
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/pressnews
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/contact
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/about
    Status code: 200
    Found 25 relative links on page
    Visiting page http://www.nature.com/scitable/my-profile/social-settings
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/photocredit
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/presscontact
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/presskit
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/pressroom
    Status code: 200
    Found 26 relative links on page
    Visiting page http://www.nature.com/scitable/sponsorship
    Status code: 200
    Found 22 relative links on page
    Visiting page http://www.nature.com/scitable/topicpage/copy-number-
    Status code: 200
    Found 89 relative links on page
    Reached max limit of number of pages to visit.

But when I try to crawl nike.com or ally.com, I get the following error:

    Visiting page http://www.ally.com
    Status code: 403
    Visiting page undefined
    C:\Users\dashboard-master\node_modules\request\index.js:45
        throw new Error('undefined is not a valid uri or options object.')
        ^
    Error: undefined is not a valid uri or options object.
        at request (C:\Users\dashboard-master\node_modules\request\index.js:45:11)
        at visitPage (C:\Users\dashboard-master\config\crawlqueue.js:41:3)
        at crawl (C:\Users\dashboard-master\config\crawlqueue.js:30:5)
        at Request._callback (C:\Users\dashboard-master\config\crawlqueue.js:45:8)
        at Request.self.callback (C:\Users\dashboard-master\node_modules\request\request.js:188:22)
        at emitTwo (events.js:106:13)
        at Request.emit (events.js:191:7)
        at Request.<anonymous> (C:\Users\dashboard-master\node_modules\request\request.js:1171:10)
        at emitOne (events.js:96:13)
        at Request.emit (events.js:188:7)

"It returns status code 403"

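ally.com sits behind Akamai Ghost servers, and Akamai is blocking the crawl in some way while also handing back an error reference. You can see that reference in the response body or in the X-Reference-Error response header; the value returned to me looked like 18.5fcxx917.148981xxxx.dacxsd6. If you want to dig deeper, you can look at their API for translating error references.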
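To see what the server actually sent back, it can help to log a short summary of any non-200 response. The following is a minimal sketch, not part of the original code: `describeBlockedResponse` is a hypothetical helper name, and it assumes `response` looks like Node's `http.IncomingMessage`, where header names are lower-cased. The sample values are placeholders rather than real output.

```javascript
// Summarize a non-200 response so the reason for the block shows up in the log.
// Assumption: `response` has a statusCode and a headers object with
// lower-cased header names, as Node's http.IncomingMessage does.
function describeBlockedResponse(response, body) {
  var headers = response.headers || {};
  return {
    statusCode: response.statusCode,
    server: headers['server'],
    // Header name taken from the answer above; it may be absent.
    referenceError: headers['x-reference-error'],
    // Only the first 200 characters of the body, enough to spot an error page.
    bodyPreview: String(body || '').slice(0, 200)
  };
}

// Hand-made response object; the values are placeholders, not real output.
var fakeResponse = {
  statusCode: 403,
  headers: { 'server': 'AkamaiGHost', 'x-reference-error': '18.xxxx' }
};
console.log(describeBlockedResponse(fakeResponse, '<html>Access Denied</html>'));
```

Inside the request callback, this could be called just before `callback()` in the non-200 branch.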

"and says undefined"

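First, check the error argument when you make the request call. You read response.statusCode directly without knowing whether you actually have a response or an undefined value.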
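One way to express that check is a small guard that runs before anything touches `response.statusCode`. This is only a sketch: `shouldParse` is a hypothetical helper, not something the request module provides.

```javascript
// Decide whether the result of one request is worth parsing.
// Returns true only when there is no error, a response exists,
// and the status code is 200.
function shouldParse(error, response) {
  if (error || !response) {
    // Network failure, DNS error, timeout and so on: there is no status code to read.
    console.log("Request failed: " + (error ? error.message : "no response"));
    return false;
  }
  console.log("Status code: " + response.statusCode);
  return response.statusCode === 200;
}

// Intended use inside visitPage(), shown as comments because it needs the request module:
// request(url, function(error, response, body) {
//   if (!shouldParse(error, response)) { callback(); return; }
//   ...parse body and collect links...
// });
```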

In your case, whenever the status is not 200 you call the crawl function and return, so no links are collected and there are no next pages to crawl.

 var nextPage = pagesToVisit.pop(); 

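Here you are popping from an empty array (pagesToVisit is empty because no links were collected), so nextPage is undefined. You then pass that undefined value as the URI to the request module, which makes it throw the error.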

What you can do is pop only when the array has length > 0, or check the value of nextPage like this:

    if (nextPage) {
      if (nextPage in pagesVisited) {
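The snippet above is cut off, so here is a fuller sketch of how a guarded crawl() might look. It uses the length check rather than the nextPage check, and it replaces the question's visitPage() with a stand-in that makes no network request and collects no links, which mimics the blocked 403 case. The "No more pages to visit." message is an addition for illustration.

```javascript
var MAX_PAGES_TO_VISIT = 100;
var pagesVisited = {};
var numPagesVisited = 0;
var pagesToVisit = [];

// Stand-in for the question's visitPage(): it records the visit and calls back
// immediately without adding any links, as happens when the site answers 403.
function visitPage(url, callback) {
  pagesVisited[url] = true;
  numPagesVisited++;
  console.log("Visiting page " + url);
  callback();
}

function crawl() {
  if (numPagesVisited >= MAX_PAGES_TO_VISIT) {
    console.log("Reached max limit of number of pages to visit.");
    return;
  }
  // Guard: stop instead of popping undefined from an empty queue.
  if (pagesToVisit.length === 0) {
    console.log("No more pages to visit.");
    return;
  }
  var nextPage = pagesToVisit.pop();
  if (nextPage in pagesVisited) {
    // Already visited, move on to the next queued page
    crawl();
  } else {
    visitPage(nextPage, crawl);
  }
}

pagesToVisit.push("http://www.ally.com");
crawl();
```

With this guard the crawler stops cleanly once the queue is empty instead of passing undefined to the request module.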