How to print the HTML source code to the console with phantom-crawler

I just downloaded and installed phantom-crawler for nodejs. I copied and pasted the following script into a file named crawler.js:

    var Crawler = require('phantom-crawler');
    // Can be initialized with optional options object
    var crawler = new Crawler();
    // queue is an array of URLs to be crawled
    crawler.queue.push('https://google.com/');
    // Can also do `crawler.fetch(url)` instead of pushing it and crawling it
    // Extract plainText out of each phantomjs page
    Promise.all(crawler.crawl())
        .then(function(pages) {
            var texts = [];
            for (var i = 0; i < pages.length; i++) {
                var page = pages[i];
                // suffix Promise to return promises instead of callbacks
                var text = page.getPromise('plainText');
                texts.push(text);
                text.then(function(p) {
                    return function() {
                        // Pages are like tabs, they should be closed
                        p.close();
                    };
                }(page));
            }
            return Promise.all(texts);
        })
        .then(function(texts) {
            // texts = array of plaintext from the website bodies
            // also supports ajax requests
            console.log(texts);
        })
        .then(function() {
            // kill that phantomjs bridge
            crawler.phantom.then(function(p) {
                p.exit();
            });
        });

I want to print the complete HTML source code (in this case, of the Google page) to the console.

I have searched a lot but haven't found anything like this. How can I do it?

Get the `content` promise instead of `plainText`.

The phantom-crawler module uses the phantomjs module node-phantom-simple.

You can find the list of page properties you can request in the phantomjs wiki.
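Under the hood, node-phantom-simple exposes page properties through a callback-style `page.get(property, callback)`, and phantom-crawler's `getPromise` is just a promise wrapper around that. Here is a minimal sketch of such a wrapper, demonstrated with a hypothetical stub page object (so it runs without a phantomjs install); the stub's property values are made up for illustration:

```javascript
// Wrap the callback-style page.get(prop, cb) into a Promise,
// which is essentially what phantom-crawler's getPromise(prop) does.
function getProperty(page, prop) {
    return new Promise(function (resolve, reject) {
        page.get(prop, function (err, value) {
            if (err) reject(err);
            else resolve(value);
        });
    });
}

// Hypothetical stub standing in for a real phantomjs page,
// so the wrapper can be demonstrated without phantomjs.
var stubPage = {
    get: function (prop, cb) {
        var props = {
            content: '<html><body>hello</body></html>', // full HTML source
            plainText: 'hello'                          // text only
        };
        cb(null, props[prop]);
    }
};

// 'content' yields the full HTML source, 'plainText' only the visible text.
getProperty(stubPage, 'content').then(function (html) {
    console.log(html);
});
```

With a real page object from node-phantom-simple, `getProperty(page, 'content')` would resolve to the page's full HTML source the same way.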

    var Crawler = require('phantom-crawler');
    // Can be initialized with optional options object
    var crawler = new Crawler();
    // queue is an array of URLs to be crawled
    crawler.queue.push('https://google.com/');
    // Can also do `crawler.fetch(url)` instead of pushing it and crawling it
    // Extract the full HTML out of each phantomjs page
    Promise.all(crawler.crawl())
        .then(function(pages) {
            var allHtml = [];
            for (var i = 0; i < pages.length; i++) {
                var page = pages[i];
                // suffix Promise to return promises instead of callbacks
                var html = page.getPromise('content');
                allHtml.push(html);
                html.then(function(p) {
                    return function() {
                        // Pages are like tabs, they should be closed
                        p.close();
                    };
                }(page));
            }
            return Promise.all(allHtml);
        })
        .then(function(allHtml) {
            // allHtml = array of full HTML source from the pages
            // also supports ajax requests
            console.log(allHtml);
        })
        .then(function() {
            // kill that phantomjs bridge
            crawler.phantom.then(function(p) {
                p.exit();
            });
        });