Java Crawler Techniques: How to Build a Web Crawler in Java

天津卓眾教育  2022-03-18 11:30:01


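This tutorial shows how to build a simple web crawler prototype in Java. Building a web crawler is not as hard as it sounds. Follow this guide and you should have one running in an hour or less, after which you can start collecting the large amount of information it can gather for you. Because this is only a prototype, expect to spend more time customizing it for your own needs.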

The prerequisites for this tutorial are:

·Basic Java programming

·Some knowledge of SQL and the MySQL database.

If you do not want to use a database, you can use a file to track the crawl history instead.
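For readers who take the file route, the following is a minimal sketch of what that could look like. The class name FileHistory, the method markVisited, and the one-URL-per-line file layout are assumptions made for illustration; none of them come from the tutorial itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: tracks visited URLs in a text file, one URL per line. */
public class FileHistory {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    public FileHistory(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            // Reload the history written by an earlier run.
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /** Returns true if the URL was new and has now been recorded. */
    public boolean markVisited(String url) throws IOException {
        if (!seen.add(url)) {
            return false; // already processed, nothing to write
        }
        byte[] line = (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(file, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("crawl-history", ".txt");
        FileHistory history = new FileHistory(file);
        System.out.println(history.markVisited("http://www.mit.edu")); // prints true
        System.out.println(history.markVisited("http://www.mit.edu")); // prints false
    }
}
```

The in-memory set keeps lookups fast, while the file lets a later run pick up where the previous one stopped. For a very large crawl the set may not fit in memory, which is one reason the rest of this tutorial uses a database.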

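1. Goal

The goal of this tutorial is as follows:

Given a school's root URL, for example "mit.edu", return all pages from that school that contain the string "research".

A typical crawler works in the following steps:

1. Parse the root web page ("mit.edu") and get all links from that page. To visit each URL and parse the HTML page, I will use JSoup, a convenient web page parser written in Java.

2. Take the URLs retrieved in step 1 and parse those pages in the same way.

3. While doing the above, we need to keep track of which pages have already been processed, so that each web page is processed only once. This is why we need a database.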
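The three steps above can be sketched without any network access or database. In this illustration a Map stands in for "fetch a page and extract its links", and a HashSet stands in for the database of processed pages. The class name CrawlLoopSketch, the method crawl, and the page paths in the sample data are all made up for the example, and the code assumes Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlLoopSketch {
    /** Visits every page reachable from root exactly once and returns the visit order. */
    public static List<String> crawl(String root, Map<String, List<String>> linksByPage) {
        Set<String> visited = new HashSet<>();      // step 3: pages already processed
        Deque<String> frontier = new ArrayDeque<>(); // pages waiting to be processed
        List<String> order = new ArrayList<>();
        frontier.add(root);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            if (!visited.add(page)) {
                continue; // seen before, skip it
            }
            order.add(page);
            // steps 1 and 2: collect this page's links and queue the new ones
            for (String link : linksByPage.getOrDefault(page, List.of())) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Made-up link structure, for illustration only.
        Map<String, List<String>> fakeSite = Map.of(
                "mit.edu", List.of("mit.edu/research", "mit.edu/about"),
                "mit.edu/research", List.of("mit.edu", "mit.edu/about"),
                "mit.edu/about", List.of("mit.edu"));
        // prints [mit.edu, mit.edu/research, mit.edu/about]
        System.out.println(crawl("mit.edu", fakeSite));
    }
}
```

This sketch uses an explicit queue rather than recursion. On a large site, a recursive crawl can nest very deeply, so a queue is one way to keep the call stack flat.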

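2. Set up the MySQL database

If you are using Ubuntu, you can follow this guide to install Apache, MySQL, PHP, and phpMyAdmin.

If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in about a minute, and move on to the next step.

I will use phpMyAdmin to work with the MySQL database. It is simply a GUI for MySQL. If you use another tool, or no GUI tool at all, that is fine too.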

3. Create a database and a table

Create a database named "Crawler" and a table named "Record", as follows:

CREATE TABLE IF NOT EXISTS `Record` (
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
  `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

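4. Start crawling with Java

1) Download the JSoup core library.

2) Now create a project named "Crawler" in your IDE, and add the JSoup and mysql-connector jar files you downloaded to the Java Build Path. (In Eclipse: right-click the project -> select "Build Path" -> "Configure Build Path" -> click the "Libraries" tab -> click "Add External JARs".)

3) Create a class named "DB", which handles the database operations.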

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {
    public Connection conn = null;

    public DB() {
        try {
            // Legacy driver class name; Connector/J 8 and later use com.mysql.cj.jdbc.Driver
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            // Replace the user name and password with your own
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Runs a query and returns its result set
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // Runs a statement that is not a query, such as TRUNCATE
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // Close the connection only if it exists and is still open
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}

4) Create a class named "Main", which will be our crawler.

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        // Start from an empty history on every run
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in database
        // (parameterized so that quotes in a URL cannot break the query)
        PreparedStatement check = db.conn.prepareStatement("SELECT 1 FROM Record WHERE URL = ?");
        check.setString(1, URL);
        ResultSet rs = check.executeQuery();
        if (rs.next()) {
            // already processed, nothing to do
        } else {
            // store the URL to database to avoid parsing again
            String sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            // get useful information (fetch the page passed in, not the root page)
            Document doc = Jsoup.connect(URL).get();
            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            // get all links and recursively call the processPage method
            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                if (link.attr("href").contains("mit.edu")) {
                    processPage(link.attr("abs:href"));
                }
            }
        }
    }
}
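The link filter in the crawler checks whether the href contains "mit.edu" anywhere in the string. That also matches links to other sites that merely mention mit.edu in a path or query string. As one possible refinement, the sketch below compares the host part of the URL instead. The class name HostFilter and the method isOnDomain are hypothetical and not part of JSoup or of the tutorial's code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class HostFilter {
    /** Returns true if the URL's host is the given domain or one of its subdomains. */
    public static boolean isOnDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // relative links, mailto: links and similar have no host
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (URISyntaxException e) {
            return false; // malformed link, treat as off-site
        }
    }

    public static void main(String[] args) {
        System.out.println(isOnDomain("http://www.mit.edu/research", "mit.edu"));     // prints true
        System.out.println(isOnDomain("http://example.com/?ref=mit.edu", "mit.edu")); // prints false
    }
}
```

In the crawler, a check of this kind would be applied to the absolute form of the link (the value of link.attr("abs:href")), since a relative href carries no host of its own.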