Extracting Information from Web Pages with PHP: Somewhat Less Efficient than XPath

Published: 2023-09-06 01:06 | Editor: 胡小海 | Keywords: PHP

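This post uses PHP to fetch web pages and collect information from them, in other words to scrape data. The steps are as follows. First, include two files, curl_html_get.php and save_file.php. The code in curl_html_get.php is: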

<?php

// Fetch the body of $url with cURL.
// Returns the content, or FALSE on failure or an empty response.
function curl_get_file_contents($url)
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_URL, $url);
    $contents = curl_exec($c);
    curl_close($c);

    if ($contents)
        return $contents;
    else
        return FALSE;
}
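A slightly more defensive variant of the fetch helper could look like the sketch below. The function name fetch_html, the timeout values, and the decision to follow redirects and treat any non-2xx status as a failure are assumptions made for illustration; none of them come from the original script.

```php
<?php
// Hypothetical helper (not part of the original post): fetch a page with
// timeouts and an HTTP status check. Returns the body, or false on failure.
function fetch_html(string $url, int $timeoutSeconds = 10)
{
    $c = curl_init();
    curl_setopt_array($c, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,      // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed overall
    ]);
    $body   = curl_exec($c);
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
    // The handle is released when $c goes out of scope.

    if ($body === false || $status < 200 || $status >= 300) {
        return false;
    }
    return $body;
}
```

It would be called the same way as the original helper, for example `$text = fetch_html($url);`, with the caller checking for `false` before saving anything.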

The contents of save_file.php are:

<?php

/**
 * Create a directory, including any missing parent directories.
 * @param string $dir  directory path
 * @param int    $mode permission bits (octal)
 * @return boolean
 */
function make_dir($dir, $mode = 0777) {
    if (!$dir)
        return false;
    if (!file_exists($dir)) {
        return mkdir($dir, $mode, true);
    } else {
        return true;
    }
}

/**
 * Save a file.
 * @param string $filename file name (including relative path)
 * @param string $text     file content
 * @return boolean true on success, false otherwise
 */
function save_file($filename, $text) {
    if (!$filename || !$text)
        return false;
    $dirname = dirname($filename);
    if (make_dir($dirname)) {
        // Overwrites an existing file; pass FILE_APPEND as a third argument to append instead.
        // file_put_contents returns the number of bytes written, or false on failure.
        return file_put_contents($filename, $text) !== false;
    }
    return false;
}

In short, one file fetches the page content and the other writes files to disk.
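One detail in make_dir deserves a closer look: mkdir expects the mode as an integer, which is normally written as an octal literal. The short sketch below (the variable names are illustrative) shows how the string "0777" and the octal literal 0777 differ once PHP converts them to integers.

```php
<?php
// The string "0777" converts to the decimal integer 777; the leading zero is ignored.
$fromString = (int) "0777";
// The literal 0777 is parsed as octal, which is decimal 511.
$fromOctal = 0777;

printf("%d -> octal %s\n", $fromString, decoct($fromString));
printf("%d -> octal %s\n", $fromOctal, decoct($fromOctal));
```

Passing the string form to mkdir would therefore request mode 1411 rather than 777, so the integer literal is the safer choice.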

Next comes the main PHP code. Define a function of your own; its body is roughly as follows:

// Placeholder function name (the declaration is not shown in the source).
// THIS_PATH and DS (base path and directory separator) and the helper
// get_level1_url_and_name() are assumed to be defined elsewhere.
function crawl_brand_list() {
    echo "==================start=======================<br />";

    // 1. Fetch the page (reuse the local copy if it was already downloaded)
    $path = THIS_PATH . "download";
    $url = "http://10.maigoo.com/list_1187.html";
    $html_pathname = $path . DS;
    $html_filename = $html_pathname . "list_1187.htm";
    if (!file_exists($html_filename)) {
        $text = curl_get_file_contents($url);
        save_file($html_filename, $text);
    } else {
        $text = file_get_contents($html_filename);
    }

    // 2. Cut out the region of interest
    // start marker
    $start = '<div class="b-brand-nlist hoverdetail">';
    // end marker
    $end = '<div id="copyright">';
    $pos_start = strpos($text, $start);
    $pos_end = ($pos_start === false) ? false : strpos($text, $end, $pos_start);
    if ($pos_end === false) {
        die("===============region markers not found===================<br />");
    }
    $pos_end += strlen($end);
    $content = substr($text, $pos_start, $pos_end - $pos_start);
    save_file($html_pathname . "list_1187.html", $content);

    // 3. Match all top-level categories
    $pattern = '@<div class="aclist">.*<div class="clear"></div>@Usi';
    if (!preg_match_all($pattern, $content, $matches)) {
        die("===============not match anything===================<br />");
    }
    echo "=========================================<br />";

    $index = 0;
    foreach ($matches[0] as $pinpai_cate) {
        // Save each category block to its own numbered file
        save_file($html_pathname . $index . ".html", $pinpai_cate);
        $index++;

        // Get the URL and name of the top-level category
        get_level1_url_and_name($pinpai_cate, $cate1_url, $cate1_name);

        // echo "==================one brand=======================<br />";
        // Match each brand within the current category block
        $pattern = '@<li addbg="#400143".*</li>@Usi';
        if (preg_match_all($pattern, $pinpai_cate, $brand_matches)) {
            foreach ($brand_matches[0] as $one_brand) {
                // handle a single brand entry here
            }
        }
    }
    echo "==================end=======================<br />";
}
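Steps 2 and 3 can be tried in isolation on a small inline string. The sketch below is only illustrative: extract_between is a hypothetical helper, and the sample markup is invented to resemble the structure the script expects rather than copied from the real page.

```php
<?php
// Hypothetical helper: return the slice of $text from $start through $end
// (both markers included), or false when either marker is missing.
function extract_between(string $text, string $start, string $end)
{
    $posStart = strpos($text, $start);
    if ($posStart === false) {
        return false;
    }
    // Look for the end marker only after the start marker.
    $posEnd = strpos($text, $end, $posStart + strlen($start));
    if ($posEnd === false) {
        return false;
    }
    return substr($text, $posStart, $posEnd + strlen($end) - $posStart);
}

// Invented sample markup, for illustration only.
$sample = '<p>header</p><div class="list">'
        . '<div class="aclist">A</div><div class="clear"></div>'
        . '<div class="aclist">B</div><div class="clear"></div>'
        . '</div><div id="copyright">footer</div>';

$region = extract_between($sample, '<div class="list">', '<div id="copyright">');

// U makes .* lazy, s lets . match newlines, i ignores case.
$pattern = '@<div class="aclist">.*<div class="clear"></div>@Usi';
$count = preg_match_all($pattern, $region, $matches);

echo $count, " category blocks found\n";
```

Returning false when a marker is missing lets the caller stop early instead of slicing at a meaningless offset.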

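The basic principle is to download the page to a local file first, then cut out the relevant region, and finally match with regular expressions. I did not tune the code while working on this, so it is too long and repeats itself in many places. If a regular expression still cannot pin down the target within the cut-out region, or the region contains many repeated patterns, you have to cut again and filter out the noise, which is tedious. The code also needs to be split into more functions; working through that cleanup is what will really improve one's skills.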
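Since the title compares this approach with XPath, here is a minimal sketch of the same kind of extraction using PHP's bundled DOMDocument and DOMXPath classes. The markup and the query are invented for illustration and would need adapting to the real page.

```php
<?php
// Invented sample markup; the real page structure may differ.
$html = '<html><body><div class="aclist">'
      . '<ul><li><a href="/brand/1.html">Brand One</a></li>'
      . '<li><a href="/brand/2.html">Brand Two</a></li></ul>'
      . '</div></body></html>';

// Collect parser warnings internally; real-world HTML is rarely well formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$brands = [];
// Select every link inside a list item under the category block.
foreach ($xpath->query('//div[@class="aclist"]//li/a') as $link) {
    $brands[] = [
        'name' => trim($link->textContent),
        'url'  => $link->getAttribute('href'),
    ];
}
print_r($brands);
```

A single query replaces the cut-then-regex sequence, at the cost of parsing the whole document into a tree first.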


Original article: http://www.cnblogs.com/xinyu2017/p/7424458.html
