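Pessimism
Posted by admin
I miss the feeling of days bathed in sunshine. The pressures of life have turned that brilliance into a scorching sun, and I have become someone who sweats under it with his head bowed. A pessimist always looks at problems from the side of loss, and the future he can see is plainly not a bright one. Quietly, I have reached thirty. If I quietly idle the time away again, who knows how many more years will pass.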
A roundup of ways to fetch web page content with PHP
Posted by admin
① Fetching web page content with PHP
http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html
header("Content-type: text/html; charset=utf-8");
1. Using the MSXML2.XMLHTTP COM object (Windows only)
$xhr = new COM("MSXML2.XMLHTTP");
$xhr->open("GET", "http://localhost/xxx.php?id=2", false);
$xhr->send();
echo $xhr->responseText;
2. Using file_get_contents
<?php
$url = "http://www.blogjava.net/pts";
echo file_get_contents($url);
?>
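file_get_contents returns false when the request fails, and the one-liner above neither checks for that nor sets a timeout. A minimal sketch of a more defensive wrapper follows; the helper name fetch_page, the 10-second timeout, and the user agent string are assumptions for illustration, not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL or local path
// with a read timeout, returning null instead of false on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,      // read timeout in seconds
            'user_agent' => 'ExampleFetcher/0.1', // assumed user agent string
        ],
    ]);
    // @ suppresses the warning; the return value is checked instead
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Demonstrated on a local temporary file so that it runs without network
// access; an http:// URL is passed the same way when allow_url_fopen is on.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<html>hello</html>');
$page = fetch_page($tmp);
$missing = fetch_page($tmp . '.does-not-exist');
unlink($tmp);
echo $page, "\n";
```

The 'http' options only take effect for http:// and https:// URLs; for a local path they are ignored.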
3. Using fopen()
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}
if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
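The second and third arguments of stream_get_contents are the maximum length and the starting offset. The sketch below shows both on an in-memory stream, so it can be run without a network connection; the sample string is made up for illustration.

```php
<?php
// Write 20 known bytes into an in-memory stream (sample data, not a real page)
$stream = fopen('php://memory', 'w+');
fwrite($stream, '0123456789abcdefghij');

// Length -1 means "read to the end"; offset 10 skips the first 10 bytes
$tail = stream_get_contents($stream, -1, 10);

// Read at most 5 bytes, starting again from offset 0
$head = stream_get_contents($stream, 5, 0);

fclose($stream);
echo $tail, "\n"; // abcdefghij
echo $head, "\n"; // 01234
```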
② Fetching web page content with PHP
http://www.blogjava.net/pts/archive/2007/08/26/99188.html
This source gives the same two snippets as items 2 and 3 above: file_get_contents for the simple case, or fopen combined with stream_get_contents.
③ PHP source for fetching a site's content and saving it as a TXT file
http://blog.chinaunix.net/u1/44325/showart_348444.html
<?php
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
// ereg() was removed in PHP 7, so preg_match() is used throughout
preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook);
$my_book_txt = $myBook[0]; // base URL of the book's directory
$file_handle = fopen($my_book_url, 'r'); // open the index page
if (file_exists('test.txt')) {
    unlink('test.txt'); // start from an empty output file
}
$handle = fopen('test.txt', 'a'); // open the output file once, not once per line
while (!feof($file_handle)) { // loop to the end of the index page
    $line = fgets($file_handle); // read one line
    // look for links to the book's chapter pages
    if (preg_match('/href="([0-9]+\.html)/', $line, $reg)) {
        $my_book_txt_over_url = $my_book_txt . $reg[1]; // build the chapter URL
        echo "$my_book_txt_over_url</p>"; // show progress
        $file_handle_txt = fopen($my_book_txt_over_url, 'r'); // open the chapter page
        while (!feof($file_handle_txt)) {
            $line_txt = fgets($file_handle_txt);
            // Keep only lines that start with the content marker. The marker and the
            // first and last filters below were garbled when the post was published;
            // &nbsp; and &quot; are assumed here.
            if (preg_match('/^&nbsp;.+/', $line_txt, $reg_txt)) {
                $my_over_txt = str_replace('&nbsp;', ' ', $reg_txt[0]); // filter characters
                $my_over_txt = str_replace('<br />', '', $my_over_txt);
                $my_over_txt = str_replace('<script language="javascript">', '', $my_over_txt);
                $my_over_txt = str_replace('&quot;', '', $my_over_txt);
                fwrite($handle, "$my_over_txt\n"); // write to the file
            }
        }
        fclose($file_handle_txt); // close each chapter page after reading it
    }
}
fclose($handle);
fclose($file_handle); // close the index page
echo "Done</p>";
?>
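The link-extraction step of the script above can be tried on its own, without touching the network. This is a minimal sketch: the function name chapter_urls, the sample HTML, and the book.example.com base URL are all made up for illustration.

```php
<?php
// Hypothetical helper: collect links of the form href="123.html" from an
// index page and turn them into full chapter URLs under $base.
function chapter_urls(string $base, string $html): array
{
    preg_match_all('/href="([0-9]+\.html)"/', $html, $matches);
    $urls = [];
    foreach (array_unique($matches[1]) as $file) {
        $urls[] = $base . $file;
    }
    return $urls;
}

// Sample index markup (invented); the third link is not a numeric chapter page
$sample = '<a href="101.html">One</a> <a href="102.html">Two</a> <a href="/about.html">About</a>';
$urls = chapter_urls('http://book.example.com/files/4/4550/', $sample);
print_r($urls);
```

Using preg_match_all on the whole page avoids the assumption, made by the line-by-line loop, that each line holds at most one link.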
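Next comes a rather more heavy-duty approach.
It uses a class called Snoopy.
I first came across it here:
The Snoopy package for fetching web page content in PHP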
http://blog.declab.com/read.php/27.htm
Then there is the official Snoopy site:
http://sourceforge.net/projects/snoopy/
Some brief notes are here:
Code collection: the Snoopy class and simple usage
http://blog.passport86.com/?p=161
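Download: http://sourceforge.net/projects/snoopy/
I only discovered this gem today and downloaded it right away to have a look; it uses parse_url.
I am still more used to curl.
Snoopy is a PHP class that mimics a web browser; it can fetch web page content and submit forms.
Some of its features:
1. Easily fetch the content of a web page
2. Easily fetch the text of a web page (HTML stripped)
3. Easily fetch the links in a web page
4. Supports proxy hosts
5. Supports basic user/password authentication
6. Supports custom user agent, referer, cookies, and header content
7. Supports browser redirects, with control over the redirect depth
8. Expands links in the page to fully qualified URLs (default)
9. Easily submit data and retrieve the response
10. Supports following HTML frames (added in v0.92)
11. Supports passing cookies on redirects
See the documentation included in the download for usage details.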
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"] = "normal";
$submit_vars["styleid"] = "1";
$submit_vars["cookietime"] = "315360000";
$submit_vars["loginfield"] = "username";
$submit_vars["username"] = "********"; // your username
$submit_vars["password"] = "*******"; // your password
$submit_vars["questionid"] = "0";
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "提 交"; // value of the submit button on the target form
$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;
?>
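For comparison, a form like this can also be posted without Snoopy, using PHP's built-in HTTP stream wrapper. The sketch below only builds the request options; the helper name post_options, the 10-second timeout, and the sample field values are assumptions, and this is not how Snoopy does it internally.

```php
<?php
// Hypothetical helper: turn an array of form fields into stream-context
// options for a URL-encoded POST request.
function post_options(array $vars): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($vars, '', '&'), // encoded form body
            'timeout' => 10, // assumed timeout in seconds
        ],
    ];
}

// Sample fields (invented values)
$options = post_options([
    'loginfield'  => 'username',
    'username'    => 'alice',
    'loginsubmit' => 'Log in',
]);
echo $options['http']['content'], "\n";

// To send the request (not executed here):
// $html = file_get_contents($submit_url, false, stream_context_create($options));
```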
Below is the Snoopy README.
NAME:
Snoopy - the PHP net client v1.2.4
SYNOPSIS:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchtext("http://www.php.net/");
print $snoopy->results;
$snoopy->fetchlinks("http://www.phpbuilder.com/");
print $snoopy->results;
$submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
$submit_vars["q"] = "amiga";
$submit_vars["submit"] = "Search!";
$submit_vars["searchhost"] = "Altavista";
$snoopy->submit($submit_url,$submit_vars);
print $snoopy->results;
$snoopy->maxframes=5;
$snoopy->fetch("http://www.ispi.net/");
echo "<PRE>\n";
echo htmlentities($snoopy->results[0]);
echo htmlentities($snoopy->results[1]);
echo htmlentities($snoopy->results[2]);
echo "</PRE>\n";
$snoopy->fetchform("http://www.altavista.com");
print $snoopy->results;
DESCRIPTION:
What is Snoopy?
Snoopy is a PHP class that simulates a web browser. It automates the
task of retrieving web page content and posting forms, for example.
Some of Snoopy’s features:
* easily fetch the contents of a web page
* easily fetch the text from a web page (strip html tags)
* easily fetch the links from a web page
* supports proxy hosts
* supports basic user/pass authentication
* supports setting user_agent, referer, cookies and header content
* supports browser redirects, and controlled depth of redirects
* expands fetched links to fully qualified URLs (default)
* easily submit form data and retrieve the results
* supports following html frames (added v0.92)
* supports passing cookies on redirects (added v0.92)
REQUIREMENTS:
Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
which should be PHP 3.0.9 and up. For read timeout support, it requires
PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.
CLASS METHODS:
fetch($URI)
-----------
This is the method used for fetching the contents of a web page.
$URI is the fully qualified URL of the page to fetch.
The results of the fetch are stored in $this->results.
If you are fetching frames, then $this->results
contains each frame fetched in an array.
fetchtext($URI)
---------------
This behaves exactly like fetch() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.
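To give a feel for what "only the text" means, here is a rough sketch of stripping a page down to text with PHP built-ins. It is not Snoopy's implementation; the helper name page_text and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text.
function page_text(string $html): string
{
    // Remove script and style blocks, including their contents
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Drop the remaining tags, then decode entities such as &amp;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces
    return trim(preg_replace('/\s+/', ' ', $text));
}

$sample = "<h1>Title</h1>\n<script>var x = 1;</script>\n<p>Hello &amp; welcome</p>";
$text = page_text($sample);
echo $text, "\n"; // Title Hello & welcome
```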
fetchform($URI)
---------------
This behaves exactly like fetch() except that it only returns
the form elements from the page, stripping out html tags and other
irrelevant data.
fetchlinks($URI)
----------------
This behaves exactly like fetch() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.
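The conversion of relative links can be sketched as follows. This is a simplified illustration and not Snoopy's code: the helper name expand_link and the example.com URLs are assumptions, and it handles neither "../" segments nor protocol-relative links such as "//host/path".

```php
<?php
// Hypothetical helper: expand a link found on the page at $base into a
// fully qualified URL.
function expand_link(string $base, string $link): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link; // already fully qualified
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($link, '/') === 0) {
        return $root . $link; // relative to the site root
    }
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the base page
    return $root . $dir . $link;
}

$base = 'http://www.example.com/docs/index.html';
$a = expand_link($base, 'page2.html');
$b = expand_link($base, '/top.html');
$c = expand_link($base, 'http://other.example.org/x.html');
echo $a, "\n", $b, "\n", $c, "\n";
```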
submit($URI,$formvars)
----------------------
This submits a form to the specified $URI. $formvars is an
array of the form variables to pass.
submittext($URI,$formvars)
--------------------------
This behaves exactly like submit() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.
submitlinks($URI)
-----------------
This behaves exactly like submit() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.
CLASS VARIABLES: (default value in parenthesis)
$host the host to connect to
$port the port to connect to
$proxy_host the proxy host to use, if any
$proxy_port the proxy port to use, if any
$agent the user agent to masquerade as (Snoopy v0.1)
$referer referer information to pass, if any
$cookies cookies to pass if any
$rawheaders other header info to pass, if any
$maxredirs maximum redirects to allow. 0=none allowed. (5)
$offsiteok whether or not to allow redirects off-site. (true)
$expandlinks whether or not to expand links to fully qualified URLs (true)
$user authentication username, if any
$pass authentication password, if any
$accept http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error where errors are sent, if any
$response_code response code returned from server
$headers headers returned from server
$maxlength max return data length
$read_timeout timeout on read operations (requires PHP 4 Beta 4+)
set to 0 to disallow timeouts
$timed_out true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes number of frames we will follow
$status http status of fetch
$temp_dir temp directory that the webserver can write to. (/tmp)
$curl_path system path to cURL binary, set to false if none
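The $read_timeout and $timed_out pair corresponds to two stream functions that PHP itself provides. The sketch below shows that pairing on an in-memory stream so that it runs offline; the helper name read_with_timeout is hypothetical, and how Snoopy wires these together internally is not shown here.

```php
<?php
// Hypothetical helper: read everything from an open stream with a read
// timeout, and report whether the timeout was hit.
function read_with_timeout($stream, int $seconds): array
{
    // Takes effect on socket-based streams; other stream types ignore it
    stream_set_timeout($stream, $seconds);
    $data = stream_get_contents($stream);
    $meta = stream_get_meta_data($stream);
    return ['data' => $data, 'timed_out' => $meta['timed_out']];
}

// Stand-in for a network response (sample data)
$stream = fopen('php://memory', 'w+');
fwrite($stream, 'response body');
rewind($stream);
$result = read_with_timeout($stream, 5);
fclose($stream);
var_dump($result['timed_out']);
```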
EXAMPLES:
Example: fetch a web page and display the return headers and
the contents of the page (html-escaped):
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->user = "joe";
$snoopy->pass = "bloe";
if($snoopy->fetch("http://www.slashdot.org/"))
{
    echo "response code: ".$snoopy->response_code."<br>\n";
    // each() was removed in PHP 8, so foreach is used instead
    foreach($snoopy->headers as $key => $val)
        echo $key.": ".$val."<br>\n";
    echo "<p>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
    echo "error fetching document: ".$snoopy->error."\n";
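The user and pass settings in this example use HTTP basic authentication, which boils down to a single request header. A minimal sketch of building that header by hand follows; the helper name basic_auth_header is an assumption, and the options array shows one way it could be passed to PHP's HTTP stream wrapper rather than anything Snoopy does.

```php
<?php
// Hypothetical helper: build the HTTP Basic authentication header, which
// carries "user:password" encoded as base64.
function basic_auth_header(string $user, string $pass): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $pass);
}

$header = basic_auth_header('joe', 'bloe');
$options = ['http' => ['header' => $header . "\r\n"]];
echo $header, "\n";

// To send the request (not executed here):
// $html = file_get_contents($url, false, stream_context_create($options));
```

Base64 is an encoding, not encryption, so basic authentication protects the password only when used over HTTPS.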
Example: submit a form and print out the result headers
and html-escaped page:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
$submit_vars["q"] = "amiga";
$submit_vars["submit"] = "Search!";
$submit_vars["searchhost"] = "Altavista";
if($snoopy->submit($submit_url,$submit_vars))
{
    // each() was removed in PHP 8, so foreach is used instead
    foreach($snoopy->headers as $key => $val)
        echo $key.": ".$val."<br>\n";
    echo "<p>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
    echo "error fetching document: ".$snoopy->error."\n";
Example: showing functionality of all the variables:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->proxy_host = "my.proxy.host";
$snoopy->proxy_port = "8080";
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.microsnot.com/";
// quoted as a string: the stray trailing "l" in the original is a syntax error
$snoopy->cookies["SessionID"] = "238472834723489";
$snoopy->cookies["favoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;
$snoopy->user = "joe";
$snoopy->pass = "bloe";
if($snoopy->fetchtext("http://www.phpbuilder.com"))
{
    // each() was removed in PHP 8, so foreach is used instead
    foreach($snoopy->headers as $key => $val)
        echo $key.": ".$val."<br>\n";
    echo "<p>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
    echo "error fetching document: ".$snoopy->error."\n";
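Several of these settings have rough counterparts among the options of PHP's built-in HTTP stream wrapper. The sketch below assembles cookies, raw headers, the user agent, and a redirect limit into such an options array. The helper name request_options is hypothetical, URL-encoding the cookie values is an assumption, and PHP counts max_redirects differently from Snoopy's maxredirs, so the number does not map one-to-one.

```php
<?php
// Hypothetical helper: build stream-context options from browser-like settings.
function request_options(array $cookies, array $rawHeaders, string $agent, int $maxRedirects): array
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode((string) $value);
    }
    $lines = [];
    if ($pairs) {
        $lines[] = 'Cookie: ' . implode('; ', $pairs); // all cookies in one header
    }
    foreach ($rawHeaders as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return ['http' => [
        'user_agent'    => $agent,
        'header'        => implode("\r\n", $lines) . "\r\n",
        'max_redirects' => $maxRedirects,
    ]];
}

$options = request_options(
    ['SessionID' => '238472834723489', 'favoriteColor' => 'RED'],
    ['Pragma' => 'no-cache'],
    '(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)',
    2
);
echo $options['http']['header'];
```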
Example: fetch framed content and display the results
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->maxframes = 5;
if($snoopy->fetch("http://www.ispi.net/"))
{
    echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
    echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
}
else
    echo "error fetching document: ".$snoopy->error."\n";
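Following frames means finding the src of each frame in the frameset page and fetching those URLs in turn, up to a limit. The extraction step might look roughly like the sketch below; the helper name frame_sources and the sample frameset are made up, and Snoopy's own parsing may differ.

```php
<?php
// Hypothetical helper: list the src attributes of <frame> tags, keeping at
// most $maxFrames of them.
function frame_sources(string $html, int $maxFrames): array
{
    preg_match_all('/<frame\b[^>]*\bsrc\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
    return array_slice($m[1], 0, $maxFrames);
}

// Sample frameset page (invented)
$html = '<frameset cols="20%,80%"><frame name="nav" src="menu.html">'
      . '<frame src="main.html"><frame src="ads.html"></frameset>';
$sources = frame_sources($html, 2);
print_r($sources);
```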
PHP Snoopy manual (translated from the Chinese version)
Posted by admin
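What is Snoopy?
Snoopy is a PHP class that mimics a web browser; it can fetch web page content and submit forms.
Some of Snoopy's features:
* Easily fetch the content of a web page
* Easily fetch the text of a web page (HTML tags stripped)
* Easily fetch the links in a web page
* Supports proxy hosts
* Supports basic username/password authentication
* Supports setting user_agent, referer, cookies, and header content
* Supports browser redirects, with control over the redirect depth
* Expands links in the page to fully qualified URLs (default)
* Easily submit data and retrieve the response
* Supports following HTML frames (added in v0.92)
* Supports passing cookies on redirects (added in v0.92)
Requirements: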
Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
which should be PHP 3.0.9 and up. For read timeout support, it requires
PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.
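Class methods:
fetch($URI)
-----------
This is the method used to fetch the content of a web page.
The $URI parameter is the URL of the page to fetch.
The result of the fetch is stored in $this->results.
If the page you are fetching is a frameset, Snoopy follows each frame, collects them in an array, and stores that array in $this->results.
fetchtext($URI)
---------------
This method is like fetch(), except that it strips HTML tags and other irrelevant data and returns only the text of the page.
fetchform($URI)
---------------
This method is like fetch(), except that it strips HTML tags and other irrelevant data and returns only the form elements of the page.
fetchlinks($URI)
----------------
This method is like fetch(), except that it strips HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are completed automatically and converted to full URLs.
submit($URI,$formvars)
----------------------
This method submits a form to the address given by $URI. $formvars is an array holding the form parameters.
submittext($URI,$formvars)
--------------------------
This method is like submit(), except that it strips HTML tags and other irrelevant data and returns only the text of the page returned after the form is submitted.
submitlinks($URI)
-----------------
This method is like submit(), except that it strips HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are completed automatically and converted to full URLs.
Class properties: (default values in parentheses)
$host the host to connect to
$port the port to connect to
$proxy_host the proxy host to use, if any
$proxy_port the proxy port to use, if any
$agent the user agent to masquerade as (Snoopy v0.1)
$referer referer information, if any
$cookies cookies, if any
$rawheaders other header information, if any
$maxredirs maximum number of redirects, 0=none allowed (5)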
$offsiteok whether or not to allow redirects off-site. (true)
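$expandlinks whether or not to expand links to full addresses (true)
$user authentication username, if any
$pass authentication password, if any
$accept http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error where errors are reported, if any
$response_code response code returned from the server
$headers headers returned from the server
$maxlength maximum length of the returned data
$read_timeout timeout on read operations (requires PHP 4 Beta 4+)
set to 0 for no timeout
$timed_out true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes maximum number of frames to follow
$status http status of the fetch
$temp_dir temporary directory that the web server can write to (/tmp)
$curl_path path to the cURL binary, set to false if there is no cURL binary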