
A question about multi-threaded scraping with streams: experts, please take a look

php · 搞代码 · 2022-01-23

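Scraping content is slow, and it has been giving me a headache. I have recently been looking into multi-threaded scraping, and the comparison code is posted below. I have two questions: first, the lengths of the fetched results do not quite match; second, is this still not efficient enough? Please help me analyze and test it!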

PHP code

<?php
$timeStart = microtimeFloat();

function microtimeFloat() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

$data = '';
$urls = array(
    'http://www.tzksgs.com/news/2012-09/article-217.html',
    'http://www.tzksgs.com/news/2012-09/article-219.html',
    'http://www.tzksgs.com/news/2012-09/article-222.html'
);

// Test 1: fetch the pages one after another with file_get_contents()
foreach ($urls as $url) {
    echo strlen(file_get_contents($url)), '<br>';
}
$timeEnd = microtimeFloat();
echo sprintf("Spend time: %s second(s)\n", $timeEnd - $timeStart), '<br>';

// Test 2: open one socket per URL and read them all with stream_select()
$timeStart = microtimeFloat();
$timeout = 30;
$status = array();
$retdata = array();
$sockets = array();
$userAgent = $_SERVER['HTTP_USER_AGENT'];

foreach ($urls as $id => $url) {
    $tmp = parse_url($url);
    $host = $tmp['host'];
    $path = isset($tmp['path']) ? $tmp['path'] : '/';
    empty($tmp['query']) or $path .= '?' . $tmp['query'];
    if (empty($tmp['port'])) {
        $port = $tmp['scheme'] == 'https' ? 443 : 80;
    } else {
        $port = $tmp['port'];
    }
    $fp = stream_socket_client("$host:$port", $errno, $errstr, 30);
    if (!$fp) {
        $status[$id] = "failed, $errno $errstr";
    } else {
        $status[$id] = "in progress";
        $retdata[$id] = '';
        $sockets[$id] = $fp;
        fwrite($fp, "GET $path HTTP/1.1\r\nHost: $host\r\nUser-Agent: $userAgent\r\nConnection: Close\r\n\r\n");
    }
}

// Now, wait for the results to come back in
while (count($sockets)) {
    // stream_select() takes its arrays by reference, so pass real variables
    $read = $sockets;
    $write = null;
    $except = null;
    // This is the magic function
    if (stream_select($read, $write, $except, $timeout)) {
        // readable sockets either have data for us, or are failed connection attempts
        foreach ($read as $r) {
            $id = array_search($r, $sockets);
            $data = fread($r, 8192);
            if (strlen($data) == 0) {
                if ($status[$id] == "in progress") {
                    $status[$id] = "failed to connect";
                }
                fclose($r);
                unset($sockets[$id]);
            } else {
                $retdata[$id] .= $data;
            }
        }
    }
}

// Strip the response headers and print the length of each body
foreach ($retdata as $data) {
    $data = trim(substr($data, strpos($data, "\r\n\r\n") + 4));
    echo strlen($data), '<br>';
}
$timeEnd = microtimeFloat();
echo sprintf("Spend time: %s second(s)\n", $timeEnd - $timeStart);
?>

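---- Solution ----
You could try curl_multi_…. to run the requests concurrently.
That keeps the number of PHP-level instructions as low as possible. As for the issues the two posters above raised, those are definitely not something PHP can solve.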
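
For reference, here is a minimal sketch of what the curl_multi approach might look like. The function name `fetchAll`, its return shape, and the choice of options are assumptions made for illustration, not something taken from this thread. It needs the curl extension.

```php
<?php
/**
 * Fetch several URLs concurrently with the curl_multi_* functions.
 * Hypothetical helper: returns an array keyed like $urls, where each
 * value is the response body, or null if no HTTP response arrived.
 */
function fetchAll(array $urls, int $timeout = 30): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $id => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$id] = $ch;
    }

    // Drive all transfers until none of them is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for socket activity instead of spinning in a busy loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $id => $ch) {
        // An HTTP code of 0 means the transfer never received a response.
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$id] = $code === 0 ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Calling `fetchAll($urls)` with the `$urls` array from the question and printing `strlen()` of each non-null entry should give numbers that can be compared directly with the file_get_contents() loop, since curl handles the response headers and transfer encoding itself.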

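---- Solution ----
Of course: file_get_contents() is blocking, so running several fetch tasks one after another will naturally be slow.
socket_*(), fsockopen(), and stream_*(), on the other hand, can be used without blocking (for example through stream_select(), as in your code), so several transfers can be in progress at once.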
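
On the length mismatch, one possible explanation (an assumption, nobody in this thread confirmed it): the hand-written request says HTTP/1.1, so the server is allowed to answer with `Transfer-Encoding: chunked`, and the chunk-size lines then get counted as part of the body. The `trim()` call in the question also drops leading and trailing whitespace that file_get_contents() keeps. Sending `HTTP/1.0` in the request line avoids chunked responses; alternatively the body can be decoded by hand. The helpers `extractBody` and `decodeChunked` below are hypothetical names for a rough sketch of that decoding:

```php
<?php
/**
 * Return the body of a raw HTTP response, undoing chunked transfer
 * encoding when the headers announce it. Hypothetical helper.
 */
function extractBody(string $raw): string
{
    $pos = strpos($raw, "\r\n\r\n");
    if ($pos === false) {
        return ''; // no header/body separator found
    }
    $headers = substr($raw, 0, $pos);
    $body = substr($raw, $pos + 4);

    if (preg_match('/^Transfer-Encoding:\s*chunked\s*$/mi', $headers)) {
        return decodeChunked($body);
    }
    return $body;
}

/**
 * Decode a chunked body: each chunk is "<hex size>\r\n<data>\r\n",
 * and a chunk of size 0 ends the body.
 */
function decodeChunked(string $body): string
{
    $out = '';
    $offset = 0;
    $length = strlen($body);

    while ($offset < $length) {
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // truncated input
        }
        // The size is hexadecimal; ignore any ";extension" suffix.
        $sizeField = substr($body, $offset, $lineEnd - $offset);
        $size = (int) hexdec(trim(explode(';', $sizeField, 2)[0]));
        if ($size === 0) {
            break; // last chunk
        }
        $out .= substr($body, $lineEnd + 2, $size);
        $offset = $lineEnd + 2 + $size + 2; // skip the data and its trailing CRLF
    }
    return $out;
}
```

In the question's final loop, `strlen(extractBody($data))` could replace the `trim(substr(...))` line to see whether the lengths then agree with the file_get_contents() results.
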
---- Solution ----
How slow is it, exactly?

Try adding this:

$context = stream_context_create(array('http' => array('header' => 'Connection: close')));
file_get_contents("…..", false, $context);
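
To check whether the `Connection: close` header makes a difference, each fetch can be timed separately. The sketch below is only an illustration: `makeCloseContext` and `timedFetch` are hypothetical helper names, and the `timeout` option is an extra that the suggestion above does not mention.

```php
<?php
/**
 * Build a stream context that asks the server to close the connection
 * after the response. Hypothetical helper; 'timeout' is an added assumption.
 */
function makeCloseContext(float $timeout = 30.0)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => 'Connection: close',
            'timeout' => $timeout, // seconds before a read gives up
        ],
    ]);
}

/**
 * Fetch one URL and report the body length and the elapsed time.
 * 'bytes' is null when the fetch failed.
 */
function timedFetch(string $url, $context): array
{
    $start = microtime(true);
    $body = @file_get_contents($url, false, $context);

    return [
        'bytes'   => $body === false ? null : strlen($body),
        'seconds' => microtime(true) - $start,
    ];
}
```

Running `timedFetch($url, makeCloseContext())` for each URL in the question, and comparing against the same call with a default context from `stream_context_create()`, should show whether the blocking version is slow because of the connection handling or simply because the requests run one after another.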

