Scraping the CoX Dev Digest
February 2nd, 2009
Even though I’m out of the game, I still browse the fora every now and then, and I saw a thread about RSS for the dev digest and PHP. So I figured I’d post my code here, in case anyone wants it.
File: requestfile.function
<?php
// Sends a bare HTTP/1.0 GET for $url over a socket and returns the parsed
// status line and response headers. The response body is read but not kept.
function RequestFile($url) {
    $parts = parse_url($url);
    $fp = @fsockopen($parts['host'], 80, $errno, $errstr, 30);
    if(!$fp) {
        die("Error while requesting: ($errno) $errstr");
    }
    // Default to "/" and re-attach the query string to the request path.
    if(!isset($parts['path']) || '' == $parts['path']) {
        $parts['path'] = '/';
    }
    if(isset($parts['query']) && '' != $parts['query']) {
        $parts['path'] .= '?' . $parts['query'];
    }
    $nl = "\r\n";
    $out = "GET $parts[path] HTTP/1.0".$nl."Host: $parts[host]".$nl.'User-Agent: Sapph RSS Generator'.$nl;
    $out .= 'Accept: */*'.$nl.'Accept-Language: en-gb'.$nl.'Connection: close'.$nl.$nl;
    $result = array('headers'=>array(), 'status_version'=>'', 'status_code'=>'', 'status_phrase'=>'');
    fwrite($fp, $out);
    $inHead = true;
    // Read the response line by line; the first blank line ends the headers.
    while(!feof($fp)) {
        $line = fgets($fp, 4096);
        if('' == trim($line)) {
            $inHead = false;
            continue;
        }
        if($inHead) {
            // Status line, e.g. "HTTP/1.0 200 OK"; anything else is "Key: value".
            if(preg_match("!^HTTP/([^\\s]+)\\s+([0-9]{3})\\s+(.*)\$!i", trim($line), $matches)) {
                $result['status_version'] = $matches[1];
                $result['status_code'] = $matches[2];
                $result['status_phrase'] = $matches[3];
            }
            else {
                list($key, $val) = explode(':', trim($line), 2);
                $result['headers'][$key] = trim($val);
            }
        }
    }
    fclose($fp);
    return($result);
}
?>
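As posted, RequestFile() keeps only the status line and headers, which is why the scraper below falls back on file_get_contents(). For anyone who wants to see the parsing on its own, here is a rough sketch that applies the same status-line and header logic to a response held in a string, and also keeps the body. ParseHttpResponse and the sample response are made up for illustration; they are not part of my original code.

```php
<?php
// Hypothetical helper (not in the original post): parses a raw HTTP response
// string the way RequestFile() parses the socket stream, but keeps the body too.
function ParseHttpResponse($raw) {
    $result = array('headers' => array(), 'status_version' => '',
                    'status_code' => '', 'status_phrase' => '', 'body' => '');
    // The first blank line separates the headers from the body.
    $split = explode("\r\n\r\n", $raw, 2);
    $result['body'] = isset($split[1]) ? $split[1] : '';
    foreach (explode("\r\n", $split[0]) as $line) {
        if (preg_match('!^HTTP/(\S+)\s+([0-9]{3})\s+(.*)$!i', $line, $m)) {
            // Status line, e.g. "HTTP/1.0 200 OK".
            $result['status_version'] = $m[1];
            $result['status_code'] = $m[2];
            $result['status_phrase'] = $m[3];
        } elseif (strpos($line, ':') !== false) {
            // Header line of the form "Key: value".
            list($key, $val) = explode(':', $line, 2);
            $result['headers'][trim($key)] = trim($val);
        }
    }
    return $result;
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hello</html>";
$parsed = ParseHttpResponse($sample);
echo $parsed['status_code'] . ' ' . $parsed['status_phrase'] . "\n";
echo $parsed['headers']['Content-Type'] . "\n";
echo $parsed['body'] . "\n";
```

Splitting on the first blank line up front avoids the per-line state flag, at the cost of holding the whole response in memory, which is fine for pages of this size.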
File: devdigestparse.php
<?php
// The dev digest thread on the official boards.
$filename="http://boards.cityofheroes.com/showflat.php?Number=312062";
include("requestfile.function");
//$response = RequestFile($filename);
$data = file_get_contents($filename);
// Every digest row starts with a link to the poster's profile, so counting
// those links counts the rows.
$count = substr_count($data, '<td><a href="http://boards.cityofheroes.com/showprofile.php?Cat=&User=');
// Cut the page down to the block that holds the digest rows.
$poscurrent=strpos($data, '<td width="83%" class="lighttable">');
$posend=strpos($data, '<table border="0" cellpadding="0" cellspacing="0" width="100%" align="left">');
$datasubset = substr($data, $poscurrent, $posend-$poscurrent);
if ($count>22) $count=22;
include('closetags.function');
for ($i=0;$i<$count;$i++) {
    // Dev name: the text inside the red <font> tag (22 is the length of the opening tag).
    $devstart=strpos($datasubset, '<font color="#FF0000">')+22;
    $devend=strpos($datasubset, '</font>');
    $dev[$i]=substr($datasubset, $devstart, $devend-$devstart);
    // Link to the post (9 skips past '<a href="'); decode &amp; back to &.
    $urlstart=strpos($datasubset, '<a href="http://boards.cityofheroes.com/showflat.php')+9;
    $urlend=strpos($datasubset, '">', $urlstart);
    $urldev[$i]=str_replace('&amp;', '&', substr($datasubset, $urlstart, $urlend-$urlstart));
    // Subject: the link text.
    $subjectstart=$urlend+2;
    $subjectend=strpos($datasubset, '</a></td><td>', $subjectstart);
    $subject[$i]=substr($datasubset, $subjectstart, $subjectend-$subjectstart);
    // Date: the table cell right after the subject.
    $datestart=$subjectend+13;
    $dateend=strpos($datasubset, '</td>', $datestart);
    $datedev[$i]=substr($datasubset, $datestart, $dateend-$datestart);
    // Drop everything up to the next row so the next pass starts there.
    $nextsubset = strpos($datasubset, '<td><a href="http://boards.cityofheroes.com/showprofile.php?Cat=&User=', $devstart);
    $currlen = strlen($datasubset);
    $datasubset = substr($datasubset, $nextsubset, $currlen-$nextsubset);
    // Post ID: whatever follows '#Post' in the URL.
    $urllen = strlen($urldev[$i]);
    $poststart = strpos($urldev[$i], '#Post')+5;
    $postid[$i] = substr($urldev[$i], $poststart, $urllen-$poststart);
    // With ?test in the query string, also fetch each post and print a 300-character preview.
    if (isset($_GET['test'])) {
        $filename=$urldev[$i];
        //$response = RequestFile($filename);
        $data = file_get_contents($filename);
        $startmarker = '<a name="Post'.$postid[$i];
        $poststart = strpos($data, $startmarker);
        $poststart = strpos($data, '<font class="post">', $poststart);
        $postend = strpos($data, '--------------------', $poststart);
        $fullpost = substr($data, $poststart, $postend-$poststart);
        $postpointer = 0;
        //echo $poststart." : ".$postend."<br><hr><br>";
        while (strpos($fullpost, '<font class="post">', $postpointer)!==FALSE) {
            $poststart = strpos($fullpost, '<font class="post">', $postpointer);
            $postpointer = $poststart+19;
        }
        $killtags = array('<b>', '</b>', '<i>', '</i>', '<hr>', '<br><br><br>');
        $post[$i]=str_replace('</hr></br></br></br></hr>', '', closetags(substr($fullpost, 0, 300))).". . .";
        $post[$i]=str_replace($killtags, '', $post[$i]);
        echo $post[$i]."<br><hr><br>";
    }
}
?>
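The whole scraper boils down to one move repeated over and over: find a start marker, find an end marker, and take whatever sits between them. A minimal sketch of that move as a reusable function is below. extractBetween and the sample row are hypothetical; the row is made up to match the markers the script looks for, not copied from the real boards.

```php
<?php
// Hypothetical helper: returns the text between $startMarker and $endMarker,
// searching from $offset, or null when either marker is missing.
function extractBetween($haystack, $startMarker, $endMarker, $offset = 0) {
    $start = strpos($haystack, $startMarker, $offset);
    if ($start === false) {
        return null;
    }
    // Skip past the marker itself instead of hardcoding its length.
    $start += strlen($startMarker);
    $end = strpos($haystack, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Made-up digest row, shaped like the markers used in devdigestparse.php.
$row = '<td><font color="#FF0000">SomeDev</font></td>'
     . '<td><a href="http://boards.cityofheroes.com/showflat.php?Number=1#Post1">'
     . 'A subject</a></td><td>02/02/09</td>';

$devName = extractBetween($row, '<font color="#FF0000">', '</font>');
$postUrl = extractBetween($row, '<a href="', '">');
$postSubject = extractBetween($row, $postUrl . '">', '</a></td><td>');
$postDate = extractBetween($row, '</a></td><td>', '</td>');
// The post ID is everything after '#Post' in the URL.
$postNumber = substr($postUrl, strpos($postUrl, '#Post') + 5);

echo "$devName | $postSubject | $postDate | $postNumber\n";
```

Using strlen() on the marker replaces the magic numbers (22, 9, 13) in my script, and the null return makes a missing marker obvious instead of silently producing a garbage substring.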
Note: this file does not actually create an XML/RSS/ATOM/whathaveyou feed. It scrapes when the page is called and prints the formatted results of the scrape in-browser. Modification to cache scrapes and write to a file in the RSS format du jour is left as an exercise for the reader. However, IMO, the hard work (the HTML parsing) has been done.
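For anyone taking up that exercise, here is one rough way it could go: build a minimal RSS 2.0 document from the scraped values and keep it in a cache file so the boards are not hit on every request. Everything in it is an assumption on my part: buildRssFeed, cachedFeed, the cache path, the ten-minute lifetime, and the sample item are all made up, and dates are left out because the digest's date strings would first need converting to the format RSS expects.

```php
<?php
// Hypothetical helper: escapes text for use inside an XML element.
function xmlText($text) {
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: builds a minimal RSS 2.0 document. Each item is an
// array with 'title', 'link' and 'description' keys.
function buildRssFeed($channelTitle, $channelLink, array $items) {
    // The XML declaration is split so no PHP closing tag appears in the source.
    $xml  = '<' . '?xml version="1.0" encoding="UTF-8"?' . '>' . "\n";
    $xml .= '<rss version="2.0"><channel>' . "\n";
    $xml .= '<title>' . xmlText($channelTitle) . '</title>' . "\n";
    $xml .= '<link>' . xmlText($channelLink) . '</link>' . "\n";
    $xml .= '<description>' . xmlText($channelTitle) . '</description>' . "\n";
    foreach ($items as $item) {
        $xml .= '<item>';
        $xml .= '<title>' . xmlText($item['title']) . '</title>';
        $xml .= '<link>' . xmlText($item['link']) . '</link>';
        $xml .= '<description>' . xmlText($item['description']) . '</description>';
        $xml .= '</item>' . "\n";
    }
    $xml .= '</channel></rss>' . "\n";
    return $xml;
}

// Hypothetical helper: returns the cached feed while it is younger than
// $ttlSeconds; otherwise calls $builder, stores its result, and returns it.
function cachedFeed($cacheFile, $ttlSeconds, $builder) {
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }
    $xml = call_user_func($builder);
    file_put_contents($cacheFile, $xml);
    return $xml;
}

// Made-up item standing in for $dev[$i], $subject[$i] and $urldev[$i].
$items = array(array(
    'title' => 'SomeDev: A subject',
    'link' => 'http://boards.cityofheroes.com/showflat.php?Number=1#Post1',
    'description' => 'First 300 characters of the post. . .',
));
$cacheFile = sys_get_temp_dir() . '/devdigest.rss'; // assumed cache location
echo cachedFeed($cacheFile, 600, function () use ($items) {
    return buildRssFeed('CoX Dev Digest', 'http://boards.cityofheroes.com/', $items);
});
```

In the real script the builder callback would run the scrape, so the slow part only happens when the cache has expired.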

This software is licensed under the CC-GNU GPL version 2.0 or later.
Note: NCSoft Corporation, its parent companies, and its subsidiaries are hereby granted a transferable, non-revocable, perpetual license to use, modify, and distribute (or not) this software as they wish. This license is distinct from the GNU GPL license noted above.