Scraping the CoX Dev Digest

February 2nd, 2009

Even though I’m out of the game, I still browse the fora every now and then, and saw a thread about RSS on the dev digest and PHP. So I figured I’d post my code here, if anyone wanted it.

File: requestfile.function

<?php
 function RequestFile($url) {
  $parts = parse_url($url);

  $fp = @fsockopen($parts['host'], 80, $errno, $errstr, 30);

  if(!$fp) {
   die("Error while requesting: ($errno) $errstr");
  }

  if(!isset($parts['path']) || '' == $parts['path']) {
   $parts['path'] = '/';
  }
  if(isset($parts['query']) && '' != $parts['query']) {
   $parts['path'] .= '?' . $parts['query'];
  }

  $nl = "\r\n";
  $out = "GET $parts[path] HTTP/1.0".$nl."Host: $parts[host]".$nl.'User-Agent: Sapph RSS Generator'.$nl;
   $out .= 'Accept: */*'.$nl.'Accept-Language: en-gb'.$nl.'Connection: close'.$nl.$nl;

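  $result = array('headers'=>array(), 'status_version'=>'', 'status_code'=>'', 'status_phrase'=>'', 'body'=>'');

  fwrite($fp, $out);

  $inHead = true;
  while(!feof($fp)) {
   $line = fgets($fp, 4096);
   if(false === $line) {
    break;
   }

   if(!$inHead) {
    // Past the blank separator line: everything else is the response body.
    $result['body'] .= $line;
    continue;
   }

   if('' == trim($line)) {
    // The first blank line ends the headers.
    $inHead = false;
    continue;
   }

   if(preg_match("!^HTTP/([^\\s]+)\\s+([0-9]{3})\\s+(.*)\$!i", trim($line), $matches)) {
    $result['status_version'] = $matches[1];
    $result['status_code'] = $matches[2];
    $result['status_phrase'] = $matches[3];
   }
   elseif(false !== strpos($line, ':')) {
    // Skip malformed header lines that have no colon.
    list($key, $val) = explode(':', trim($line), 2);
    $result['headers'][$key] = trim($val);
   }
  }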

  fclose($fp);

  return($result);
 }
?>

File: devdigestparse.php

<?php
 $filename="http://boards.cityofheroes.com/showflat.php?Number=312062";
 include("requestfile.function");
 //$response = RequestFile($filename);
 $data = file_get_contents($filename);
 $count = substr_count($data, '<td><a href="http://boards.cityofheroes.com/showprofile.php?Cat=&User=');
 $poscurrent=strpos($data, '<td width="83%" class="lighttable">');
 $posend=strpos($data, '<table border="0" cellpadding="0" cellspacing="0" width="100%" align="left">');
 $datasubset = substr($data, $poscurrent, $posend-$poscurrent);
 if ($count>22) $count=22;
 include('closetags.function');
 for ($i=0;$i<$count;$i++) {
 	$devstart=strpos($datasubset, '<font color="#FF0000">')+22;
 	$devend=strpos($datasubset, '</font>');
 	$dev[$i]=substr($datasubset, $devstart, $devend-$devstart);
 	$urlstart=strpos($datasubset, '<a href="http://boards.cityofheroes.com/showflat.php')+9;
 	$urlend=strpos($datasubset, '">', $urlstart);
 	$urldev[$i]=str_replace('&amp;', '&', substr($datasubset, $urlstart, $urlend-$urlstart));
 	$subjectstart=$urlend+2;
 	$subjectend=strpos($datasubset, '</a></td><td>', $subjectstart);
 	$subject[$i]=substr($datasubset, $subjectstart, $subjectend-$subjectstart);
 	$datestart=$subjectend+13;
 	$dateend=strpos($datasubset, '</td>', $datestart);
 	$datedev[$i]=substr($datasubset, $datestart, $dateend-$datestart);
 	$nextsubset = strpos($datasubset, '<td><a href="http://boards.cityofheroes.com/showprofile.php?Cat=&User=', $devstart);
 	$currlen = strlen($datasubset);
 	$datasubset = substr($datasubset, $nextsubset, $currlen-$nextsubset);
 	$urllen = strlen($urldev[$i]);
 	$poststart = strpos($urldev[$i], '#Post')+5;
 	$postid[$i] = substr($urldev[$i], $poststart, $urllen-$poststart);
 	if (isset($_GET['test'])) {
 	 $filename=$urldev[$i];
 	 //$response = RequestFile($filename);
 	 $data = file_get_contents($filename);
 	 $startmarker = '<a name="Post'.$postid[$i];
 	 $poststart = strpos($data, $startmarker);
 	 $poststart = strpos($data, '<font class="post">', $poststart);
 	 $postend = strpos($data, '--------------------', $poststart);
 	 $fullpost = substr($data, $poststart, $postend-$poststart);
 	 $postpointer = 0;
 	 //echo $poststart." : ".$postend."<br><hr><br>";
 	 while (strpos($fullpost, '<font class="post">', $postpointer)!==FALSE) {
 	  $poststart = strpos($fullpost, '<font class="post">', $postpointer);
 	  $postpointer = $poststart+19;
 	 }
 	 $killtags = array('<b>', '</b>', '<i>', '</i>', '<hr>', '<br><br><br>');
 	 $post[$i]=str_replace('</hr></br></br></br></hr>', '', closetags(substr($fullpost, 0, 300))).". . .";
 	 $post[$i]=str_replace($killtags, '', $post[$i]);
 	 echo $post[$i]."<br><hr><br>";
  }
 }
?>
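
The parser above repeats one move over and over: find a start marker with strpos, find an end marker after it, and substr whatever sits in between. Here is a minimal sketch of that move pulled out into a helper. The name between() and the sample row are made up for illustration and are not part of the original script; the real forum markup may differ.

```php
<?php
// Hypothetical helper, not part of the original script. Returns the text
// between $start and the next $end, searching from $offset, or null if
// either marker is missing. $offset is moved past the end marker so that
// calls can be chained along one row.
function between($haystack, $start, $end, &$offset = 0) {
 $from = strpos($haystack, $start, $offset);
 if (false === $from) {
  return null;
 }
 $from += strlen($start);
 $to = strpos($haystack, $end, $from);
 if (false === $to) {
  return null;
 }
 $offset = $to + strlen($end);
 return substr($haystack, $from, $to - $from);
}

// Made-up row in roughly the same shape as the digest table.
$row = '<td><font color="#FF0000">SomeDev</font></td>'
     . '<td><a href="http://example.com/showflat.php?Number=1#Post1">A subject</a></td>'
     . '<td>02/02/09</td>';

$pos     = 0;
$devname = between($row, '<font color="#FF0000">', '</font>', $pos);
$url     = between($row, '<a href="', '"', $pos);
$subject = between($row, '>', '</a>', $pos);
$date    = between($row, '<td>', '</td>', $pos);
```

Returning null on a missing marker matters here: the original arithmetic on a false strpos result silently slices from the wrong place when the page layout changes.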

Note: this file does not actually create an XML/RSS/ATOM/whathaveyou feed. It scrapes when the page is called and prints the formatted results of the scrape in-browser. Modification to cache scrapes and write to a file in the RSS format du jour is left as an exercise for the reader. However, IMO, the hard work (the HTML parsing) has been done.
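
For anyone who wants a head start on that exercise, here is one rough way it could go: build an RSS 2.0 document from the scraped values and only rebuild it when the cached copy is stale. This is a sketch, not tested against the live boards. The function names buildRss() and refreshFeed(), the item keys, and the example URLs are all assumptions of mine, and it assumes the scraped date string has already been turned into a Unix timestamp.

```php
<?php
// Sketch only. buildRss(), refreshFeed() and the item keys are hypothetical
// names for this example; none of them appear in the original scripts.
function buildRss($title, $link, $description, array $items) {
 $esc = function ($s) {
  return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
 };
 $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
 $xml .= '<rss version="2.0"><channel>' . "\n";
 $xml .= ' <title>' . $esc($title) . '</title>' . "\n";
 $xml .= ' <link>' . $esc($link) . '</link>' . "\n";
 $xml .= ' <description>' . $esc($description) . '</description>' . "\n";
 foreach ($items as $item) {
  $xml .= ' <item>' . "\n";
  $xml .= '  <title>' . $esc($item['dev'] . ': ' . $item['subject']) . '</title>' . "\n";
  $xml .= '  <link>' . $esc($item['url']) . '</link>' . "\n";
  $xml .= '  <guid>' . $esc($item['url']) . '</guid>' . "\n";
  // RSS wants RFC 822 style dates; DATE_RSS is PHP's format constant for that.
  $xml .= '  <pubDate>' . gmdate(DATE_RSS, $item['timestamp']) . '</pubDate>' . "\n";
  $xml .= '  <description>' . $esc($item['excerpt']) . '</description>' . "\n";
  $xml .= ' </item>' . "\n";
 }
 return $xml . '</channel></rss>' . "\n";
}

// Returns the cached feed if it is younger than $maxAge seconds; otherwise
// calls $build (which would do the scraping) and rewrites the cache file.
function refreshFeed($cacheFile, $maxAge, $build) {
 if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
  return file_get_contents($cacheFile);
 }
 $xml = $build();
 file_put_contents($cacheFile, $xml, LOCK_EX);
 return $xml;
}

// Made-up item standing in for one pass of the scraping loop.
$items = array(
 array('dev' => 'SomeDev', 'subject' => 'Re: Example & subject',
       'url' => 'http://example.com/showflat.php?Number=1#Post1',
       'timestamp' => 0, 'excerpt' => 'First 300 characters of the post. . .'),
);
$feed = buildRss('Dev Digest', 'http://example.com/', 'Scraped dev posts', $items);
```

The dev name goes into the item title because the RSS 2.0 author element is meant to hold an email address, which the scrape does not have.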



CC-GNU GPL

This software is licensed under the CC-GNU GPL version 2.0 or later.

Note: NCSoft Corporation, its parent companies, and its subsidiaries are hereby granted a transferable, non-revocable, perpetual license to use, modify, and distribute (or not) this software as they wish. This license is distinct from the GNU GPL license noted above.

WikiLeaks Shutdown? Madness? This is the INTERNET!

February 20th, 2008

Apparently, a court ordered the DNS server for WikiLeaks to delist them in some futile attempt at shutting down the site.

First, a primer on WikiLeaks. Rather than trying to explain it myself, I’ll quote from Wikipedia’s entry on WikiLeaks:

Wikileaks is a website that allows whistleblowers to anonymously release government and corporate documents, allegedly without possible retribution. Wikileaks operates on modified MediaWiki software, although it is independent from the Wikimedia Foundation. It claims that postings are untraceable by anyone attempting to do so.

It has a special focus on documents from countries where getting caught leaking such documents could mean a dark cell forever. This is a cause anyone can get behind - anyone who is not utterly corrupt, that is.

Next, a quick primer on DNS. I can handle this one without Wikipedia. ;p

Every device that is on the Internet has an IP address. It’s a series of four numbers, each between 0 and 255, separated by dots. The IP address of the server sapph.org lives on is 216.104.33.126.
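
To make the "four numbers between 0 and 255" rule concrete, here is a small sketch that checks whether a string has that shape and packs the four numbers into the single 32-bit value they stand for. It only covers the classic dotted form described above (IPv4). The function names are mine, and the arithmetic assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helpers for illustration. isDottedQuad() checks the
// "four numbers, 0 to 255, separated by dots" shape described above.
function isDottedQuad($ip) {
 $parts = explode('.', $ip);
 if (count($parts) !== 4) {
  return false;
 }
 foreach ($parts as $part) {
  if (!preg_match('/^[0-9]{1,3}$/', $part) || (int)$part > 255) {
   return false;
  }
 }
 return true;
}

// Packs the four numbers into one integer, most significant number first.
// Returns null for anything that is not a dotted quad.
function quadToInt($ip) {
 if (!isDottedQuad($ip)) {
  return null;
 }
 list($a, $b, $c, $d) = array_map('intval', explode('.', $ip));
 return ($a << 24) | ($b << 16) | ($c << 8) | $d;
}

$ok  = isDottedQuad('216.104.33.126'); // true: four parts, all in range
$bad = isDottedQuad('216.104.33.300'); // false: 300 is above 255
$num = quadToInt('216.104.33.126');
```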

Every website you visit - without exception - has an IP address. But IP addresses aren’t the easiest things to remember. For instance, if I were telling you where the White House is, I could tell you that it is at N 38° 53′ 55.5″ W 77° 2′ 15.7″. But that’s cumbersome. Instead, I would tell you that it is at 1600 Pennsylvania Ave NW, Washington, DC 20502. But they both direct you to the same place. In the same way, DNS is a way to put an easy-to-remember name on an IP address. And it works like this (in theory):

If your computer had never been on the Internet before, and didn’t know the IP for anything, and you went to sapph.org, it would go to the root nameserver and ask “Do you know what the IP address for sapph.org is?” The root nameserver would say “No, but here is the .org nameserver.” So your computer would go to the .org nameserver and ask the same question. The .org nameserver would say “No, but here is the nameserver for sapph.org.” So your computer would go to my host’s nameserver, and it would tell your computer my IP.

This is where the court order stepped in. The court order told that last nameserver in the chain “No matter who asks, don’t give them an answer for wikileaks.org”. So if that’s all you knew, you couldn’t get to it.
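
The lookup chain and the court order can be played out as a toy model. This is a sketch with a hard-coded table standing in for the nameservers; it does not speak the real DNS protocol, and the server labels (root, org-ns, host-ns, wikileaks-ns) are invented for the example. Only the two IP addresses come from this post.

```php
<?php
// Toy model of the lookup chain. Server labels are made up; each server
// either answers for a name or refers the asker to another server.
$servers = array(
 'root'         => array('refer' => array('org' => 'org-ns')),
 'org-ns'       => array('refer' => array('sapph.org' => 'host-ns', 'wikileaks.org' => 'wikileaks-ns')),
 'host-ns'      => array('answer' => array('sapph.org' => '216.104.33.126')),
 'wikileaks-ns' => array('answer' => array('wikileaks.org' => '88.80.13.160')),
);

// Starts at the root and follows referrals until a server answers or
// nobody has anything more to say. Returns the IP (or null) and the
// list of servers that were asked along the way.
function resolve(array $servers, $name) {
 $server = 'root';
 $trail = array();
 while (true) {
  $trail[] = $server;
  $entry = $servers[$server];
  if (isset($entry['answer'][$name])) {
   return array('ip' => $entry['answer'][$name], 'trail' => $trail);
  }
  $referrals = isset($entry['refer']) ? $entry['refer'] : array();
  $next = null;
  foreach ($referrals as $suffix => $target) {
   // A referral applies if the name is the suffix or ends in ".suffix".
   if ($name === $suffix || substr($name, -strlen($suffix) - 1) === '.' . $suffix) {
    $next = $target;
   }
  }
  if (null === $next) {
   return array('ip' => null, 'trail' => $trail);
  }
  $server = $next;
 }
}

$mine   = resolve($servers, 'sapph.org');
$before = resolve($servers, 'wikileaks.org');

// The court order: the last nameserver in the chain stops answering.
unset($servers['wikileaks-ns']['answer']['wikileaks.org']);
$after = resolve($servers, 'wikileaks.org');
```

After the unset, the chain is still walked all the way to the last server, but the answer comes back empty. Nothing in the model touches the machine at the address itself, only the name that pointed to it.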

But that doesn’t mean the website is shut down. It’s still chugging along happily at 88.80.13.160. It just doesn’t have its friendly name anymore. So if you already knew the IP address, you could still get to it.

So: 1. The government is stepping all over First Amendment rights to help some Swiss bank cover up some money laundering.
So: 2. The government’s doing it in a really dumb way.

Simple enough. I’ve set up http://wikileaks.sapph.org to redirect to WikiLeaks. How many people do I think are going to use it? Probably no one. No one reads this - hell, I haven’t written in it ever. But I felt like I had to get this out there.