Using Cron with LWP::Simple and XML::RSS to retrieve news feeds

Originally published on March 24, 2003, when the war in Iraq was heating up and I found direct links to popular RSS news feeds were affecting the speed at which pages loaded on a friend’s blog that I help maintain. I’m re-posting this article for reasons that will become obvious later this week. Until then, enjoy this “Spidering Hack!-)”

Adding a few syndicated news feeds is a nice way of bringing some compelling content to your site.

The problem is that on heavy news days a feed can get overrun, go offline, and/or suffer a host of other connectivity issues that make YOUR site load slowly, because the software holds your user hostage while the feed-retrieval portion of the application waits to time out. You see this a lot with PHPNuke and PostNuke sites.

A simple way around this problem is to use a program that periodically retrieves the feed, slices-n-dices it, and effectively caches it into an easy-to-include file on your host. Doing this achieves five goals:

  1. user page loads are not penalized when feeds go down
  2. failures to connect do not harm the existing include file
  3. multiple attempts to read the feed do not penalize the user
  4. feed can be mirrored for local/private use
  5. content can be formatted to taste
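Goal #2 deserves a word of caution: if a script writes directly over the live include file, a crash mid-write can still leave a truncated file behind. One way to make the swap safe is to write to a temporary file and then rename it into place, since rename() is atomic on the same filesystem. Here is a minimal sketch of that idea (the sub name and file names are my own, hypothetical examples, not part of the program below):

```perl
#!/usr/bin/perl -w
use strict;

# Write the new content to a temp file, then rename it over the live
# include file; readers always see either the complete old file or the
# complete new one, never a half-written file.
sub save_atomically {
    my ($content, $outfile) = @_;
    my $tmpfile = "$outfile.tmp";
    open(TMP, "> $tmpfile") || die("Cannot open $tmpfile: $!");
    print TMP $content;
    close(TMP) || die("Cannot close $tmpfile: $!");
    rename($tmpfile, $outfile) || die("Cannot rename $tmpfile: $!");
}

save_atomically("<b>cached feed</b>\n", "newsfeed.inc.php");
```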

Below is a little program I wrote Thursday to grab news feeds from an AP wire I found via Scripting.com, for inclusion on the website of a friend who makes his living in the political arena.

Using the following crontab entry, the program is executed every 30 minutes:
0,30 * * * * /home/YOURPATH/getap.pl > /dev/null
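For reference, the five crontab fields are minute, hour, day of month, month, and day of week. Note that a bare `30` in the minute field fires once an hour, at half past; a comma-separated list in the minute field is what gives a true 30-minute cycle:

```
# min   hour  dom  mon  dow  command
0,30    *     *    *    *    /home/YOURPATH/getap.pl > /dev/null
```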

The nice thing about this approach is that this particular feed does “get busy” from time to time, and at one point on Friday went offline. My users did not notice because, in most cases, I was able to get past the “busy signal” on the 2nd or 3rd of my 10 attempts. In the case where the feed site went offline, my users merely viewed an older include file without interruption or delay.

Anyway, since I haven’t posted anything worthwhile in the past few days, I figured this was a good penance:

[perl]
#!/usr/bin/perl -w
# -----------------------------------------------------------------------
# copyright Dean Peters © 2003 – all rights reserved
# http://www.HealYourChurchWebSite.com
# -----------------------------------------------------------------------
#
# getap.pl is free software. You can redistribute and modify it
# freely without any consent of the developer, Dean Peters, if and
# only if the following conditions are met:
#
# (a) The copyright info and links in the headers remains intact.
# (b) The purpose of distribution or modification is non-commercial.
#
# Commercial distribution of this product without a written
# permission from Dean Peters is strictly prohibited.
# This script is provided on an as-is basis, without any warranty.
# The author does not take any responsibility for any damage or
# loss of data that may occur from use of this script.
#
# You may refer to our general terms & conditions for clarification:
# http://www.healyourchurchwebsite.com/archives/000002.shtml
#
# For more info. about this code, please refer to the following article:
# http://www.healyourchurchwebsite.com/archives/000760.shtml
#
# combine this code with crontab for best results, e.g.:
# 0,30 * * * * /home/YOURPATH/getap.pl > /dev/null
#
# -----------------------------------------------------------------------
use strict;
use XML::RSS;
use LWP::Simple;
# get content from feed — using 10 attempts
my $content = getFeed("http://www.goupstate.com/apps/pbcs.dll/section?Category=RSS04&mime=xml", 10);

# save off feed to a file — make sure you have write access to file or directory
saveFeed($content, "newsfeed.xml");

# create customized output
my $output = createOutput($content, 8);

# save it
saveFeed($output, "newsfeed.inc.php");
sub getFeed {
my ($url, $attempts) = @_;
my $lc = 0; # loop count
my $content;
# keep trying until the feed answers or we run out of attempts
while ($lc < $attempts) {
$content = get($url);
last if defined($content);
$lc++;
}
die("Could not retrieve $url after $attempts attempts") unless defined($content);
return $content;
}

sub saveFeed {
my ($content, $outfile) = @_;
open(OUT, "> $outfile") || die("Cannot Open File $outfile");
print OUT $content;
close(OUT);
}
sub createOutput {
my ($content, $feedcount) = @_;

# create new instance of XML::RSS
my $rss = XML::RSS->new;

# parse the RSS content into an output string to be saved at end of parsing
$rss->parse($content);
my $title = $rss->{'channel'}->{'title'};
my $output = "<b>GoUpstate/AP NewsWire</b><br />\n";
my $i = 0;
foreach my $item (@{$rss->{'items'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
$i += 1;
next if $i > $feedcount;
$output .= "<a href=\"$item->{'link'}\">$item->{'title'}</a><br />\n";
}

# if a copyright & link exists then post it
my $copyright = $rss->{'channel'}->{'copyright'};
my $link = $rss->{'channel'}->{'link'};
my $description = $rss->{'channel'}->{'description'};
$output .= "<a href=\"$link\">$copyright</a>\n" if ($copyright && $link);
return $output;
}
[/perl]

Of course, now I need to go ahead and practice what I preach and do the same here!