Add XML News Feeds to Your Site

A lot of sites now offer news feeds in XML format that you can use to fetch their news - the most common format is RSS (Rich Site Summary). With the help of PHP you can parse such a feed, even without XSLT support, and display it on your site.

How to get the feed

To get the feed, you can use PHP's fopen function, which can fetch files over HTTP as well as from the local filesystem. All you need is the http:// prefix followed by the host name and the full path to the file.

Internally, PHP performs a GET request using HTTP 1.0 and also sends the Host header required by name-based virtual hosts.

<?php
$feed = 'http://slashdot.org/slashdot.rdf';

ini_set('allow_url_fopen', true);
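$fp = fopen($feed, 'r');
if (!$fp) {
	// Without this check a failed fopen would break the read loop below.
	die("Could not open $feed");
}
$xml = '';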
while (!feof($fp)) {
	$xml .= fread($fp, 128);
}
fclose($fp);

This example shows how to get the Slashdot news feed.

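We have explicitly enabled allow_url_fopen, which PHP requires if you want to fetch files over HTTP with fopen. On newer PHP versions this setting can only be changed in php.ini, so the ini_set call may have no effect there.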

A much better idea, in terms of improved error handling, would be to use fsockopen, which was specifically developed for socket connections.

<?php
$host = 'slashdot.org';
$uri = 'slashdot.rdf';

$fp = fsockopen($host, 80, $errno, $errstr, 20);

if (!$fp) {
	die("Network error: $errstr ($errno)");
} else {
	$xml = '';
	fputs($fp, "GET /$uri HTTP/1.0\r\nHost: $host\r\n\r\n");
	while (!feof($fp)) {
		$xml .= fgets($fp, 128);
	}
	fclose ($fp);
}

We're actually doing exactly the same thing, although we can choose the timeout this time with the last parameter to fsockopen, 20 seconds in our case.

Note: Actually we get the HTTP headers also, but that's not an issue.

The other advantage is that we can print nicer error messages than the ugly ones produced by PHP.
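If the headers ever do get in the way, one possible approach is to split the raw response at the first blank line. This is only a minimal sketch: it assumes the server ends the header block with an empty line (CRLF CRLF), and split_http_response is a hypothetical helper name that is not part of the original script.

```php
<?php
// Hypothetical helper: separates the HTTP headers from the body
// of a raw response string read through fsockopen.
function split_http_response($response) {
	// The header block ends at the first empty line (CRLF CRLF).
	$parts = explode("\r\n\r\n", $response, 2);
	$headers = $parts[0];
	$body = isset($parts[1]) ? $parts[1] : '';
	return array($headers, $body);
}

// Canned response used instead of a live connection.
$raw = "HTTP/1.0 200 OK\r\nContent-Type: text/xml\r\n\r\n<rss><item></item></rss>";
list($headers, $xml) = split_http_response($raw);
echo $xml;
```

With the fsockopen example, the string collected in the read loop would be passed in place of $raw.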

Parsing the XML feed

Now that we have the data available as a PHP string, the next step is to parse it.

As I mentioned in the introduction, this method allows us to handle the news feed even without XSL support built into PHP. I have used a slightly modified version of the untag function from Using XML with PHP without any Apache changes on evolt.

Caveats of using the untag function instead of a real XSLT transformer, as mentioned in that article. It does not handle:

  • empty tags being closed in themselves like <empty/>
  • attributes like <empty isneeded="true" />

I still haven't seen a feed that uses something like that, so I hope it is OK. Of course if you know about such a feed it would help if you share it with me.

function untag($string, $tag) {
	$tmpval = array();
	$preg = "|<$tag>(.*?)</$tag>|s";

	preg_match_all($preg, $string, $tags);
	foreach ($tags[1] as $tmpcont){
		$tmpval[] = $tmpcont;
	}
	return $tmpval;
}

The function simply uses Perl Compatible Regular Expressions to extract the contents of all XML elements with the given name and return them as an array.

Sample from Slashdot's feed

<item>
<title>Do Cell Phones Make Us Stupid?</title>
<link>http://slashdot.org/article.pl?sid=02/09/03/1429222</link>
</item>

<item>
<title>Slashback: Google, Prince, Bayesian</title>
<link>http://slashdot.org/article.pl?sid=02/09/03/0138216</link>
</item>

As you can see, the XML code is very simple. There is some additional data, of course, but it is meta information about the feed, such as its title, description, logo, etc.

Transform it into HTML

Now we get to the point where we need to extract that information from the feed and transform it into HTML, which browsers can handle.

$items = untag($xml, 'item');

$html = '<p>';
foreach ($items as $item) {
	$title = untag($item, 'title');
	$link = untag($item, 'link');

	$html .= '<a href="' . $link[0] . '">' . $title[0] . "</a><br />\n";
}
$html .= '</p>';

echo $html;

We are using untag to get an array of all items, then we loop through it extracting the title and link of every single piece of news.

After we have the title and link we simply create an HTML link to the article using the title text we have. Of course, if you don't like the generated HTML you can use your own.

Some feeds also have a description element with a longer description, which can be set as the title attribute of the anchor element. For a full list of the elements you can find in a feed, look at the DTD (Document Type Definition) associated with it.
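Putting the description into the title attribute might look roughly like the sketch below. The sample item is made up for illustration, untag is condensed but behaves like the version shown earlier, and the use of strip_tags and htmlspecialchars is an assumption about what keeps the attribute value safe rather than something the feed format requires.

```php
<?php
// Condensed equivalent of the article's untag helper.
function untag($string, $tag) {
	preg_match_all("|<$tag>(.*?)</$tag>|s", $string, $tags);
	return $tags[1];
}

// Made-up sample item for illustration.
$item = '<item><title>Example headline</title>'
	. '<link>http://example.com/story</link>'
	. '<description>A longer "summary" of the story</description></item>';

$title = untag($item, 'title');
$link = untag($item, 'link');
$descr = untag($item, 'description');

// Not every feed has a description, so fall back to an empty string.
// Quotes are escaped so they cannot end the attribute early.
$tooltip = isset($descr[0]) ? htmlspecialchars(strip_tags($descr[0])) : '';

$html = '<a href="' . $link[0] . '" title="' . $tooltip . '">' . $title[0] . "</a><br />\n";
echo $html;
```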

You can also display only a few news items, for example 3, if you replace the foreach with a different loop construct.
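One simple way to do that, shown here only as a sketch, is to keep the foreach and trim the array first with array_slice. The $items array below is a stand-in for the result of untag($xml, 'item').

```php
<?php
// Stand-in for the array returned by untag($xml, 'item').
$items = array('first', 'second', 'third', 'fourth', 'fifth');

// Keep only the first 3 entries; shorter arrays come back unchanged.
$latest = array_slice($items, 0, 3);

foreach ($latest as $item) {
	echo $item, "\n";
}
```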

Comments

Newbie help

>>You can also display, for example only 3 news items, if you change the foreach with a different loop construct.
How exactly would you write that? Please help.
- Matt

RE: Newbie help

There are loads of ways to show only a limited number of items, e.g.:
$id = 0;
foreach ($items as $item) {
	if ($id == 5) {
		break;
	}
	$id++;
	$title = untag($item, 'title');
	$link = untag($item, 'link');
	// rest of code...
}

link

Hi there,
Thanks for the use of this script. The problem I have is that the xml feed I've got has links that look like this:
http%3A%2F%2Fwww%2Enzherald%2Eco%2Enz%2Fstorydisplay%2Ecfm%3FstoryID%3D3523776%26thesection%3Dtechnology%26thesubsection%3Dgeneral
Can the parser be modified to deal with this?
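Links like that are URL-encoded, so one option is to run them through PHP's urldecode before building the anchor. A minimal sketch, using a shortened copy of the link quoted in this comment:

```php
<?php
// Shortened version of the encoded link quoted in the comment.
$encoded = 'http%3A%2F%2Fwww%2Enzherald%2Eco%2Enz%2Fstorydisplay%2Ecfm%3FstoryID%3D3523776';

// urldecode turns %3A, %2F, %2E and so on back into plain characters.
$decoded = urldecode($encoded);
echo $decoded;
```

In the article's loop this would mean calling urldecode($link[0]) when building the href.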

xml news feedd

Works great, except it wipes out the code that comes after it. I used an include function to include the PHP script in my HTML. When I view the page, everything below the PHP is not seen by the browser. If you can help, thanks.

unwanted tags

I am wanting to do more than just show the links and title....
I want to display the title as a link and then the description underneath it. This is completely allowed by the site feeding the rss.
My problem lies in the fact that some descriptions will have <p> tag in them....this causes that description to not be displayed....how do i correct?

re: news feed

sam: set up your link tags as:
a target=_blank href=[html link here]
the target=_blank causes the link to open in a new window when clicked.
john: unwanted tags. just process the descriptions prior to displaying them. copy all the xml into a char string until you encounter the tag, skip past it, continue copying.
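For the unwanted tags, a possible starting point is sketched below. It is a guess at the cause: it assumes the description arrives either wrapped in a CDATA section or with its markup escaped as entities, and clean_description is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reduces a raw <description> value to plain text.
function clean_description($raw) {
	// Unwrap a CDATA section if the feed uses one.
	$text = preg_replace('/^\s*<!\[CDATA\[(.*)\]\]>\s*$/s', '$1', $raw);
	// Turn escaped markup such as &lt;p&gt; back into real tags...
	$text = html_entity_decode($text);
	// ...then drop all tags so only the text remains.
	return trim(strip_tags($text));
}

echo clean_description('<![CDATA[<p>Markets rallied today.</p>]]>'), "\n";
echo clean_description('&lt;p&gt;Markets rallied today.&lt;/p&gt;'), "\n";
```

If you want to keep some markup, strip_tags accepts a second argument listing the tags to allow.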

XML news feed

It don't work :-)
I get the error ...
Warning: fopen(http://slashdot.org/slashdot.rdf): failed to open stream: Bad file descriptor in C:\htdocs\test.php on line 6

What about removing duplicates?

Moreover has a big list of news headline feeds:
http://w.moreover.com/categories/category_list_xml.html
Unfortunately Moreover is atrocious at weeding out identical stories. Using the feeds they provide, how might I go about discarding any <article> where the <headline_text> is not unique?
Here's an example of one article:
<article id="_113573634">
<url>http://c.moreover.com/click/here.pl?x113573634</url>
<headline_text>British Airways to Resume Flight to D.C</headline_text>
<source>AP via New York Post</source>
<media_type>text</media_type>
<cluster>UK business news</cluster>
<tagline />
<document_url>http://breakingnews.nypost.com</document_url>
<harvest_time>Jan 3 2004 2:34PM</harvest_time>
<access_registration />
<access_status />
</article>
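One way to discard the repeats is to remember every headline already seen and skip articles whose headline comes up again. The sketch below assumes the articles have already been split into an array of strings (the plain untag will not match <article> because of its id attribute), and unique_articles is a hypothetical helper name. The second and third sample entries are made up for illustration.

```php
<?php
// Condensed equivalent of the article's untag helper.
function untag($string, $tag) {
	preg_match_all("|<$tag>(.*?)</$tag>|s", $string, $tags);
	return $tags[1];
}

// Hypothetical helper: keeps the first article for each distinct headline.
function unique_articles($articles) {
	$seen = array();
	$unique = array();
	foreach ($articles as $article) {
		$headline = untag($article, 'headline_text');
		// Normalise case and whitespace so near-identical copies collapse.
		$key = isset($headline[0]) ? strtolower(trim($headline[0])) : '';
		// Articles without a headline, or with one already seen, are skipped.
		if ($key === '' || isset($seen[$key])) {
			continue;
		}
		$seen[$key] = true;
		$unique[] = $article;
	}
	return $unique;
}

$articles = array(
	'<headline_text>British Airways to Resume Flight to D.C</headline_text><source>AP via New York Post</source>',
	'<headline_text>British Airways to Resume Flight to D.C</headline_text><source>Another source</source>',
	'<headline_text>Some other story</headline_text>',
);
$unique = unique_articles($articles);
echo count($unique);
```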

Thanks

What a great article, it really helped me get an XML document into a CSV file.

thanks

Hey thanks for the awesome code
if anyone is interested, here's mine with some small changes (messy but works)
grabs 3 random headlines, puts the full headline in the title attribute of the a tag and shortens the headline if necessary (ie length is > 20) with '...' on the end
$feed = 'http://www.some.com.au/FEED/xml.html';
ini_set('allow_url_fopen', true);
$fp = fopen($feed, 'r');
$xml = '';
while (!feof($fp)) {
	$xml .= fread($fp, 128);
}
fclose($fp);
function untag($string, $tag) {
	$tmpval = array();
	$preg = "|<$tag>(.*?)</$tag>|s";
	preg_match_all($preg, $string, $tags);
	foreach ($tags[1] as $tmpcont) {
		$tmpval[] = $tmpcont;
	}
	return $tmpval;
}
$items = untag($xml, 'resource');
$max = count($items) - 3;
$start = rand(0, $max);
$html = '';
for ($i = $start; $i < $start + 3; $i++) {
	$title = untag($items[$i], 'headline');
	$link = untag($items[$i], 'linkto');
	$html .= '<a href="' . $link[0] . '" title="' . $title[0] . '">';
	$len = strlen($title[0]);
	if ($len > 20) {
		$words = explode(' ', $title[0]);
		$title[0] = '';
		for ($w = 0; $w < count($words); $w++) {
			$title[0] .= $words[$w] . ' ';
			if (strlen($title[0]) > 20) {
				$title[0] .= '...';
				break;
			}
		}
	}
	$html .= $title[0] . "</a><br />\n";
}
echo $html;
cheers

A particular subject

Is there any way to get a particular area like all the articles on the job market or the housing market?

adding a date

i'm fiddling with it, but i'm having a hard time trying to get the date inserted in the html string as well...any ideas?

limiting results?

is there a way to limit the amount of feeds that come through, e.g I only want the first 5 news headlines

limited number of items

$html = '';
$html2 = '';
$a = 0;
foreach ($items as $item) {
	// limited number of items: stop after the first 3
	// (break leaves the loop; exit would end the whole page)
	if ($a == 3) {
		break;
	}
	$a++;
	// end limited number of items

	$description = untag($item, 'description');
	$title = untag($item, 'title');
	$link = untag($item, 'link');
	$html .= '<a href="' . $link[0] . '">' . $title[0] . "</a><br />\n";
	$html2 .= strip_tags($description[0]);

	$descricao = explode('(&lt;i&gt;', $description[0]);
	print $descricao[0];
	print '<br><a href="' . $link[0] . '">' . $title[0] . "</a><br />\n";
}

Making the HTML part from the XML Code

I want to know what the HTML code would be for the XML code at http://rss.groups.yahoo.com/group/nascar_pitstop/rss. How would you make the code?

W00t

Very nice work. I've been toying with this for hours here at work in textpad (it's so difficult to pull these things together when you have no place to debug it either!)
This article has really helped me out. Thanks:)

Limited items and description

I was grabbing the top five stories from Yahoo and wanted to show the first 100 characters of the description. I wrote a quick function that starts at the 100th character and loops until it finds a space and a suitable break point.
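function strclnup($description) {
	// Short descriptions need no trimming.
	if (strlen($description) <= 100) {
		return $description;
	}
	// Look for a space between characters 100 and 150 to break on.
	for ($i = 100; $i <= 150; $i++) {
		if (substr($description, $i, 1) == " ") {
			return substr($description, 0, $i);
		}
	}
	// No space found in that range: cut hard at 150 characters.
	return substr($description, 0, 150);
}
$html = '<p>';
$id = 0;
foreach ($items as $item) {
	if ($id == 5) {
		break;
	}
	$id++;
	$title = untag($item, 'title');
	$link = untag($item, 'link');
	$descr = untag($item, 'description');
	$html .= "<a href='" . $link[0] . "' target='_new'>" . $title[0] . "</a><br>" . strclnup($descr[0]) . "...<br>\n";
}
$html .= "</p>";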

Yahoo News Search

Hi all, my name is Daniele Leone. I read that someone asked about reading news from Yahoo!. I just built a simple but powerful script that lets you search Yahoo news feeds. Try the demo here http://www.danieleleone.com and download it for free if you like! ;-) Bye, Daniele

Another way Shorten Description

What a great XML feed routine, thanks! I use it successfully to take feeds from Silicon.com for my site www.FirstAidforComputers.com
By the way SteveT I use a routine to shorten descriptions as follows:
$sh = array_slice(explode(" ",$description),0,100);
$shortened = implode(" ",$sh);
Hope it comes in useful.

An alternative shorten function

Instead of exploding the string into an array just to convert it back, a similar thing can be done with a regular expression, all in one line:
$title = preg_replace('/^(.{20}\S*)\s.*$/s', '$1...', $title);
Here the code trims the title at the first word boundary after 20 characters, so the last word stays complete, and places an ellipsis ('...') at the end. Titles shorter than that are left untouched.

password protected feeds

how do I grab a password protected feed if I know the password?
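If the feed is protected with HTTP Basic authentication (an assumption, since other schemes exist), one option is to send an Authorization header with the fsockopen request. The sketch below only builds the request string; build_auth_request is a hypothetical helper, and the host, path, and credentials are placeholders.

```php
<?php
// Hypothetical helper: builds a GET request carrying HTTP Basic
// credentials, ready to be written to a socket with fputs.
function build_auth_request($host, $uri, $user, $password) {
	// Basic auth sends "user:password" encoded as base64.
	$credentials = base64_encode("$user:$password");
	return "GET /$uri HTTP/1.0\r\n"
		. "Host: $host\r\n"
		. "Authorization: Basic $credentials\r\n"
		. "\r\n";
}

// Placeholder host, path, and credentials for illustration only.
$request = build_auth_request('example.com', 'private/feed.rdf', 'user', 'secret');
echo $request;
```

The returned string would replace the literal request passed to fputs in the fsockopen example. Keep in mind that Basic credentials travel essentially in the clear over plain HTTP.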

Another cock teaser

If it seems too good to be true, it probably is. If this script worked, it would have been great...
I just get a blank page. I'm using PHP5. Does anybody know why it's not working?

Thanks Martin

I used your second method and it gives the links to the newsfeeds now.
Thanks a million. I stand corrected. This is the real thing.

pubDate better way?

I used this code to display the pubDate in local time, but there must be a better way
$pubDate = untag($item, 'pubDate');
//Tue, 02-11-2004 02:10:49 GMT
//Thu, 14 Oct 2004 08:21:00 GMT
$Section = substr($pubDate[0], 5);
$day = substr($Section, 0, 2);
$mth = substr($Section, 3, 2);
$match = preg_replace("/.[0-9]/", "match", $mth);
if( $match == "match"){
$yrs = substr($Section, 6, 4);
$hrs = substr($Section, 11, 2);
$min = substr($Section, 14, 2);
$sec = substr($Section, 17, 2);
$newpubDate = mktime( $hrs, $min, $sec, $mth, $day, $yrs, 1);
$newpubDate = date("D, j M \a\t g:i a", $newpubDate);
}
else
{
$mth = substr($Section, 3, 3);
$mth = strtotime("10 $mth 2000");
$mth = date("m",$mth);
$yrs = substr($Section, 7, 4);
$hrs = substr($Section, 12, 2);
$min = substr($Section, 15, 2);
$sec = substr($Section, 18, 2);
$newpubDate = mktime( $hrs, $min, $sec, $mth, $day, $yrs, 1);
$newpubDate = date("D, j M \a\t g:i a", $newpubDate);
}
BTW excellent script, easy to implement, and no php_extensions required!
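A shorter route, at least for dates in the usual "Thu, 14 Oct 2004 08:21:00 GMT" form, is to let strtotime do the parsing. This is a sketch only: format_pub_date is a hypothetical helper, the numeric "02-11-2004" variant may not be recognised and is simply returned unchanged in that case, and very old PHP versions signal failure with -1 rather than false.

```php
<?php
// Hypothetical helper: formats an RFC 822 style pubDate in the
// server's local time zone.
function format_pub_date($pubDate) {
	$timestamp = strtotime($pubDate);
	if ($timestamp === false) {
		// Unrecognised format: show the original text rather than guess.
		return $pubDate;
	}
	return date('D, j M \a\t g:i a', $timestamp);
}

echo format_pub_date('Thu, 14 Oct 2004 08:21:00 GMT');
```

The exact output depends on the time zone PHP is configured with.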

Tags with additional content

As Wasabi mentions in the post above, tags with content inside of them do in fact exist now. I've encountered one at this feed (http://www.core77.com/corehome/index.xml). Specifically they embed some RDF data in the XML tag for Item. The item tag reads like this:
<item rdf:about="http://www.example.com/stories/001.html">
The easiest answer would seem that I should write a custom function JUST to handle the parsing of the $xml variable to extract the information into the $items array. The problem is, each of the articles listed has a different URL embedded inside its <item> tag... however I don't know how to write wildcards in PHP!!
Golly gee whillikers it would sure be neat to know how to handle this....

Tags with additional content (hack)

Well, I "kind of" solved this problem. First off, I made a duplicate of the untag function and gave it a new name (let's call it 'workaround_link'). Turns out that you really don't NEED to include the ENTIRE opening tag for the variable $preg, just the closing angle bracket of the tag. So it could look like this:
$preg = "|>(.*?)</$tag>|s";
All I know is that it WORKS!!!! (and that I can now rest)
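A somewhat stricter alternative is to let the pattern accept optional attributes after the tag name, so the match still starts at the right opening tag. The sketch below is one way to write that. untag_attr is a hypothetical name, the sample stories are made up, and self-closing tags such as <empty/> are still not handled.

```php
<?php
// Hypothetical variant of untag that also accepts attributes
// on the opening tag.
function untag_attr($string, $tag) {
	$name = preg_quote($tag, '|');
	// Matches <tag> or <tag attr="...">, then the content up to </tag>.
	$preg = "|<$name(?:\s[^>]*)?>(.*?)</$name>|s";
	preg_match_all($preg, $string, $tags);
	return $tags[1];
}

// Made-up sample: one item with an rdf:about attribute, one without.
$xml = '<item rdf:about="http://www.example.com/stories/001.html">'
	. '<title>First story</title></item>'
	. '<item><title>Second story</title></item>';

$items = untag_attr($xml, 'item');
echo count($items);
```

Because the tag name must be followed by whitespace or the closing bracket, a search for item does not accidentally match a wrapping <items> element.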