In this post, I'm going to give a simple explanation and a basic example of using Yahoo Pipes to fetch a web page (you are free to use any page you want, assuming it allows Yahoo Pipes) and then create an RSS feed from it, so you can read it in your favorite RSS reader.
As an example, I'll create an RSS feed from the HorribleSubs website (horriblesubs.org), which I've been using (for myself only) to keep track of their Gintama releases easily. I read that they're planning a total makeover of their site, so I guess it's okay to use them as an example.
Before anything else, please see the source of the pipe used in this example. You need to be logged in to Yahoo to view it or to create a new pipe.
Update 1: Here's the updated version of the pipe, which is used for their new domain (horriblesubs.info) and their new site design. The old pipe is left in place in case you want to compare it with the updated one, and also because the screenshots used here are based on the old pipe. As you can see, the process itself is still the same, with some adjustments.
Update 2: As of June 2012 it is still working (that was the last time I checked their website; ever since Gintama ended, I don't check it anymore). I also noticed that they now publish their own RSS feed, so if you only use this simple example to get their feed, I recommend grabbing their official feed instead, because I saw on their page that they're planning to redesign their website.
Update 3: As of November 2012 the above pipe (the pipe in Update 1) is still working, if you need to see a working example. I also made some changes to this post by including images.
1. The first thing you need to do is examine the source of the page you're going to fetch, to see where you should start cutting and how the items are separated.
For example, in this case the content I'm going to pick is wrapped in a div ( <div id="tab3" class="boxcontent"> ) and the items are separated by <br/> tags, so I just need to enter that into the Fetch Page module.
Here is what it looks like on the Yahoo Pipes side (screenshot of the Fetch Page module settings).
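If you want to try the same cut-and-split idea outside the pipe editor, here is a minimal Python sketch of it. It is not what Yahoo Pipes runs internally; the function name `cut_and_split`, the sample HTML, and the example.com links are all made up for illustration. Only the div marker and the <br/> delimiter come from the page described above.

```python
def cut_and_split(html, start_marker, end_marker, delimiter):
    """Keep the text between start_marker and end_marker, then split it into items."""
    start = html.find(start_marker)
    if start == -1:
        return []  # the page layout changed, nothing to cut
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)  # no end marker found, keep everything after the start
    section = html[start:end]
    # Drop empty pieces such as the one left after the last delimiter.
    return [part.strip() for part in section.split(delimiter) if part.strip()]


# Hypothetical sample page, shaped like the structure described in step 1.
SAMPLE_PAGE = (
    '<html><body><div id="tab3" class="boxcontent">'
    '<a href="http://example.com/ep1.torrent">Show - 01</a><br/>'
    '<a href="http://example.com/ep2.torrent">Show - 02</a><br/>'
    '</div></body></html>'
)

items = cut_and_split(
    SAMPLE_PAGE, '<div id="tab3" class="boxcontent">', '</div>', '<br/>'
)
for item in items:
    print(item)
```

Plain string searching like this is fragile in the same way the module's begin/end fields are: if the site changes its markup, the markers have to be updated.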
2. In this part, I filter out unneeded HTML tags and content that I deem unnecessary, using the Regex (regular expression) module, as you can see in the pipe source. Because there's no single regex rule to rule them all (it depends on your needs), you'll need to experiment with the regex part yourself.
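To give a feel for what such a clean-up rule can look like, here is a small Python sketch. The rules are only an assumption for illustration (strip every tag except the link tag, then tidy the whitespace); the rules in the actual pipe are different and depend on the page, and the sample item is made up.

```python
import re


def clean_item(raw_html):
    """Remove every HTML tag except <a ...> and </a>, then collapse whitespace."""
    # (?!a\b) skips tags whose name is exactly "a", so the link survives.
    without_tags = re.sub(r"</?(?!a\b)[a-zA-Z][^>]*>", "", raw_html)
    return re.sub(r"\s+", " ", without_tags).strip()


# Hypothetical raw item with some decoration around the link.
raw = (
    '<b>[Group]</b> <a href="http://example.com/x">Show - 01</a>'
    '  <span class="d">(new)</span>'
)
cleaned = clean_item(raw)
print(cleaned)
```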
3. Next, so that I can process each field separately, I map the previously cleaned-up content to title, description, and link, which are used as the title, description, and link of each RSS feed item respectively, using the Rename module.
Here's the output (screenshot of the output after the Rename module).
4. The Regex module is used once again, this time to strip the HTML tags from the title (to differentiate it from the description) and to reduce the link to the target URL only (in this case, the torrent link). That way, clicking the title in your RSS reader takes you directly to the target URL. (Note: compare the output image below with the one above.)
Here's the output (screenshot of the output after the second Regex module).
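Steps 3 and 4 together can be sketched in Python roughly as follows: copy the same cleaned-up content into three fields, then reduce the title to plain text and the link to the URL inside href. The function name `to_feed_fields` and the sample item are made up, and the two patterns are simple assumptions, not the rules from the pipe.

```python
import re
from html import unescape


def to_feed_fields(item_html):
    """Copy one cleaned item into title/description/link, then tidy title and link."""
    # Step 3: the same content is copied into all three fields.
    fields = {"title": item_html, "description": item_html, "link": item_html}
    # Step 4a: the title becomes plain text, the description keeps its HTML.
    fields["title"] = unescape(re.sub(r"<[^>]+>", "", fields["title"])).strip()
    # Step 4b: the link becomes only the URL from the first href attribute.
    match = re.search(r'href="([^"]+)"', fields["link"])
    fields["link"] = match.group(1) if match else ""
    return fields


# Hypothetical cleaned item.
sample = '<a href="http://example.com/ep1.torrent">Show - 01</a>'
fields = to_feed_fields(sample)
print(fields["title"])
print(fields["link"])
```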
5. Finally, connect it to the Pipe Output module. To get it as an RSS feed, just copy the Get as RSS link from your pipe, and you're done.
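For comparison, this is roughly what the final "turn items into RSS" step amounts to if you do it by hand. It is a minimal sketch using Python's standard xml.etree.ElementTree module; the feed title, the URLs, and the function name `build_rss` are placeholders, and the output is a bare RSS 2.0 document, not a copy of what Yahoo Pipes generates.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title, feed_link, items):
    """Build a minimal RSS 2.0 document from dicts with title, link, description."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three elements on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    for item in items:
        node = ET.SubElement(channel, "item")
        for key in ("title", "link", "description"):
            # ElementTree escapes HTML in the description for us.
            ET.SubElement(node, key).text = item[key]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical item, shaped like the output of step 4.
feed_items = [
    {
        "title": "Show - 01",
        "link": "http://example.com/ep1.torrent",
        "description": '<a href="http://example.com/ep1.torrent">Show - 01</a>',
    }
]
rss_text = build_rss("Example releases", "http://example.com/", feed_items)
print(rss_text)
```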
Also, Yahoo Pipes imposes usage limits, as quoted below:
200 runs (of a given Pipe) in 10 minutes
200 runs (of any Pipe) from an IP in 10 minutes
If you exceed the 200 runs in a 10 minute block, your Pipe will be 999’ed for a hour.
You should make sure to cache your output before using it (unless perhaps you're the only person who uses the pipe you've created, though it's still better to cache it).
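One simple way to cache the output is to keep the last fetched copy and reuse it until it is older than some maximum age. Below is a minimal Python sketch of that idea. The class name `TimedCache`, the `fetch_feed` placeholder, and the 15 minute maximum age are my own assumptions; pick whatever age suits how often the source page changes.

```python
import time


class TimedCache:
    """Keep one fetched value and reuse it until it is older than max_age_seconds."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock  # replaceable, which makes the cache easy to test
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._fetched_at is None or now - self._fetched_at >= self._max_age
        )
        if expired:
            # Only this branch actually runs the pipe.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


def fetch_feed():
    # Placeholder: a real version would download your pipe's RSS URL here.
    return "<rss>...</rss>"


feed_cache = TimedCache(fetch_feed, max_age_seconds=15 * 60)
print(feed_cache.get())  # first call fetches
print(feed_cache.get())  # second call is served from the cache
```

With a cache like this in front of the pipe, the number of runs depends on the maximum age rather than on how many visitors request the feed.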
Thanks for the tutorial. FYI there’s a new way to fetch the exact HTML content you need using XPath. It makes things so much easier (and generally more reliable) than trying to extract using the old BEGIN and END fields.
This is great, and also an alternative to Feed43 for scraping web news. Thanks for the article.
Regards,
D