Screen scraping with Whidbey
A command-line OPML importer for Bloglines
I wanted to play with HttpWebRequest and its related classes in Whidbey, so I chose to write an application that loads my OPML into Bloglines. Obviously, the easy answer would be to capture what Bloglines does in “Import subscriptions” and reproduce that in code. That would have meant basically using the <input type=’file’> element’s semantics to implement this code. While that would have been challenging, being the masochist that I am, I decided to take the longest route.
Following is the result of my journey.
The plan was to
- Get the list of RSS URLs from OPML.
- Login using the same POST mechanism that the web site uses.
- Keep the CookieContainer around.
- Submit each RSS URL to subscribe via GET.
- In the resulting page, scrape out the folder options and the ID for the site you would like to subscribe to.
- Scrape the site id. This can be done by looking for the “siteid” in the resulting page.
- See if the RSS XML is in a folder in OPML structure.
- If yes, create or find the ID of the folder on Bloglines site. This can be done by scraping the <option>s within a “folder” selection dropdown.
- If no, just use top level folder.
- After that, take the site ID and the folder ID gathered in the steps above, build a POST, and send it on its merry way.
Did you think I was joking when I said I took the long route?
In .NET, it is pretty easy to code this stuff, though. Between Regex, HttpWebRequest and CookieContainer, most of the coding was straightforward.
I don’t want to paste the whole class that I ended up with here, but here are the key elements.
- To get the RSS URLs from OPML, I used XPath to get the title of each parent element and then all the xmlUrl attributes underneath it.
XPathNavigator nav = xml.CreateNavigator();
XPathNodeIterator itTop = nav.Select("/opml/body/outline/@title"); // Get the folder name
// I support only one folder since Bloglines does not support more than one folder (at least I have not been able to figure it out yet.)
while (itTop.MoveNext()){
    // Clone first: moving the iterator's own Current navigator off the selected node is not safe.
    XPathNavigator navInner = itTop.Current.Clone();
    String key = navInner.Value;
    navInner.MoveToParent(); // from the title attribute up to its <outline> element
    XPathNodeIterator it = navInner.Select(".//@xmlUrl");
    ArrayList arr = new ArrayList();
    while (it.MoveNext()){
        arr.Add(it.Current.Value);
    }
    ht.Add(key, arr);
}
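For readers who want something they can compile on its own, here is a minimal sketch of the same OPML walk. It is not the class described in this post: the names OpmlReader and ReadFeeds are made up for illustration, it uses generic collections instead of Hashtable and ArrayList, and it assumes the OPML keeps its folders as top-level outline elements with a title attribute.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical helper, not part of the original class.
public static class OpmlReader
{
    // Maps each top-level outline title (the folder) to the feed URLs found beneath it.
    public static Dictionary<string, List<string>> ReadFeeds(TextReader opml)
    {
        Dictionary<string, List<string>> feeds = new Dictionary<string, List<string>>();
        XPathNavigator nav = new XPathDocument(opml).CreateNavigator();

        XPathNodeIterator folders = nav.Select("/opml/body/outline[@title]");
        while (folders.MoveNext())
        {
            XPathNavigator folder = folders.Current;
            string title = folder.GetAttribute("title", "");

            // Reuse the list if two top-level outlines share a title.
            List<string> urls;
            if (!feeds.TryGetValue(title, out urls))
            {
                urls = new List<string>();
                feeds.Add(title, urls);
            }

            // Select is evaluated relative to the folder element and does not move it.
            XPathNodeIterator it = folder.Select(".//@xmlUrl");
            while (it.MoveNext())
            {
                urls.Add(it.Current.Value);
            }
        }
        return feeds;
    }
}
```

Calling OpmlReader.ReadFeeds(File.OpenText("subscriptions.opml")) would then give one entry per folder; the file name is only a placeholder.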
- Logging in was straightforward: HttpWebRequest.Create() with the URL, followed by the same POST that the site's login form sends.
- To subscribe.
Get the first connection.
String url = "http://www.bloglines.com/sub?url=" + HttpUtility.UrlEncode(rssXML);
HttpWebRequest request = CreateRequest(url);
Scrape the siteid.
Regex regex = new Regex("\\/preview\\?siteid=([0-9]+)");
Match m = regex.Match(outStr);
String siteId = m.Groups[1].Value;
Scrape the folders and their ids.
String strOptions = "<select name=\"folder\" onChange=\"checkSub()\">";
int posIndex = outStr.LastIndexOf(strOptions);
String strRemaining = outStr.Substring(posIndex + strOptions.Length);
int lastPos = strRemaining.IndexOf("</select>");
string justOptionsList = strRemaining.Substring(0, lastPos); // everything up to, but not including, </select>
// I am proud of the following regex :)
regex = new Regex("\\<option\\svalue=\\\"([0-9]+)\\\"\\s*[a-zA-Z]*\\>\\s*([a-zA-Z\\s0-9\\.]+)\\s*\\<\\/");
MatchCollection mc = regex.Matches(justOptionsList);
Hashtable ht = new Hashtable();
foreach (Match mTemp in mc){
    // Group 1 is the folder id, group 2 the folder name; key the table by name.
    // Trim because the name group also swallows trailing whitespace.
    ht[mTemp.Groups[2].Value.Trim()] = mTemp.Groups[1].Value;
}
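The two scraping steps above can also be tried in isolation against a canned page. The following is a sketch under assumptions, not the code from the original class: SubscribePageScraper and its method names are invented, the patterns are deliberately looser than the regex shown above, and it reads the first folder dropdown on the page rather than the last one.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original class.
public static class SubscribePageScraper
{
    private static readonly Regex SiteIdPattern =
        new Regex(@"/preview\?siteid=([0-9]+)");

    // Captures everything between <select name="folder" ...> and </select>.
    private static readonly Regex SelectPattern = new Regex(
        @"<select\s+name=""folder""[^>]*>(.*?)</select>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Group 1: folder id. Group 2: folder name without surrounding whitespace.
    private static readonly Regex OptionPattern = new Regex(
        @"<option\s+value=""([0-9]+)""[^>]*>\s*([^<]*?)\s*<",
        RegexOptions.IgnoreCase);

    // Returns the site id, or null when the page has no preview link.
    public static string FindSiteId(string html)
    {
        Match m = SiteIdPattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns folder name -> folder id; empty when there is no folder dropdown.
    public static Dictionary<string, string> FindFolders(string html)
    {
        Dictionary<string, string> folders = new Dictionary<string, string>();
        Match select = SelectPattern.Match(html);
        if (!select.Success)
        {
            return folders;
        }
        foreach (Match option in OptionPattern.Matches(select.Groups[1].Value))
        {
            folders[option.Groups[2].Value] = option.Groups[1].Value;
        }
        return folders;
    }
}
```

Returning null or an empty dictionary instead of throwing makes it easier to skip a feed whose subscribe page does not look the way the scraper expects.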
Post it back to Bloglines with the prescribed format.
All this was easy enough. There were a couple of interesting discoveries in the process, though.
- Because of security changes, .NET 2.0 may fail with a “Protocol violation exception” a lot more often than previous versions did. The cause is malformed headers sent by some existing websites. The following app.config fixes that; it did for me at least. :) useUnsafeHeaderParsing is new in 2.0. See the Beta feedback center for more info.
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<system.net>
<settings>
<httpWebRequest useUnsafeHeaderParsing = "true" />
</settings>
</system.net>
</configuration>
- For reasons unknown, “unexpected error during receive” errors start to occur. This is most likely due to “Keep-alive”.
request.KeepAlive = false; //Seems to fix that.
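To show how the pieces fit together, here is a rough sketch of a small session wrapper: one shared CookieContainer so the login cookie travels with later requests, KeepAlive switched off as described above, and a form-encoded POST. The class and method names (BloglinesSession, BuildForm, Post, Get) are invented for this sketch, and it makes no claim about the actual login URL or form field names; those have to be read off the real requests.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical wrapper, not the class from the original post.
public class BloglinesSession
{
    // Shared across requests so cookies set at login are sent afterwards.
    private readonly CookieContainer cookies = new CookieContainer();

    public HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        request.KeepAlive = false; // avoids the "unexpected error during receive" failures
        return request;
    }

    // Turns alternating names and values into an URL-encoded form body.
    public static string BuildForm(params string[] namesAndValues)
    {
        if (namesAndValues.Length % 2 != 0)
        {
            throw new ArgumentException("Expected name/value pairs.");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < namesAndValues.Length; i += 2)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(namesAndValues[i]));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(namesAndValues[i + 1]));
        }
        return sb.ToString();
    }

    public string Get(string url)
    {
        return ReadBody(CreateRequest(url));
    }

    public string Post(string url, string formBody)
    {
        HttpWebRequest request = CreateRequest(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        byte[] body = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        return ReadBody(request);
    }

    private static string ReadBody(HttpWebRequest request)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A login would then be something like session.Post(loginUrl, BloglinesSession.BuildForm("email", address, "password", secret)), where loginUrl and both field names are placeholders rather than known Bloglines values. After that, session.Get on the subscribe URL reuses the same cookies.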
That is all I know. The good thing is that this class now allows me to do all sorts of things with it.
- I can write a sync application for my OPML files and Bloglines. [I probably would not do this, though, since the ‘import subscriptions’ route is a lot more efficient.]
- I can subscribe to RSS from the command line. So if I scrape some other site for RSS feeds with a CLI app, I can pipe its output to this app to subscribe.
- I can make this part of my emacs tools. :)
Oh, and thanks to Scott, I came to know of Fiddler. It was the perfect tool for inspecting the innards of the calls in order to figure this stuff out.