It all started with Matt accusing me of thinking I was Neal Stephenson. I threatened to write a gigantic novel with lots of made-up words (think Anathem - which was very good, BTW) and read it to him over dinner. Of course, he challenged me to do so. We were at lunch, so it was easy to imply that my computer was writing it while I was away - stringing random sentences together by searching the internet… This is the point where I suddenly thought, “hey, that doesn’t sound that hard,” went back to my desk, and started playing with regular expressions.
So - with the project set to generate paragraphs from real sentences found on the internet, I went to work. It’s about 5 revisions later, and I’ve created a JSON/JSONP service that, given a URL, scans it for links and sentences and returns both, along with some basic sentence data. Parameters are passed in the query string: url is the scan target, and the optional callback names the JSONP callback function. An example of calling the service follows:
http://artlogic.nfshost.com/ojw.cgi?url=http://james.kruth.org/&callback=myfunc
http://artlogic.nfshost.com/ojw/ojw.cgi?url=http://james.kruth.org/&callback=myfunc
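For anyone calling the service from a script, here is a minimal sketch of how the request URL could be assembled. The helper name build_request_url is made up for illustration, and percent-encoding the target is my assumption: the example above passes it unescaped, but encoding is the safer choice when the target has its own query string.

```python
from typing import Optional
from urllib.parse import urlencode


def build_request_url(endpoint: str, target: str, callback: Optional[str] = None) -> str:
    """Return the service URL for scanning `target` (hypothetical helper)."""
    params = {"url": target}
    if callback is not None:
        # Only needed when the response should be wrapped as JSONP.
        params["callback"] = callback
    # urlencode percent-encodes the target so its own "?" and "&" survive.
    return endpoint + "?" + urlencode(params)


request_url = build_request_url(
    "http://artlogic.nfshost.com/ojw/ojw.cgi",
    "http://james.kruth.org/",
    callback="myfunc",
)
print(request_url)
```

Leaving out callback should give a plain JSON request rather than a JSONP one.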
It returns a JSON object in the following format:
{"status": "success",
 "url": "http://www.digg.com/",
 "links": ["http://www.google.com",
           "http://www.yahoo.com",
           "http://www.reddit.com"],
 "sentences": [{"sentence": "This is a test.",
                "words": 4,
                "avgwordlen": 2.75},
               {"sentence": "Another test sentence.",
                "words": 3,
                "avgwordlen": 6.33}]}
A status of success is what you should always see unless something went wrong. url is the URL that was just scanned, links lists the links found on that page, and sentences lists the sentences found, each with its word count (words) and the average number of letters per word (avgwordlen).
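To make the fields concrete, here is a rough sketch that parses the sample response and recomputes words and avgwordlen for each sentence. The sentence_stats function is hypothetical; the service's real counting rules aren't spelled out here, so this assumes words are split on whitespace, only alphabetic characters count as letters, and the average is rounded to two decimals. Those assumptions do reproduce the sample figures.

```python
import json

# The sample response from this post, written with plain ASCII quotes.
SAMPLE = """
{"status": "success",
 "url": "http://www.digg.com/",
 "links": ["http://www.google.com",
           "http://www.yahoo.com",
           "http://www.reddit.com"],
 "sentences": [{"sentence": "This is a test.",
                "words": 4,
                "avgwordlen": 2.75},
               {"sentence": "Another test sentence.",
                "words": 3,
                "avgwordlen": 6.33}]}
"""


def sentence_stats(sentence: str) -> tuple:
    """Return (word count, average letters per word) under the assumptions above."""
    words = sentence.split()
    if not words:
        return 0, 0.0
    # Count alphabetic characters only, so trailing punctuation is ignored.
    letters = sum(ch.isalpha() for word in words for ch in word)
    return len(words), round(letters / len(words), 2)


data = json.loads(SAMPLE)
if data["status"] == "success":
    for entry in data["sentences"]:
        words, avg = sentence_stats(entry["sentence"])
        print(entry["sentence"], words, avg)
```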
Since it’s JSONP, you can call this from your own site. Please don’t use my bandwidth excessively, and have fun!
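If you'd rather consume the JSONP form outside a browser, the wrapper can be peeled off by hand. This is only a sketch: unwrap_jsonp is a made-up helper, and it assumes the body looks like callback(...) with an optional trailing semicolon, which is the usual JSONP shape but not something I've documented for this service.

```python
import json


def unwrap_jsonp(body: str, callback: str) -> dict:
    """Strip an assumed `callback(...)` wrapper and parse the JSON inside."""
    text = body.strip()
    if text.endswith(";"):
        text = text[:-1].rstrip()
    prefix = callback + "("
    if not (text.startswith(prefix) and text.endswith(")")):
        raise ValueError("response is not wrapped in the expected callback")
    # Everything between the opening and closing parenthesis is the JSON payload.
    return json.loads(text[len(prefix):-1])


payload = unwrap_jsonp('myfunc({"status": "success", "links": []});', "myfunc")
print(payload["status"])
```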