November 04, 2004

Curious, George: Scraping websites

I've got a VERY rudimentary understanding of Perl but, so far, I've been able to put together my website by using free scripts from sites like this one. Alas, my google-fu doesn't seem to be strong enough to find a script for what I want to do next.

I'm using this script to allow visitors to e-mail their friends about my site. I'd like to modify it so that, instead of e-mailing the actual URL of a page, it submits the URL to tinyurl.com, scrapes the new tiny-fied URL from the resulting webpage, and then e-mails that URL instead. It seems like it ought to be fairly easy to do this, but I can't find existing code that I can easily cut-and-paste, and after spending two days trying to figure out how to do it myself, my tiny brain aches. Is there an easy way to do this? Or is this one of those deceptively complex things that I should just leave to the professionals?

  • Assuming that it's on Linux/BSD and you have curl installed, this will find the TinyURL for a given URL. It's a free-standing script. If you need further help, shout.

        #!/usr/bin/env perl
        require CGI;
        $url = $ARGV[0];
        $reply = `curl -s 'http://tinyurl.com/create.php?url=$url'`;
        # I haven't seen any tinyurl's longer than 6, so 7 should be OK:
        if ($reply =~ m!http://tinyurl\.com/[0-9a-z]{3,7}!) {
            $tinyurl = $&;
            print("$tinyurl\n");
        }
        # Put some decent error handling here:
        else {
            print("Error!\n");
        }

    It might be better to use the LWP library instead of curl; this is a bit quick and dirty...
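As a follow-up to the remark about avoiding curl, here is a minimal sketch of the same scrape done without a shell. It swaps in HTTP::Tiny (a core module in recent Perls) rather than LWP, which is my substitution and not what the poster suggested. The `create.php` endpoint and the URL pattern are carried over from the script above and may not match what tinyurl.com serves today. The helper names (`uri_escape_simple`, `extract_tinyurl`, `shorten`) are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTTP::Tiny;    # core since Perl 5.14; used here in place of curl/LWP

# Percent-encode everything except RFC 3986 "unreserved" characters.
# Assumes a byte string (no characters above 0xFF).
sub uri_escape_simple {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Return the first TinyURL-looking link in $html, or undef if none is found.
# The lookahead keeps "http://tinyurl.com/create.php" from matching as "create".
sub extract_tinyurl {
    my ($html) = @_;
    return $html =~ m!(http://tinyurl\.com/[0-9a-z]{3,7})(?![\w.])! ? $1 : undef;
}

# Fetch the create page for $long_url and scrape the short link from it.
# The endpoint is the one used in the post above (an assumption today).
sub shorten {
    my ($long_url) = @_;
    my $endpoint = 'http://tinyurl.com/create.php?url='
                 . uri_escape_simple($long_url);
    my $response = HTTP::Tiny->new(timeout => 10)->get($endpoint);
    return undef unless $response->{success};
    return extract_tinyurl($response->{content});
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $tiny = shorten($ARGV[0]);
    if (defined $tiny) {
        print "$tiny\n";
    }
    else {
        print STDERR "Could not get a TinyURL\n";
        exit 1;
    }
}
```

Because the URL is escaped and never passed through a shell, quotes or backticks in the input end up as harmless `%27`-style sequences in the request.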
  • You would grab WWW::Shorten::TinyURL and install it. On the third line, right under "Use Sockets", put "use WWW::Shorten::TinyURL;". Getting the URL that a visitor was at beforehand is a matter of checking the referer, which you can get with "$SITE_URL = makeashorterlink($ENV{HTTP_REFERER});". You must make sure this environment variable contains a site in your own domain. Seriously. Otherwise people could use this script to promote their sites on your dime. Watch out. This script looks like it was written before CGI.pm was in existence, and it might have other exploitable problems that I couldn't find in a brief look. If you're really new to Perl (does "use module;" make sense to you? Do you know how to install WWW::Shorten::TinyURL from CPAN?) you might consider looking at the NMS scripts, the TFMail script in particular. If you'd like more help, just drop me a line. On preview, I'd suggest not using ThreeDayMonk's code because the "$reply..." line can be made to do unwholesome things on the server the script runs on (see perldoc perlsec for information on taint mode and why you cannot trust user-supplied data). WWW::Shorten::TinyURL does exactly the same thing as his script anyway.
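The "make sure the referer is in your own domain" advice above could be sketched roughly as follows. This is an illustration only: `referer_is_ours` is a hypothetical helper, `example.com` is a placeholder for your own domain, and the Referer header is supplied by the client, so the check limits casual abuse rather than guaranteeing anything.

```perl
use strict;
use warnings;

# Return 1 only if $referer is an http(s) URL whose host is $domain
# or a subdomain of it; return 0 otherwise (including a missing referer).
sub referer_is_ours {
    my ($referer, $domain) = @_;
    return 0 unless defined $referer;

    # Capture the host part of scheme://host[:port][/path...]
    my ($host) = $referer =~ m{^https?://([^/:?#]+)}i
        or return 0;

    $host   = lc $host;
    $domain = lc $domain;
    return ($host eq $domain || $host =~ /\.\Q$domain\E\z/) ? 1 : 0;
}

# How it might be wired into the mail script (not executed here):
#
#   my $referer = $ENV{HTTP_REFERER};
#   if (referer_is_ours($referer, 'example.com')) {
#       $SITE_URL = makeashorterlink($referer);   # from WWW::Shorten::TinyURL
#   }
#   else {
#       # refuse, or fall back to a fixed URL on your own site
#   }
```

Matching on the parsed host, rather than searching the whole string for the domain, is what rejects look-alikes such as `http://example.com.evil.net/`.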
  • ThreeDayMonk and Boo (and also DangerIsMyMiddleName, who has replied by e-mail)--thanks. I'm still having some troubles. ThreeDayMonk, your script runs brilliantly from the command line, but I get the dreaded "Internal Server Error" message when I try to run it from my web browser. What am I doing wrong? boo_radley and DangerIsEtc, I do indeed just barely know how to install a module from CPAN--that's one of the things I've managed to teach myself over the past two days. I've now managed to install WWW::Shorten::TinyURL, but it seems to require Bundle::LWP. When I try to install that, I end up (after a long series of dependency installs) with the following message:
    Makefile:85: *** missing separator. Stop. /usr/bin/make -- NOT OK Running make test Can't test without successful make Running make install make had returned bad status, install seems impossible.
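On the "Internal Server Error" mentioned above: nobody in the thread diagnoses it, so this is only a guess, but one common cause is a CGI script that prints its output without first printing a header. A script run from the command line needs no header; under a web server it must emit at least a Content-Type line followed by a blank line. A minimal sketch, with `cgi_response` as a hypothetical helper name:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a complete CGI response: header line, blank line, then the body.
sub cgi_response {
    my ($body) = @_;
    return "Content-Type: text/plain\r\n\r\n" . $body;
}

# GATEWAY_INTERFACE is set by the web server when running a script as CGI,
# so this only prints when the script is invoked that way.
if ($ENV{GATEWAY_INTERFACE}) {
    print cgi_response("Hello from CGI\n");
}
```

Other usual suspects are file permissions and a wrong interpreter path on the first line; the server's error log normally says which one it is. Note also that under CGI the URL would have to come from a request parameter, since `$ARGV[0]` is not filled in the way it is on the command line.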
  • OK, I've poked around further, and the problem seems to be that, in order to install Bundle::LWP, I first need to install HTML::Parser. But trying to install HTML::Parser results in the following:
    make: *** [test_dynamic] Error 29 /usr/bin/make test -- NOT OK Running make install make test had returned bad status, won't install without force
  • boo_radley - you are entirely correct about the security problem in that code. Of course, one should always sanitise user input before passing it to an external program. WWW::Shorten::TinyURL is interesting, though. Mind you, I do all my scripting in Ruby these days - my Perl is decidedly rusty.
  • jacobw - At a guess, the problem is that you don't have any development tools installed. Which OS/distribution are you using? And why did I write an apostrophe in "tinyurl's"? It shouldn't be there.
  • According to uname, I'm using Unix 2.4.21-20.EL.
  • jacobw - It sounds like you aren't the administrator. You will need access to a compiler to install some Perl modules, so you'll have to ask your administrators to grant you this. Alternatively, you could ask them to install the modules you need on a site-wide basis.
  • DangerIsMyMiddleName was kind enough to look at the complete lengthy error message, and he determined that the tests were being passed, but, for some reason, were coming out as negative. So, his suggested workaround was just to install the modules manually from my shell, rather than using "cpan> install". He also advised not running "make test" since it seemed to be generating inexplicable errors. This did the trick, and the relevant webpages are here, if you want to see the very simple result of all this effort. Thanks, all, for the help! I really appreciate it.
  • Heh, I like it!
  • Thanks, Tracicle!