
10 posts from June 2009


Drupal: One Fix for 403-Forbidden Errors from RSS Aggregators

While working with a colleague today on one of his personal sites, we came across an issue with a Drupal-managed RSS feed where some RSS aggregators were receiving 403 Forbidden status codes. After quite a bit of poring through all the configuration and .htaccess files included with CentOS and Drupal, our focus kept returning to the following section of his httpd.conf, which turned out to be the culprit (specifically, the RewriteCond and the RewriteRule immediately after it):

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteRule ^/$ /home/ [R]
RewriteCond %{HTTP_USER_AGENT} http://.*\.com
RewriteRule .* - [F]
RewriteRule ^/$ home/$1 [R=301,L]
</IfModule>

What this states is that if the User-Agent connecting to the site contains an http:// URL ending in .com, access is forbidden. Plenty of legitimate User-Agents stuff a URL inside the user-agent string, for example msnbot/2.0b (+http://search.msn.com/msnbot.htm) or, in my friend's case, Yahoo's FeedSeeker, which identifies itself with the following user-agent string: YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide).
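To see why those aggregators were caught, here is a quick sketch in Python of the same pattern (the Apache regex translated verbatim; the plain-browser user-agent is made up for contrast):

```python
import re

# The pattern from the RewriteCond: any User-Agent containing an
# http:// URL whose host ends in ".com" triggers the [F] (403) rule.
pattern = re.compile(r"http://.*\.com")

agents = {
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)": True,
    "YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; "
    "http://publisher.yahoo.com/rssguide)": True,
    "Mozilla/5.0 (Windows; U; Windows NT 5.1)": False,
}

for ua, expected in agents.items():
    blocked = bool(pattern.search(ua))
    print(f"{'BLOCKED' if blocked else 'allowed'}: {ua}")
    assert blocked == expected
```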

Neither one of us could figure out why this particular rewrite condition was in place; as a bad-bot killer it is far too generic. Getting rid of the condition and rule fixes the 403 Forbidden problem for RSS aggregators, but it does leave the site exposed to bad robots and site rippers, so it makes sense to replace the rule with something like the following:

RewriteEngine on
RewriteRule ^/$ /home/ [R]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule .* - [F]
RewriteRule ^/$ home/$1 [R=301,L]

The rewrite conditions above were collected from many, many different SEO-related links that turned up in a Google search. Search your favorite search engine for "RewriteCond %{HTTP_USER_AGENT}" and you should get pages and pages of results.
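As a sanity check, here is a sketch in Python of how a few of those patterns behave (the Apache escapes translated to plain Python regexes). The key point is that most of the patterns are anchored with ^, so they match only at the start of the User-Agent; HTTrack is unanchored and case-insensitive ([NC]). Feed readers that merely embed a URL in their user-agent string now pass:

```python
import re

# A few patterns from the ruleset above, translated to Python regexes.
anchored = [re.compile(p) for p in (r"^Wget", r"^WebZIP", r"^Teleport Pro")]
unanchored = [re.compile(r"HTTrack", re.IGNORECASE)]

def is_blocked(ua):
    """Return True if any blocklist pattern matches the user-agent."""
    return any(p.search(ua) for p in anchored + unanchored)

assert is_blocked("Wget/1.11.4")
assert is_blocked("Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)")
# A feed reader with a URL in its UA string is no longer caught:
assert not is_blocked(
    "YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; "
    "http://publisher.yahoo.com/rssguide)")
print("blocklist behaves as expected")
```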


JBoss: Clustered Node Startup Failures

As far as I can tell, JBoss clustering is built on functionality provided by another JBoss project called JGroups. We recently ran into an issue where half of our six identically configured application servers simply would not start. Since the servers were all generated from the same base image, server configuration was not thought to be the culprit, and all nodes were on the same subnet, so we were a bit puzzled. In the logs on the failing servers, we saw messages that looked like the following:

ERROR [org.jgroups.protocols.pbcast.GMS] [some_host:some_port] received view <= current view; discarding it (current vid: [some_host:some_port|4], new vid: [some_host:some_port|4])


WARN [org.jgroups.protocols.pbcast.NAKACK] [some_host:some_port (additional data: 19 bytes)] discarded message from non-member some_host:some_port (additional data: 19 bytes)

When setting up a cluster of JBoss servers, even though the docs don't really require it, your network administrators will appreciate it if you place all of the servers on the same subnet. JGroups uses IP multicast pings to maintain membership in the cluster, and network administrators hate it when you multicast across subnets. When your servers have dual NICs configured to fail over on fault, it's really nice when the primary NIC on each box is plugged into the primary switch, and even nicer when half of the boxes use switch A as their primary and the other half use switch B. That layout, however, means you need to be able to pass multicast IP traffic between the two switches, which was the problem in our case. So, if you should happen to come across this condition, check whether multicast IP is enabled on your switches and whether your primary and secondary switches pass multicast traffic between them.
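If you want to verify multicast connectivity between two nodes before blaming the switches, JGroups ships a pair of test programs for exactly this purpose. The jar path, multicast address, and port below are placeholders; substitute whatever your cluster partition actually uses:

```shell
# On the receiving node (adjust the path to jgroups.jar and the
# address/port to match your cluster's configuration):
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest \
     -mcast_addr 228.1.2.3 -port 45566

# On the sending node, then type a line and hit Enter; if the
# receiver prints it, multicast is passing between the two boxes:
java -cp jgroups.jar org.jgroups.tests.McastSenderTest \
     -mcast_addr 228.1.2.3 -port 45566
```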


The Great SSL Extended Validation Certificate Mystery

You know, these extended validation certificates really bug me--more so than they probably should, but they really, really bug me. The premise behind them is easy enough to understand--we'll color your address bar green (or provide some other green-hued visual cue) to let your users know that you spent tons more money on the same level of encryption. Some sites have reported increased conversion rates which, in the minds of the site owners, more than makes up for the cost, so if you've bought them and you are happy with them, that's super.

I get a lot of hits to this blog where "extended validation" shows up somewhere in the keyword search and I have a question for my readers who also happen to be developers. Are extended validation certificates difficult to work with?  Does the slightest idiosyncrasy in markup on a page wreak havoc with them? Today's example is with Firefox 3.5 Preview, Internet Explorer 7, Safari 4.0, and the mozilla add-ons site.

Open https://addons.mozilla.org/en-US/firefox/ in one of these browsers--let's start with IE7. The site is encrypted using a GlobalSign Extended Validation certificate and before anyone in P.R. freaks, I'm not slamming any company in this post. In IE7, you get the green bar:


Displaying the certificate's extended details, though, doesn't give you something that a typical user on the Internet would find extremely helpful: an answer to the question "Should I trust this site?"  Instead of a nice little "Yes" message, clicking the link gets you a Microsoft help page listing all the different ways your address bar could be colored, with each entry describing ways you could still not be protected.

Switching to Firefox 3.5 Preview (this behavior existed in Beta 4 as well), you get a blue bar instead of a green one:

Is this a bug?  Is there something wrong with the page?  It doesn't appear that Firefox can't display EV certs, since my health insurer's site displays as expected.  (Update: It appears to be a bug.  Other sites using GlobalSign EV SSL certificates don't display correctly either.  Check out the demo site: https://ev.globalsign.com/  Update 2: This bug exists in Firefox 3.5 RC1 as well.  I had opened a bug report through Bugzilla, but it was closed as a duplicate.)

Finally, I'm liking how Safari handles them--you can't really tell that an EV cert is being used unless you hover the mouse over the green Mozilla Corporation text next to the prominently displayed RSS button:


It's almost as if the Safari developers are saying, "Yeah...we aren't too sure about these things either".

Now, let's switch back to IE7 since they so prominently display the issue and go to https://blogs.verisign.com/.  Again, I'm not picking on Verisign this time--just using their site to display the issue (and yes, I understand that one wouldn't normally try connecting to a blog over an encrypted channel--humor me!). At the start, everything looks fine:


Click on the link for Tim Callan's Web Blog and everything is still fine:


Go back and then click on the link for the new Web User Experience Blog, and you get warned about a mix of SSL and non-SSL items on the page; the green bar vanishes even though the site name didn't change:


So what's going on here?  Is there some absolute http URL in the HTML somewhere that is throwing off IE?  I don't really know, and since this is a blog site rather than an e-commerce site I'm buying from, it's not that big a deal, but it does help illustrate my point that browsers don't yet seem to work well with EV certs.  Is whatever causes the problem on this blog equally easy to trigger on a site where visitors might actually be buying something?  If so, do we now need to consider writing an Extended Validation Certificate-Using Web Site Markup Validation tool to make sure the green bar always displays as expected?  I wouldn't want to write one without first knowing all the ways one can break the green bar--and I don't yet know them all.
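For what it's worth, a first cut at such a tool wouldn't need to be fancy. Here is a rough Python sketch (the sample HTML and URLs are made up for illustration) that flags resources referenced over plain http://, the usual trigger for the mixed-content warning. A real checker would also need to distinguish loaded resources from ordinary hyperlinks:

```python
import re

# Match src= or href= attributes whose value is an absolute http:// URL.
MIXED = re.compile(r"""(?:src|href)\s*=\s*["'](http://[^"']+)["']""",
                   re.IGNORECASE)

def find_insecure_resources(html):
    """Return the absolute http:// URLs referenced via src= or href=."""
    return MIXED.findall(html)

sample = """
<img src="https://blogs.verisign.com/logo.png">
<script src="http://example.com/tracker.js"></script>
<link rel="stylesheet" href="http://example.com/style.css">
"""

for url in find_insecure_resources(sample):
    print("insecure resource:", url)
```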

UPDATE:  Today's (July 17, 2009) release of Firefox 3.5.1 appears to fix one problem I reported with GlobalSign's Extended Validation certificates: the location bar now displays green when connecting to GlobalSign's EV test site (https://ev.globalsign.com/), but it still doesn't display green on https://addons.mozilla.org/ (on my Mac at least).  This provides a good example of the basic problem I see with providing this kind of visual cue to end users.  Both sites appear to be signed by the same CA certificate, but one displays as expected and the other doesn't.  If I were to guess, I would think that something on the page is protected by a certificate signed by a different CA, or something on the page is being delivered over HTTP by way of an absolute URL.  I confess I haven't figured out what it is yet.