On 7/4/08, Jeff Cunningham jeffrey@cunningham.net wrote:
> Before I block them altogether, there is one thing I don't understand that I'm hoping someone can explain to me. What does it mean exactly when I get a "No session for session identifier" INFO message in my error_log? There is one of these for each of the Googlebot hits.
It means that Googlebot presented a session identifier string in the hunchentoot-session parameter that is not valid. You are probably starting sessions on many of your pages, and the Google crawler hit one of the URLs on your server that starts a session. As the crawler did not accept the cookie that Hunchentoot sent, Hunchentoot fell back to attaching the session identifier to all URLs in the outgoing HTML as a GET parameter. The crawler saved the URLs it saw, including the session identifiers, and now tries to crawl using them; those sessions have long since expired, so the identifiers are no longer valid.
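For illustration, a rewritten link in the generated HTML might look something like this (the host and the session string are made up; the exact format depends on your Hunchentoot version):

  http://example.com/some/page?hunchentoot-session=42%3AABCDEF0123456789

When Googlebot later requests such a URL, session 42 no longer exists on the server, and Hunchentoot logs the "No session for session identifier" INFO message you are seeing.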
First off, I would recommend that you switch off session URL rewriting (*REWRITE-FOR-SESSION-URLS*, see http://weitz.de/hunchentoot/#*rewrite-for-session-urls*). I am not using it myself, precisely because it confuses simple crawlers. If users do not accept the cookies my site sends, they will not be able to use it with sessions; for me, this has never been a problem. This will probably not help you with your current problem, but it will make things easier in the future.
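A minimal sketch of how to do that, assuming you configure Hunchentoot at load time (the variable is the one documented at the URL above):

  ;; Disable rewriting of session identifiers into outgoing URLs.
  ;; Sessions will then only work for clients that accept cookies,
  ;; and crawlers will never see session identifiers in links.
  (setq hunchentoot:*rewrite-for-session-urls* nil)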
In general, crawlers support neither cookies nor session ids in GET parameters. Thus, if you want to support crawlers, your pages need to work without sessions. Note that if you do nothing except switch off URL rewriting, every request from a crawler will still create a new session on any page that starts one. This may or may not be a problem.
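One way to avoid piling up crawler sessions is to not start a session for known crawlers in the first place. A rough sketch, assuming your handlers call START-SESSION explicitly (the User-Agent check is simplistic and only meant to illustrate the idea):

  (defun maybe-start-session ()
    ;; Only start a session for clients that look like real browsers.
    ;; Well-behaved crawlers identify themselves in the User-Agent header.
    (let ((agent (hunchentoot:user-agent)))
      (unless (and agent (search "Googlebot" agent))
        (hunchentoot:start-session))))

Simpler still: only call START-SESSION on the pages that actually need per-user state, so that crawlable pages never touch the session machinery at all.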
I guess that Google now has a lot of your URLs queued for crawling, because the different session identifiers made it think that all of them point to different resources. I am kind of wondering whether that is standard Googlebot behaviour.
Lastly, I would vote for Hunchentoot switching off URL rewriting by default.
-Hans