Hans Hübner wrote:
It means that googlebot presented a session identifier string as a hunchentoot-session parameter that is not valid. You are probably using sessions very frequently and the Google crawler managed to hit one of the URLs of your server that starts a session. As the crawler did not accept the cookie that Hunchentoot sent, Hunchentoot fell back to attaching the session identifier to all URLs in the outgoing HTML as a parameter. The crawler saved the URLs it saw, including the session identifier, and now tries to crawl using these identifiers, which are probably old and no longer valid.
First off, I would recommend that you switch off URL-REWRITE (http://weitz.de/hunchentoot/#*rewrite-for-session-urls*). I am not using it myself precisely because it confuses simple crawlers. If a user does not accept the cookies my site sends, they will not be able to use it with sessions. For me, this has never been a problem. This will probably not help you with your current problem, but it will make things easier in the future.
In general, crawlers do not support cookies or session ids in GET parameters. Thus, if you want to support crawlers, you need to make them work without sessions. Note that if you do nothing except switch off URL-REWRITE, every request from a crawler will create a new session. This may or may not be a problem.
I guess that Google now has a lot of your URLs it wants to crawl because the different session identifiers made it think that all of them are pointing to different resources. I am kind of wondering whether that is standard googlebot behaviour.
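To illustrate why a crawler could end up with many copies of the same page, here is a minimal sketch in Python (used only for illustration, not Hunchentoot code). The parameter name comes from this thread; the example URLs and the helper `strip_session` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAM = "hunchentoot-session"  # parameter name as mentioned in the thread


def strip_session(url: str) -> str:
    """Return url with the session parameter removed from its query string."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != SESSION_PARAM]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs as a crawler might have stored them on two visits.
a = "http://example.com/page?hunchentoot-session=1%3AAAAA"
b = "http://example.com/page?hunchentoot-session=2%3ABBBB"

print(a == b)                                # False: two distinct URLs
print(strip_session(a) == strip_session(b))  # True: the same resource
```

Compared as plain strings, the two URLs differ, so a crawler that does not know the parameter is a session identifier has no reason to treat them as one page.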
Lastly, I would vote for switching off URL-REWRITE by default.
Thanks for the excellent explanation. It fits all the available facts. I've turned off *REWRITE-FOR-SESSION-URLS*, so presumably Google should eventually figure out that the URLs it has are bad and drop them in favor of the sessionless ones (I hope).
I switched to a non-googlebotted site to experiment with, and for some reason, even when I'm not using sessions, I see a message about "No session for session identifier..." when I browse a page myself. I cleared my cache; here's an example:
[2008-07-04 14:46:34 [WARNING]] Fake session identifier '1:D5C66E2968BE2162C3164B39B9029F13' (User-Agent: 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)', IP: '127.0.0.1')
That error message corresponds to this access log entry and this header output:
127.0.0.1 (192.168.1.1) - [2008-07-04 14:46:34] "GET / HTTP/1.1" 200 9195 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)"
GET / HTTP/1.1
Host: 127.0.0.1:4242
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Cookie: hunchentoot-session=1%3AD5C66E2968BE2162C3164B39B9029F13
Max-Forwards: 10
X-Forwarded-For: 192.168.1.1
X-Forwarded-Host: cunningham.homeip.net
X-Forwarded-Server: test.com
Connection: Keep-Alive
HTTP/1.1 200 OK
Content-Length: 9195
Date: Fri, 04 Jul 2008 21:46:34 GMT
Server: Hunchentoot 1.0.0
Keep-Alive: timeout=20
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
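
One thing that stands out in the dump: the identifier in the warning looks like the URL-decoded value of the Cookie header in the same request. That would suggest the browser is still sending an old session cookie, since clearing the cache does not necessarily clear cookies. A quick check, sketched in Python purely for illustration, with both values copied from the output above:

```python
from urllib.parse import unquote

# Value of the hunchentoot-session cookie from the request headers.
cookie_value = "1%3AD5C66E2968BE2162C3164B39B9029F13"
# Session identifier reported in the warning.
logged_id = "1:D5C66E2968BE2162C3164B39B9029F13"

decoded = unquote(cookie_value)  # %3A is the percent-encoding of ':'
print(decoded == logged_id)      # True
```

If that reading is right, deleting the cookie for this host in the browser should make the warning go away, though I have not verified that here.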
--Jeff