Re: [hunchentoot-devel] googlebot revisitation rate excessive?

4 Jul 2008

      Hans Hübner wrote:
...
It means that googlebot presented a session identifier string as a
hunchentoot-session parameter that is not valid.  You are probably
using sessions very frequently and the Google crawler managed to hit
one of the URLs of your server that starts a session.  As the crawler
did not accept the Cookie that Hunchentoot sent, Hunchentoot fell back
to attaching the session identifier to all URLs in the outgoing HTML
as a parameter.  The crawler saved the URLs it saw including the
session identifier and now tries to crawl using these identifiers,
which are probably old and no longer valid.

First off, I would recommend that you switch off URL-REWRITE
(http://weitz.de/hunchentoot/#*rewrite-for-session-urls*).  I am not
using it myself precisely because it confuses simple crawlers.  If a
user does not accept the cookies my site sends, they will not be able
to use it with sessions.  For me, this has never been a problem.  This
will probably not help you with your current problem, but it will make
things easier in the future.

In general, crawlers do not support cookies or session ids in GET
parameters.  Thus, if you want to support crawlers, you need to make
them work without sessions.  Note that if you just do nothing except
switching off URL-REWRITE, every request from a crawler will create a
new session.  This may or may not be a problem.

I guess that Google now has a lot of your URLs it wants to crawl
because the different session identifiers made it think that all of
them are pointing to different resources.  I am kind of wondering
whether that is standard googlebot behaviour.

Lastly, I would vote for switching off URL-REWRITE by default.

Thanks for the excellent explanation. It fits all the available facts.
I've turned off *REWRITE-FOR-SESSION-URLS*, so presumably Google should
eventually figure out that the URLs it has are bad and drop them in favor of
the sessionless ones (I hope).
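
To illustrate why a crawler ends up with many distinct URLs for one page, here is a small Python sketch (independent of Hunchentoot, purely for illustration). The parameter name hunchentoot-session comes from this thread; the example.com URLs, the session values, and the helper strip_session_param are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAM = "hunchentoot-session"  # parameter name mentioned in the thread


def strip_session_param(url: str) -> str:
    """Return url with the session parameter removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != SESSION_PARAM
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs of the kind a crawler might have stored.
crawled = [
    "http://example.com/page?hunchentoot-session=1%3AAAAA",
    "http://example.com/page?hunchentoot-session=2%3ABBBB",
    "http://example.com/page",
]

unique_before = set(crawled)                                # compared as plain strings
unique_after = {strip_session_param(u) for u in crawled}    # session parameter ignored
```

Compared as plain strings, the three URLs are all different; once the session parameter is ignored they collapse to the single sessionless URL.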

I switched to a non-googlebotted site to experiment with, and for some
reason, even when I'm not using sessions, I see a message about "No
session for session identifier..." when I browse a page myself. I
cleared my cache; here's an example:

[2008-07-04 14:46:34 [WARNING]] Fake session identifier '1:D5C66E2968BE2162C3164B39B9029F13' (User-Agent: 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)', IP: '127.0.0.1')

That error message corresponds to this access log entry and this header 
output:

127.0.0.1 (192.168.1.1) - [2008-07-04 14:46:34] "GET / HTTP/1.1" 200 9195 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)"

GET / HTTP/1.1
Host: 127.0.0.1:4242
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Cookie: hunchentoot-session=1%3AD5C66E2968BE2162C3164B39B9029F13
Max-Forwards: 10
X-Forwarded-For: 192.168.1.1
X-Forwarded-Host: cunningham.homeip.net
X-Forwarded-Server: test.com
Connection: Keep-Alive

HTTP/1.1 200 OK
Content-Length: 9195
Date: Fri, 04 Jul 2008 21:46:34 GMT
Server: Hunchentoot 1.0.0
Keep-Alive: timeout=20
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
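
As a quick sanity check, the Cookie header above appears to carry the same identifier that the warning reports, only percent-encoded (%3A is a colon). A minimal Python check, independent of Hunchentoot and using only the two values pasted in this message:

```python
from urllib.parse import unquote

cookie_value = "1%3AD5C66E2968BE2162C3164B39B9029F13"  # value from the Cookie header
logged_id = "1:D5C66E2968BE2162C3164B39B9029F13"       # identifier from the warning

decoded = unquote(cookie_value)  # percent-decoding turns %3A into ":"
print(decoded == logged_id)      # prints True
```

So it looks like the browser is still sending an old session cookie with the request, and the warning refers to that cookie rather than to anything in the URL.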

--Jeff

Jeff Cunningham