Date: Tue, 30 Mar 2004 12:12:58 -0600 (CST)
From: Gilles Detillieux
To: Toby Thain
Cc: David Adams, "ht://Dig mailing list"
Subject: Re: [htdig] query parameters should be ignored by extension filter? - PATCH for 3.1.6

Last week, I wrote:
> According to David Adams:
> > I am also using ht://Dig version 3.1.6 and for me it IS indexing URLs like
> >
> > http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
> >
> > even though I have .jpg in my bad_extensions: list.
>
> Actually, I find this surprising. Upon looking at the code that handles
> bad_extensions, in both 3.1.6 and 3.2.0b5, it seems to me that there is
> indeed a bug in the way htdig locates filename extensions in URLs, as
> Toby described. Can you confirm that you're running vanilla 3.1.6 with
> no patches to htdig/Retriever.cc which might correct this bug?
>
> The fix to the code should be pretty simple, but I haven't had the time
> to sit down and stare at it long enough to get the fix coded yet. I'll
> try to get around to it by Friday, so it'll be in the next development
> snapshot for the 3.2 betas, and posted to the list.

OK, last week got a bit crazy, so I wrote the patch yesterday afternoon,
just before the end of my work day. Here it is. Apply it in your main
3.1.6 source directory using "patch -p0 < this-message-file". Please let
me know if it solves the problem for you and/or causes others. I've made
sure the code compiles with the patch, but haven't tested it beyond that.
Thanks.

--- htdig/Retriever.cc.orig	2002-01-25 07:44:49.000000000 -0600
+++ htdig/Retriever.cc	2004-03-29 17:40:07.000000000 -0600
@@ -711,16 +711,17 @@ Retriever::IsValidURL(char *u)
     //
     // See if the path extension is in the list of invalid ones
     //
-    char	*ext = strrchr(url, '.');
+    String	urlpath = url.get();
+    int		parm = urlpath.indexOf('?');	// chop off URL parameter
+    if (parm >= 0)
+	urlpath.chop(urlpath.length() - parm);
+    char	*ext = strrchr(urlpath, '.');
     String	lowerext;
     if (ext && strchr(ext, '/'))	// Ignore a dot if it's not in the
       ext = NULL;			// final component of the path.
     if (ext)
     {
       lowerext = ext;
-      int parm = lowerext.indexOf('?');	// chop off URL parameter
-      if (parm >= 0)
-	  lowerext.chop(lowerext.length() - parm);
       lowerext.lowercase();
       if (invalids->Exists(lowerext))
 	{

-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
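
For anyone who wants to check the intended behaviour outside of htdig, here
is a minimal standalone sketch of the extension lookup as the patch orders
it: drop the query string first, then find the last dot, then ignore that
dot if a '/' follows it. It uses std::string rather than htdig's String
class, and the function name path_extension is hypothetical, not something
in Retriever.cc. Treat it as an illustration under those assumptions, not
as the patched code itself.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of htdig): returns the lowercased extension
// of the final path component, including the leading dot, or an empty
// string if there is none.
std::string path_extension(const std::string &url)
{
    // Chop off the URL parameters first, so that a dot inside the query
    // string can never be taken for the start of the extension.
    std::string urlpath = url.substr(0, url.find('?'));

    std::string::size_type dot = urlpath.rfind('.');
    if (dot == std::string::npos)
        return "";

    std::string ext = urlpath.substr(dot);

    // Ignore a dot if it's not in the final component of the path.
    if (ext.find('/') != std::string::npos)
        return "";

    // Lowercase, as the original code does before the bad_extensions lookup.
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}
```

With this ordering, the gallery.php URL quoted above is judged on ".php"
rather than on the ".jpg" inside its photo= parameter.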