Getting the Right Results
One of the challenges with the Mini has been getting it to return results without duplicates. We are using an ISAPI rewrite to make our URLs search-engine friendly.
Rather than:
mysite.com?product_id=365&department_id=2
our URLs look more like:
mysite.com/p/365,2_Big-Purple-Widget.html
The rewriter knows that “365” is the key for the product and that “2” is the key for a category, and the server can then render the page accordingly. The “Big Purple Widget” part is merely a way of introducing keywords into the URL to benefit our positioning in organic search engine results.
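To make the mapping concrete, here is a minimal sketch in Python of what the rewrite amounts to. It is not the actual ISAPI rule: the regular expression, the function name and the leading "/?" in the output are assumptions made for illustration, and only the two parameter names come from the query-string URL shown earlier.

```python
import re
from typing import Optional

# Assumed pattern for the friendly form: /p/<product id>,<category id>_<keywords>.html
FRIENDLY = re.compile(r"^/p/(\d+),(\d+)_[^/]*\.html$")

def rewrite(path: str) -> Optional[str]:
    """Map a friendly path to the query-string form, or None if it does not match."""
    match = FRIENDLY.match(path)
    if match is None:
        return None
    product_id, category_id = match.groups()
    # The keyword text is ignored; only the two numeric keys select the page.
    return f"/?product_id={product_id}&department_id={category_id}"

print(rewrite("/p/365,2_Big-Purple-Widget.html"))
# prints /?product_id=365&department_id=2
```

Because the keyword part is thrown away, any text between the underscore and ".html" resolves to the same page.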
The challenge is that the same product can appear under many different categories and can have as many different URLs.
Consider the following URLs:
mysite.com/p/365,11_Big-Purple-Widget.html
mysite.com/p/365,900_Big-Purple-Widget.html
mysite.com/p/365,,_Big-Purple-Widget.html
Each URL displays the same data, with the second numeric sequence (11 or 900) affecting only the way the category navigation tree is rendered. The last URL, which has no category ID, is a direct path to the product.
The Mini sees each of the pages above as different pages and will return each page for any query for big purple widgets. Using the “filter=p” or “filter=1” option in the URL does not achieve the desired results because those filters respectively screen out duplicate information in the same directory and duplicate snippets. Using the googleoff/googleon comment tags didn’t work either.
The key to solving the issue was to leverage the last URL in the series, the direct path with no category information. First, I wrote a script to generate a page of links to the direct product pages for the Mini to crawl. Second, I used a regular expression to instruct the Mini not to return results for any page whose URL contains the second set of numerals (the category ID).
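The exclusion rule can be sketched in Python as follows. The pattern is my own guess at an expression that matches only the category-bearing form, and it is written in Python's regex syntax, not necessarily the syntax the Mini expects in its admin console; the function name is hypothetical.

```python
import re

# Assumed pattern: a product URL whose second slot (the category ID) is filled in.
HAS_CATEGORY = re.compile(r"/p/\d+,\d+_")

def is_category_url(url: str) -> bool:
    """True for URLs that carry a category ID and should be kept out of results."""
    return HAS_CATEGORY.search(url) is not None

urls = [
    "mysite.com/p/365,11_Big-Purple-Widget.html",
    "mysite.com/p/365,900_Big-Purple-Widget.html",
    "mysite.com/p/365,,_Big-Purple-Widget.html",
]
kept = [u for u in urls if not is_category_url(u)]
print(kept)
# prints ['mysite.com/p/365,,_Big-Purple-Widget.html']
```

Only the direct path survives, so each product has a single URL left to appear in results.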
It works nicely. The only problem is that with over 30,000 unique SKUs and over 6,000 product display groups, the links page is quite large and takes too long to render if pulled from the database on each request. I solved this problem by using a scheduled task to create the pages on a nightly basis, splitting them alphabetically to keep their size down (the Mini won’t index an HTML document larger than 2.5 MB).
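A rough sketch of the nightly page generation, in Python, might look like the following. It is not the script I actually run: the (product ID, slug) input, the file naming, the helper names and the byte value used for the 2.5 MB limit are all assumptions, and the database query and the scheduler are left out.

```python
from collections import defaultdict
from html import escape

MAX_BYTES = 2_500_000  # assumed reading of the 2.5 MB limit mentioned above

def direct_url(product_id: int, slug: str) -> str:
    """Direct product path with the category slot left empty."""
    return f"/p/{product_id},,_{slug}.html"

def build_link_pages(products):
    """Group (product_id, slug) pairs by first letter and render one HTML page per group."""
    groups = defaultdict(list)
    for product_id, slug in products:
        first = slug[:1].upper()
        # Anything that does not start with A-Z goes on a catch-all page.
        key = first if "A" <= first <= "Z" else "other"
        groups[key].append((product_id, slug))

    pages = {}
    for key, items in sorted(groups.items()):
        links = "\n".join(
            f'<a href="{escape(direct_url(pid, slug))}">{escape(slug)}</a><br>'
            for pid, slug in sorted(items, key=lambda item: item[1])
        )
        html = f"<html><body>\n{links}\n</body></html>\n"
        if len(html.encode("utf-8")) > MAX_BYTES:
            raise ValueError(f"page {key!r} exceeds the size limit; split it further")
        pages[f"links_{key}.html"] = html
    return pages

pages = build_link_pages([(365, "Big-Purple-Widget"), (12, "Blue-Gadget"), (77, "Small-Red-Widget")])
print(sorted(pages))
# prints ['links_B.html', 'links_S.html']
```

The scheduled job would then write each entry of the returned dictionary to a static file for the Mini to crawl, so nothing is rendered from the database at crawl time.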