Guest Commentary – Vertical Web Search: One Concept, Lots of Approaches
The views expressed in Guest Commentary pieces do not necessarily reflect those of ComparisonEngines.com.
People throw around a lot of fancy techno-babble when introducing me to their vertical search engines. This is my attempt to develop a common foundation for understanding crawlers vs. scrapers vs. whatever else is out there. I asked Jon Glick to write up his opinions for me. Jon Glick is the Sr. Director of Product Search at Become.com. He has previously managed search products for AltaVista and Yahoo!, and is an unabashed fan of full-on crawling.
Please add your own comments below…would love to get a real discussion going on this.
It’s no secret that search is hot right now, and lots of vertical search engines are entering the fray. Most claim to search the web, and many claim to have an elusive “secret sauce” leading to superior performance. So what types of search technologies are out there and how unique/useful are they?
Full-on Web Crawlers: This is the rarest breed, basically companies going out and building a fully functional web crawler akin to Googlebot, MSNBot, or Yahoo! Slurp. These crawlers explore the vastness of the web, following links from page to page and site to site, hoovering up billions of pages.
Advantages: Full-on crawlers are unmatched for comprehensiveness. For vertical searches they can often crawl deeper, and have a larger number of relevant pages than general search engines like Google and Yahoo!. Their technology and machine learning systems tend to be highly automated and easily scale to encompass additional information, or serve additional markets.
Challenges: Because they can find tons of content, these crawlers have to be paired with very advanced relevancy algorithms. The bigger the haystack you create, the harder it is to find the needle. In exploring the whole web they also tend to find lots of spam, porn, and “spider traps” (link structures that cause crawlers to get caught in endless loops), so advanced handling of harmful content is essential. Big indexes also need dozens of servers per instance, as your typical multi-billion page index runs 4-10 terabytes! Building a Full-on Crawler and search index is a daunting task; it reportedly took Microsoft a year and a half, $150MM and over 150 employees (including a Fields Medal winner) to build their v1.0 from scratch.
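To make the “following links” and “spider trap” ideas concrete, here is a minimal sketch of a breadth-first crawl loop. It is only an illustration, not how any production crawler works: `get_links` is a hypothetical callable standing in for fetching and parsing a page, `TOY_WEB` is a made-up four-page web, and the depth and page caps are just one simple way to keep a crawler from looping forever.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl that never revisits a URL and stops at a depth cap.

    get_links is a hypothetical callable (url -> list of outgoing URLs);
    a real crawler would fetch the page and parse its links here.
    """
    seen = {start_url}                 # URLs already queued or visited
    queue = deque([(start_url, 0)])    # (url, depth from the start page)
    order = []                         # pages in the order they were visited
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                   # depth cap: do not expand links any deeper
        for link in get_links(url):
            if link not in seen:       # the seen-set breaks simple link loops
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A toy "web" with loops: a -> b -> a, c -> a, and so on.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "d"],
    "d": [],
}

pages = crawl("a", lambda url: TOY_WEB.get(url, []))
print(pages)
```

The seen-set handles pages that link back to each other, while the depth cap handles the nastier kind of trap where a site keeps generating brand-new URLs on every page.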
Local Crawlers: The baby brothers of the Full-on Crawlers, these systems crawl a limited number of pre-designated sites (often selected by human editors, or from a categorized list such as ODP). The crawler will follow links within each site, but is not allowed to “leave the yard” and wander out into the wilderness of the whole web. These are very popular for building specialty or vertical search engines.
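The “leave the yard” rule boils down to a host check on every link before it goes into the crawl queue. A rough sketch, assuming the pre-designated sites are kept as a set of hostnames (the hostnames and the `in_yard` helper below are made up for illustration):

```python
from urllib.parse import urljoin, urlparse

# Hypothetical editor-selected hosts; a real list might come from a directory.
ALLOWED_HOSTS = {"www.example-cars.test", "reviews.example-autos.test"}

def in_yard(page_url, href, allowed_hosts=ALLOWED_HOSTS):
    """Resolve href against the page it was found on.

    Returns the absolute URL if its host is on the allowed list, else None.
    """
    absolute = urljoin(page_url, href)       # handles relative links like "../x"
    host = urlparse(absolute).hostname       # None for things like mailto: links
    return absolute if host in allowed_hosts else None

page = "https://www.example-cars.test/sedans/"
print(in_yard(page, "../trucks/f150.html"))             # stays on the same site
print(in_yard(page, "https://random-blog.test/post"))   # outside the yard
```

Only links that come back non-`None` would be queued, which is what keeps the index small and the spam out.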
Advantages: They are easy to construct. If you want a specialty automotive search, just identify a few dozen good car sites (e.g. Edmunds, Kelley Blue Book, Car and Driver, etc.) and crawl those using some generally available open source code. Voilà, you have it. No spam problems (since you’re only indexing trusted sites), and an index small enough to fit on one server. Relevancy algorithms are also much more direct. Complex systems like Google’s PageRank can’t be used (you don’t have a full picture of the web), but really aren’t necessary since very basic systems like keyword frequency work well enough and all the pages are relatively high quality anyway.
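For a sense of how basic “keyword frequency” ranking can be, here is one possible sketch: score each page by the share of its words that match the query, then sort. The scoring formula, the function names, and the sample pages are all assumptions for illustration; real engines add plenty on top of this.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_score(query, text):
    """Fraction of the page's words that are query terms (0.0 for empty pages)."""
    words = Counter(tokenize(text))
    total = sum(words.values())
    if total == 0:
        return 0.0
    return sum(words[term] for term in tokenize(query)) / total

def rank(query, pages):
    """pages maps url -> page text; returns urls, best score first."""
    return sorted(pages, key=lambda url: keyword_score(query, pages[url]), reverse=True)

# Made-up pages from a hypothetical automotive index.
PAGES = {
    "a": "hybrid sedan review hybrid mileage",
    "b": "truck towing review",
    "c": "sedan pricing and sedan options guide",
}

print(rank("hybrid sedan", PAGES))
```

This only works because every page in the index is already trusted; on the open web, a spammer could win by repeating the keyword a thousand times.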
Challenges: Local crawlers’ limited scope of discovery means that lots of good information is overlooked. The automotive search in our example above would probably miss sources like newspaper reviews and owner comments from blogs, not to mention merchants selling custom car parts. The relative ease of creating a Local Crawler also means that the space gets very crowded very quickly, and getting other sites to use your technology is difficult, since they can build it themselves.
Scrapers: Scrapers don’t actually index sites the way that major search engines do. Search engines store pages as text, while scrapers attempt to recognize and grab only certain “structured” information to put into their databases.
Advantages: Scrapers convert web pages into structured data, so it’s easy to compare information from across multiple sites and view it in a very organized format. Some comparison shopping engines also use scrapers to get data from online stores who can’t or won’t build their own data feeds.
Challenges: Right now there are two types of Scrapers: Template scrapers and FreeForm scrapers. Template scrapers rely on humans to “tell” them where a site places data on its pages. This makes them very accurate, but the need to create a template for each site means they typically scrape only a small number of fairly large sites. Also, if a site makes changes to its design this may “break” the scraper until the template is updated. FreeForm scrapers on the other hand rely on embedded logic to identify the meaningful data on pages. This lets them grab information from a lot more sites, but the accuracy may suffer. It’s interesting to note that Froogle started out as a FreeForm scraper but abandoned that approach in favor of having merchants send it structured data feeds.
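The trade-off between the two scraper types is easy to see in a toy example. The sketch below is hypothetical throughout: the HTML snippets, the per-site `TEMPLATE`, and the `PRICE_GUESS` heuristic are invented, and regular expressions stand in for whatever page-parsing logic a real scraper would use.

```python
import re

# Template scraper: a human looked at one site's layout and wrote these patterns.
TEMPLATE = {
    "name": re.compile(r'<h1 class="product">(.*?)</h1>'),
    "price": re.compile(r'<span id="price">\$([\d,]+\.\d{2})</span>'),
}

def template_scrape(html, template=TEMPLATE):
    """Pull each field from the exact spot the template says it lives in."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1) if match else None
    return record

# FreeForm scraper: no site knowledge, just a guess at what a price looks like.
PRICE_GUESS = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

def freeform_scrape(html):
    """Strip the tags and grab the first thing that looks like a dollar amount."""
    text = re.sub(r"<[^>]+>", " ", html)
    match = PRICE_GUESS.search(text)
    return {"price": match.group(1) if match else None}

OLD_PAGE = '<h1 class="product">Floor Mats</h1><span id="price">$49.99</span>'
NEW_PAGE = '<h2 class="title">Floor Mats</h2><div class="cost">Was $59.99, now $49.99</div>'

print(template_scrape(OLD_PAGE))   # accurate while the layout matches
print(template_scrape(NEW_PAGE))   # the redesign "breaks" the template
print(freeform_scrape(NEW_PAGE))   # still finds a price, but the wrong one
```

After the redesign the template scraper returns nothing at all, while the FreeForm guess keeps working but picks up the crossed-out price, which is the accuracy problem in miniature.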
MetaSearch: These are similar to Template scrapers, but instead of building an index they ping a very limited number of sites (typically fewer than a dozen) in real time and aggregate the results. Search sites like Dogpile and Surfwax are good examples of this technology in action.
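In code, the “ping and aggregate” step might look something like the sketch below. The two engine functions are stubs that return canned results (a real metasearch would make network requests to the underlying sites), and the interleave-and-dedupe merge is just one plausible way to aggregate, not a description of how Dogpile or anyone else does it.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest

# Stub "engines" returning ranked URLs; stand-ins for real-time requests.
def engine_a(query):
    return ["https://a.test/1", "https://shared.test/x"]

def engine_b(query):
    return ["https://shared.test/x", "https://b.test/2", "https://b.test/3"]

def metasearch(query, engines):
    """Query every engine at once, then interleave results and drop duplicates."""
    if not engines:
        return []
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map keeps the result lists in the same order as the engines.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    # Take every engine's #1 result, then every #2, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("cheap flights", [engine_a, engine_b]))
```

Notice that nothing is stored between queries, which is why the hardware requirements are so modest.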
Advantages: You can easily look at the data from a lot of different sites all at once. Back in the day (circa 2000, and before they considered Overture and AdSense to be “search engines”), Dogpile was a great way to see how results on Yahoo!, Google, AltaVista, etc. stacked up. They’re also very easy to build, and require very little in the way of hardware since no information is actually stored.
Challenges: Understandably, many sites don’t like it when you hit their servers, take their data, and then monetize it for yourself (go figure). Scrapers and Crawler-based engines almost always send traffic back to the sites that they get the information from; many metasearches do not. Since the systems are easy to build, there also tend to be a lot of them. Sidestep and Mobissimo are just two of almost half a dozen travel metasearch startups from the Bay Area alone. Of all the methods listed here, metasearch is probably the least impressive technologically.
So that’s a quick overview of the major web search technologies out there these days. Each has its niche, but some feature a daunting level of technology (remember Google’s old boast of having 60+ PhDs?), while others (like a simple metasearch) can be put together by two guys in a basement over a weekend. The next time a site talks a big game about their special um… (looking around my desk I spot an errant paperclip) “ClipRank” technology, it pays to ask how/if they crawl the web and how they select and rank sites. Some have the secret sauce, while some just re-label Thousand Island dressing.
