Guest Commentary – Vertical Web Search: One Concept, Lots of Approaches

The views expressed in Guest Commentary pieces do not necessarily reflect those of ComparisonEngines.com.

People throw around a lot of fancy techno-babble when introducing me to their vertical search engines. This is my attempt to develop a common foundation for understanding crawlers vs. scrapers vs. whatever else is out there. I asked Jon Glick to write up his opinions for me. Jon Glick is the Sr. Director of Product Search at Become.com. He has previously managed search products for AltaVista and Yahoo!, and is an unabashed fan of full-on crawling.

Please add your own comments below…would love to get a real discussion going on this.

It’s no secret that search is hot right now, and lots of vertical search engines are entering the fray. Most claim to search the web, and many claim to have an elusive “secret sauce” leading to superior performance. So what types of search technologies are out there and how unique/useful are they?

Full-on Web Crawlers: This is the rarest breed: companies going out and building a fully functional web crawler akin to Googlebot, MSNBot, or Yahoo! Slurp. These crawlers explore the vastness of the web, following links from page to page and site to site, hoovering up billions of pages.

Advantages: Full-on crawlers are unmatched for comprehensiveness. For vertical searches they can often crawl deeper, and have a larger number of relevant pages than general search engines like Google and Yahoo!. Their technology and machine learning systems tend to be highly automated and easily scale to encompass additional information, or serve additional markets.

Challenges: Because they can find tons of content, these crawlers have to be paired with very advanced relevancy algorithms. The bigger the haystack you create, the harder it is to find the needle. In exploring the whole web they also tend to find lots of spam, porn, and “spider traps” (link structures that cause crawlers to get caught in endless loops), so advanced handling of harmful content is essential. Big indexes also need dozens of servers per instance, as your typical multi-billion page index runs 4-10 terabytes! Building a Full-on Crawler and search index is a daunting task; it reportedly took Microsoft a year and a half, $150MM and over 150 employees (including a Fields Medal winner) to build their v1.0 from scratch.
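To make the link-following and “spider trap” ideas concrete, here is a minimal sketch in Python. It is not how any of the engines named above actually work. It assumes a hypothetical fetch_links function that returns the links found on a page, and it uses a toy in-memory “web” in place of real downloads. The depth limit and per-host page cap are just two simple guards against endless loops; production crawlers need far more than this.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, fetch_links, max_depth=3, max_pages_per_host=1000):
    """Breadth-first crawl that follows links from page to page.

    Three simple guards keep it out of endless loops:
    a seen-set, a depth limit, and a cap on pages taken from any one host.
    """
    seen = set(seeds)
    pages_per_host = {}
    queue = deque((url, 0) for url in seeds)
    crawled = []
    while queue:
        url, depth = queue.popleft()
        host = urlparse(url).netloc
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue  # this host has used up its budget (possible trap)
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        crawled.append(url)
        if depth >= max_depth:
            continue  # record the page but do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled


def toy_fetch_links(url):
    """Hypothetical stand-in for downloading a page and extracting links."""
    web = {
        "http://a.example/": ["http://b.example/", "http://trap.example/0"],
        "http://b.example/": ["http://a.example/"],  # simple two-page loop
    }
    if url.startswith("http://trap.example/"):
        # A spider trap: every page links to a brand-new page, forever.
        n = int(url.rsplit("/", 1)[1])
        return [f"http://trap.example/{n + 1}"]
    return web.get(url, [])


pages = crawl(["http://a.example/"], toy_fetch_links,
              max_depth=10, max_pages_per_host=3)
```

In this toy run the seen-set handles the a/b loop, while the per-host cap is what stops the crawler from walking the trap site indefinitely.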

Local Crawlers: The baby brothers of the Full-on Crawlers, these systems crawl a limited number of pre-designated sites (often selected by human editors, or from a categorized list such as ODP). The crawler will follow links within each site, but is not allowed to “leave the yard” and wander out into the wilderness of the whole web. These are very popular for building specialty or vertical search engines.
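The “leave the yard” rule boils down to a filter on which links the crawler is allowed to follow. The sketch below is one simple way to express it in Python, not a description of any particular product. The host names in ALLOWED_HOSTS are made-up placeholders for an editor-chosen site list.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical editor-selected site list; a real one would name actual sites.
ALLOWED_HOSTS = {"cars.example.com", "reviews.example.net"}


def links_to_follow(page_url, hrefs, allowed_hosts=ALLOWED_HOSTS):
    """Keep only the links that stay inside the pre-designated sites."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.hostname in allowed_hosts:
            kept.append(absolute)
    return kept


followed = links_to_follow(
    "http://cars.example.com/reviews/",
    ["/used/", "http://blog.example.org/post", "mailto:x@example.org", "sedan.html"],
)
```

Here the two on-site links are kept and the outside blog link and the mailto link are dropped, which is also exactly why this kind of crawler never discovers new sources on its own.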

Advantages: They are easy to construct. If you want a specialty automotive search, just identify a few dozen good car sites (e.g. Edmunds, Kelley Blue Book, Car and Driver, etc.) and crawl those using some generally available open source code. Voilà, you have it. No spam problems (since you’re only indexing trusted sites), and an index small enough to fit on one server. Relevancy algorithms are also much simpler. Complex systems like Google’s PageRank can’t be used (you don’t have a full picture of the web), but really aren’t necessary since very basic systems like keyword frequency work well enough and all the pages are relatively high quality anyway.
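For a sense of how basic “keyword frequency” ranking can be, here is a rough Python sketch. It is only one of many possible scoring formulas (query-term occurrences divided by page length), and the page contents are invented for illustration.

```python
import re
from collections import Counter


def keyword_frequency_score(query, text):
    """Share of the page's words that are query terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return sum(counts[t] for t in terms) / len(words)


def rank(query, pages):
    """Return URLs of matching pages, best score first."""
    scored = [(keyword_frequency_score(query, text), url)
              for url, text in pages.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for score, url in scored if score > 0]


# Invented page contents standing in for a small crawled index.
toy_index = {
    "a": "brake pads and brake rotors",
    "b": "engine oil change",
    "c": "brake fluid",
}
ranked = rank("brake", toy_index)
```

A formula this naive is trivially gamed by keyword stuffing on the open web, which is why it is only reasonable when every page comes from a trusted site.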

Challenges: Local crawlers’ limited scope of discovery means that lots of good information is overlooked. The automotive search in our example above would probably miss sources like newspaper reviews and owner comments from blogs, not to mention merchants selling custom car parts. The relative ease of creating a Local Crawler also means that the space gets very crowded very quickly, and getting other sites to use your technology is difficult, since they can build it themselves.


Scrapers: Scrapers don’t actually index sites the way that major search engines do. Search engines store pages as text, while scrapers attempt to recognize and grab only certain “structured” information to put into their databases.

Advantages: Scrapers convert web pages into structured data, so it’s easy to compare information from across multiple sites and view it in a very organized format. Some comparison shopping engines also use scrapers to get data from online stores that can’t or won’t build their own data feeds.

Challenges: Right now there are two types of Scrapers: Template scrapers and FreeForm scrapers. Template scrapers rely on humans to “tell” them where a site places data on its pages. This makes them very accurate, but the need to create a template for each site means they typically scrape only a small number of fairly large sites. Also, if a site makes changes to its design this may “break” the scraper until the template is updated. FreeForm scrapers, on the other hand, rely on embedded logic to identify the meaningful data on pages. This lets them grab information from a lot more sites, but the accuracy may suffer. It’s interesting to note that Froogle started out as a FreeForm scraper but abandoned that approach in favor of having merchants send it structured data feeds.
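The difference between the two scraper types is easier to see in code. The following Python sketch is purely illustrative: the site name, the HTML snippets, and the regular expressions are all made up, and real scrapers use proper HTML parsing and much richer logic. The template version knows exactly where one site puts its data; the FreeForm version just looks for anything shaped like a price.

```python
import re

# Hand-written template for one hypothetical site: field name -> pattern.
TEMPLATES = {
    "shop.example.com": {
        "title": re.compile(r'<h1 class="product">(.*?)</h1>'),
        "price": re.compile(r'<span class="price">\$([\d.]+)</span>'),
    },
}


def template_scrape(site, html):
    """Extract fields using the human-built template for this site."""
    record = {}
    for field, pattern in TEMPLATES[site].items():
        match = pattern.search(html)
        record[field] = match.group(1) if match else None
    return record


def freeform_prices(html):
    """Site-independent guess: anything that looks like a dollar amount."""
    return re.findall(r"\$(\d+(?:\.\d{2})?)", html)


original = ('<h1 class="product">Floor Mats</h1>'
            '<span class="price">$24.99</span> Shipping from $5.00')
redesigned = ('<h1 class="product">Floor Mats</h1>'
              '<div class="cost">$24.99</div> Shipping from $5.00')

before = template_scrape("shop.example.com", original)
after = template_scrape("shop.example.com", redesigned)
guesses = freeform_prices(redesigned)
```

On the original page the template returns the right title and price. After the made-up redesign the price comes back as None until someone updates the template, while the FreeForm guess still works on any site but cannot tell the product price from the shipping charge.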

MetaSearch: These are similar to Template scrapers, but instead of building an index they ping a very limited number of sites (typically fewer than a dozen) in real time and aggregate the results. Search sites like Dogpile and Surfwax are good examples of this technology in action.
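As a rough illustration of “ping a few sites and aggregate,” here is a small Python sketch. The two engine functions are hypothetical stand-ins that return canned results instead of making live requests, and the round-robin merge is just one simple aggregation choice, not a description of how Dogpile or Surfwax combine results.

```python
from itertools import zip_longest


def metasearch(query, engines, limit=10):
    """Query each source at request time and interleave the results.

    Nothing is stored between queries; duplicates are dropped so a URL
    returned by several sources appears only once.
    """
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    # Take every source's #1 result, then every #2 result, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


def engine_a(query):
    """Hypothetical source; a real one would call out to a remote site."""
    return ["http://one.example/", "http://two.example/", "http://three.example/"]


def engine_b(query):
    """Second hypothetical source with one overlapping result."""
    return ["http://two.example/", "http://four.example/"]


combined = metasearch("cheap flights", [engine_a, engine_b])
```

Note that the whole thing fits in a couple of dozen lines and keeps no index at all, which is both the appeal and the limitation of the approach.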

Advantages: You can look at the data from a lot of different sites easily and at once. Back in the day (circa 2000, before they began treating Overture and AdSense as “search engines”), Dogpile was a great way to see how results on Yahoo!, Google, AltaVista, etc. stacked up. They’re also very easy to build, and require very little in the way of hardware since no information is actually stored.

Challenges: Understandably, many sites don’t like it when you hit their servers, take their data, and then monetize it for yourself (go figure). Scrapers and Crawler-based engines almost always send traffic back to the sites that they get the information from; many metasearches do not. Since the systems are easy to build, there also tend to be a lot of them. Sidestep and Mobissimo are just two of almost half a dozen travel metasearch startups from the Bay Area alone. Of all the methods listed here, metasearch is probably the least impressive technologically.

So that’s a quick overview of the major web search technologies out there these days. Each has its niche, but some feature a daunting level of technology (remember Google’s old boast of having 60+ PhDs?), while others (like a simple metasearch) can be put together by two guys in a basement over a weekend. The next time a site talks a big game about their special um…(looking around my desk I spot an errant paperclip) “ClipRank” technology, it pays to ask how/if they crawl the web and how they select and rank sites. Some have the secret sauce, while some just re-label Thousand Island dressing.


Siva Kumar said

Hi Jon,

Thanks for taking the time to help educate the comparison engines audience on Vertical Web Searches. Thank you also for mentioning FatLens in your post. However, I’d like to point out that your assumption about our technology is incorrect. FatLens is not an example of a scraper.

According to your classification scheme, we would best be described as a crawler aimed at creating a vertical search solution for shopping.

Siva Kumar

PS. The post was very illustrative of how one would view text or Web page search technologies. Shopping as a vertical search area could use a bit more discussion on alternative technology approaches.


Todd Wilson said

Great post. The “Advantages” and “Challenges” sections are especially helpful. In the end, a lot of it comes down to trade-offs. When a company undertakes a project like this, certain questions will inform the approach:

- How accurate and comprehensive does the data need to be?
- What technical (human) resources are available? If you have access to 60+ PhDs you’re obviously in a different position than if you have only a B.S. in C.S. fresh out of college.
- How many sites will need to be crawled/scraped/extracted?
- What’s the timeline? Sophisticated crawlers can take significantly longer to build than straightforward template-based extraction engines.

I actually wrote a blog entry not too long ago entitled “Three common methods for data extraction” that may be helpful to some: http://blog.screen-scraper.com/2006/03/21/three-common-methods-for-data-extraction/. It’s a bit more general, and not necessarily targeted to comparison engines. It does, however, also delineate advantages and disadvantages to various approaches.


Randy McClure said

Thank you for the posting on vertical web search. What about social search engines like Eurekster? I just started using this type of collaborative search technology where user input influences the search results. Seems like this type of search technology has a lot of potential.


Jon G. said

Siva,

I chose to classify FatLens as a scraper based upon the way that the system stores and presents information. FatLens is extracting select, structured information from web pages so that users can do apples-to-apples comparisons of different ticket offerings. This is quite different from the full page indexing that a general search engine like Google does.
To be clear, FatLens does do a web crawl (http://fatlens.com/main/fatbot.php) to find the ticket pages it extracts data from (most scrapers do some form of web crawling). If you could provide any additional insight into how FatBot operates (# of sites crawled, how the sites are selected, etc.) that would be very helpful.


Jon G. said

Randy,

I left Eurekster out since their focus is on a relevancy algorithm. When you have a full-on crawl (or use someone else’s) the billions of pages mean that you need a very good relevancy algorithm. Google has PageRank; at Become.com we use AIR technology to retrieve relevant results from our 3.2B indexed pages; Eurekster is looking at user feedback to get the best results.
For years at AltaVista and Yahoo! we looked at user feedback (specifically using CTR on web results) to improve relevancy, but there were always concerns about click fraud. The current social search guys kinda have a catch-22: if they become large and successful, they create a huge incentive for spammers to try to add bots to the community to skew the results. One way around this may be to have different social networks around each of the sites, so no one site has enough traffic to be worth trying to distort. This seems to be the promising approach that Eurekster is taking.