In the early 2000s, more than 1,000 different search engines were in existence, although most webmasters focused their efforts on getting good placement in the leading 10. This, however, was easier said than done. InfoWorld explained that the process was more art than science, requiring continuous adjustments and tweaking, along with regular submission of pages to different engines, to achieve good or excellent results (Hock 2004: 30-33).
The reason for this is that every search engine works differently. Not only are there different types of search engines—those that use spiders to obtain results, directory-based engines, and link-based engines—but engines within each category are unique. They each have different rules and procedures companies need to follow in order to register their site with the engine.
Crawler-based search engines, such as Google, create their listings automatically. They “crawl” or “spider” the web, then people search through what they have found. If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.
System Anatomy of Google: This is a short overview of how the whole system works as pictured in Figure 1. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux. In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of “barrels”, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link (Bradley 2004: 47-52).
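The conversion of a document into hits can be illustrated with a minimal sketch. It is not Google's indexer: the function name extract_hits, the regular expression used to split words, and the in-memory lexicon dictionary are all assumptions made for illustration, and font size is left out because the input here is plain text rather than HTML.

```python
import re

def extract_hits(text, lexicon):
    """Turn plain text into (wordID, position, capitalized) hits.

    lexicon maps a lower-cased word to its wordID and is extended
    in place whenever a new word is seen (an illustrative choice).
    """
    hits = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        word = match.group()
        word_id = lexicon.setdefault(word.lower(), len(lexicon))
        hits.append((word_id, position, word[0].isupper()))
    return hits

lexicon = {}
hits = extract_hits("Search engines search", lexicon)
print(hits)      # both spellings of "search" share one wordID
print(lexicon)
```

Each hit keeps the word's position so that a later ranking step can tell whether two query words occur close together.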
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
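The work of the URLresolver can be sketched in a few lines. This is only an illustration under stated assumptions: the anchors file is modelled as a list of (source URL, href, anchor text) tuples, docIDs are handed out from a plain dictionary, and the function name resolve_links is hypothetical. The standard library call urljoin does the conversion from relative to absolute URLs.

```python
from urllib.parse import urljoin

def resolve_links(anchors, doc_ids):
    """Convert anchor records into docID pairs plus anchor text per target.

    anchors: iterable of (source_url, href, anchor_text) tuples.
    doc_ids: dict mapping absolute URL -> docID, extended in place.
    """
    links = []        # (from_docID, to_docID) pairs: the links database
    anchor_text = {}  # to_docID -> anchor texts of links pointing at it
    for source_url, href, text in anchors:
        absolute = urljoin(source_url, href)  # relative -> absolute URL
        # A URL receives a new docID the first time it is seen.
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(absolute, len(doc_ids))
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text

anchors = [
    ("http://example.com/a/index.html", "../b.html", "page b"),
    ("http://example.com/b.html", "/a/index.html", "home"),
]
links, anchor_text = resolve_links(anchors, {})
print(links)
print(anchor_text)
```

Note that the anchor text is filed under the page the link points to, not the page it appears on, which matches the description above.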
Repository: The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib’s speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib’s 3 to 1 compression.
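The tradeoff between the two compressors can be tried out with Python's standard zlib and bz2 modules. The sample page below is made up and highly repetitive, so the ratios it produces say nothing about the 3 to 1 and 4 to 1 figures quoted above; the helper name compression_ratio is an assumption for illustration.

```python
import bz2
import zlib

def compression_ratio(data, compress):
    """Return the original size divided by the compressed size."""
    return len(data) / len(compress(data))

# A made-up, highly repetitive page; real pages compress far less well.
page = ("<html><body>" + "<p>search engines crawl the web</p>" * 200
        + "</body></html>").encode("utf-8")

# Round trip: what is stored in the repository must decompress unchanged.
packed = zlib.compress(page)
assert zlib.decompress(packed) == page

print("zlib ratio:", round(compression_ratio(page, zlib.compress), 1))
print("bz2 ratio: ", round(compression_ratio(page, bz2.compress), 1))
```

Timing the two calls on a realistic sample of pages would be the natural next step when weighing speed against ratio.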
Document Index: The document index keeps information about each document. It is a fixed width ISAM (indexed sequential access method) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.
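The benefit of fixed width entries is that the position of a record follows directly from its docID, so a single seek is enough. The sketch below assumes a hypothetical record layout (one status byte, an 8-byte repository offset, a 4-byte checksum and an 8-byte pointer); the real field sizes are not given in the text, and an in-memory buffer stands in for the index file.

```python
import io
import struct

# Hypothetical layout: status (1 byte), repository offset (8 bytes),
# checksum (4 bytes), docinfo/URLlist pointer (8 bytes).
# "<" means little-endian with no padding, so every record is 21 bytes.
RECORD = struct.Struct("<BQIQ")

def write_entry(f, doc_id, status, repo_offset, checksum, info_ptr):
    """Store the entry for doc_id at offset doc_id * record size."""
    f.seek(doc_id * RECORD.size)
    f.write(RECORD.pack(status, repo_offset, checksum, info_ptr))

def read_entry(f, doc_id):
    """Fetch one entry with a single seek followed by a single read."""
    f.seek(doc_id * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))

index = io.BytesIO()  # stands in for the on-disk index file
write_entry(index, 0, 1, 0, 0x1234ABCD, 0)
write_entry(index, 1, 1, 4096, 0xDEADBEEF, 128)
write_entry(index, 2, 0, 0, 0, 64)  # status 0: not crawled yet

print(read_entry(index, 1))
```

A variable width record would need either a scan or a second lookup table to find entry number 1, which is the cost the fixed width design avoids.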
Hit Lists: A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization — simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding.
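A compact encoding of the kind mentioned above can be sketched with simple bit operations. The field widths used here (1 bit for capitalization, 3 bits for font size, 12 bits for position, giving two bytes per hit) are an assumption for illustration, as are the function names pack_hit and unpack_hit; positions too large for 12 bits are clamped to the maximum value.

```python
MAX_POSITION = 0xFFF  # largest value that fits in 12 bits

def pack_hit(capitalized, font_size, position):
    """Pack one hit into a 16-bit integer: 1 + 3 + 12 bits (assumed layout)."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # clamp instead of overflowing
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Reverse pack_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & MAX_POSITION

packed = pack_hit(True, 3, 42)
print(hex(packed), unpack_hit(packed))
```

Compared with a triple of ordinary integers, this layout stores the same three fields in two bytes, which matters because hit lists dominate the size of both indices.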
Forward Index: The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID’s. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID’s with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter.
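The barrel scheme can be sketched as follows. The example uses 4 barrels instead of 64 to stay readable, assumes that each barrel covers a range of 100 wordIDs, and uses hypothetical names (barrel_for, add_document); none of these details come from the text.

```python
from collections import defaultdict

NUM_BARRELS = 4         # the text mentions 64; 4 keeps the example small
WORDS_PER_BARREL = 100  # assumed width of each barrel's wordID range

def barrel_for(word_id):
    """Map a wordID to the barrel whose range contains it."""
    return min(word_id // WORDS_PER_BARREL, NUM_BARRELS - 1)

def add_document(barrels, doc_id, hits_by_word):
    """Append one document to every barrel that holds one of its words.

    hits_by_word maps wordID -> hit list for this document.
    """
    per_barrel = defaultdict(list)
    for word_id, hits in sorted(hits_by_word.items()):
        per_barrel[barrel_for(word_id)].append((word_id, hits))
    for b, entries in per_barrel.items():
        # The docID is repeated once per barrel the document touches.
        barrels[b].append((doc_id, entries))

barrels = [[] for _ in range(NUM_BARRELS)]
add_document(barrels, 7, {5: [1, 9], 120: [3], 130: [4]})
print(barrels)
```

Document 7 appears in two barrels here, which is the duplicated docID storage the paragraph refers to.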
Inverted Index: The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID’s together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
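The step from a forward barrel to an inverted one can be illustrated by sorting the postings by wordID. In this sketch a forward barrel is assumed to be a list of (docID, [(wordID, hits), ...]) entries, the lexicon "pointer" is modelled as a (start, count) pair into a flat list, and the names sort_barrel and doclist are hypothetical.

```python
def sort_barrel(forward_barrel):
    """Re-sort a forward barrel by wordID and build lexicon pointers.

    Returns (lexicon, inverted): inverted is a flat list of
    (docID, hits) grouped by wordID, and lexicon maps
    wordID -> (start, count) within that list.
    """
    postings = []
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            postings.append((word_id, doc_id, hits))
    postings.sort(key=lambda p: (p[0], p[1]))  # by wordID, then docID

    inverted, lexicon = [], {}
    for word_id, doc_id, hits in postings:
        start, count = lexicon.get(word_id, (len(inverted), 0))
        lexicon[word_id] = (start, count + 1)
        inverted.append((doc_id, hits))
    return lexicon, inverted

def doclist(lexicon, inverted, word_id):
    """Return every (docID, hits) pair recorded for word_id."""
    start, count = lexicon.get(word_id, (0, 0))
    return inverted[start:start + count]

forward = [(2, [(10, [1]), (11, [4])]), (1, [(10, [7, 8])])]
lexicon, inverted = sort_barrel(forward)
print(doclist(lexicon, inverted, 10))
```

After sorting, answering a query for one word is a lexicon lookup followed by one contiguous read.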
Crawling the Web: Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system. In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace (Langville 2006: 22-34).
Searching: The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort.
The Ranking System: Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. First, consider the simplest case — a single word query. In order to rank a document with a single word query, Google looks at that document’s hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, …), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together.
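The two cases can be made concrete with a small sketch. Everything numeric in it is invented for illustration: the type-weights, the cap on how many hits of one type are counted, and the 1/(1 + gap) proximity formula are assumptions, not Google's values, and the function names are hypothetical.

```python
from collections import Counter

# Made-up type-weights; the real values are not given in the text.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large": 3.0, "small": 1.0}
COUNT_CAP = 5  # assumed cap so that repeating a word stops paying off

def single_word_score(hit_types):
    """Count hits per type and combine the counts with the type-weights."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * min(n, COUNT_CAP) for t, n in counts.items())

def proximity_score(positions_a, positions_b):
    """Score two words higher the closer their nearest occurrences are."""
    if not positions_a or not positions_b:
        return 0.0
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / (1 + gap)

print(single_word_score(["title", "small", "small"]))
print(proximity_score([3, 40], [5, 90]))
```

In this toy version a title hit outweighs several plain-text hits, and two query words two positions apart score higher than the same words fifty positions apart.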
Feedback: The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results (Brin 2010: 3-7).
Yahoo! was founded in 1994 by David Filo and Jerry Yang as a directory of websites. For many years they outsourced their search service to other providers, but by the end of 2002 they realized the importance and value of search and started aggressively acquiring search companies. Overture purchased AllTheWeb and AltaVista. Yahoo! purchased Inktomi (in December 2002) and then consumed Overture (in July of 2003), and combined the technologies from the various search companies they bought to make a new search engine. Yahoo! dumped Google in favor of their own in house technology on February 17th, 2004.
On Page Content: Yahoo! offers a paid inclusion program, so when Yahoo! Search users click on high ranked paid inclusion results in the organic search results Yahoo! profits. Being the #1 content destination site on the web, Yahoo! has a boatload of their own content which they frequently reference in the search results. Since they have so much of their own content and make money from some commercial organic search results it might make sense for them to bias their search results a bit toward commercial websites. Using descriptive page titles and page content goes a long way in Yahoo!
Crawling: Yahoo! is pretty good at crawling sites deeply, so long as they have sufficient link popularity to get all their pages indexed. One note of caution is that Yahoo! may not want to deeply index sites with many variables in the URL string, especially since Yahoo! already has a boatload of its own content that it would like to promote (including verticals like Yahoo! Shopping). Yahoo! also offers paid inclusion, which can increase its revenue by charging merchants to index some of their deep database contents.
Query Processing: Certain words in a search query are better at defining the goals of the searcher. If you search Yahoo! for something like “how to SEO ” many of the top ranked results will have “how to” and “SEO” in the page titles, which might indicate that Yahoo! puts quite a bit of weight even on common words that occur in the search query. Yahoo! seems to be more about text matching when compared to Google, which seems to be more about concept matching.
Link Reputation: Yahoo! is still fairly easy to manipulate using low to mid quality links and somewhat to aggressively focused anchor text. Rand Fishkin recently posted about many Technorati pages ranking well for their core terms in Yahoo!. Those pages primarily have the exact same anchor text in almost all of the links pointing at them. Sites with the trust score of Technorati may be able to get away with more unnatural patterns than most webmasters can, but I have seen sites blasted with poorly mixed anchor text on low quality links, only to see them rank pretty well in Yahoo! soon afterwards (Bradley 2004: 47-52).
Being the largest content site on the web makes Yahoo! run into some inefficiency issues due to being a large internal customer. For example, Yahoo! Shopping was a large link buyer for a period of time while Yahoo! Search pushed that they didn’t agree with link buying. Offering paid inclusion and having so much internal content makes it make sense for Yahoo! to have a somewhat commercial bias to their search results. They believe strongly in the human and social aspects of search, pushing products like Yahoo! Answers and My Yahoo!.
MSN Search had many incarnations, being powered by the likes of Inktomi and Looksmart for a number of years. After Yahoo! bought Inktomi and Overture it was obvious to Microsoft that they needed to develop their own search product. They launched a technology preview of their search engine around July 1st of 2004. They formally switched from Yahoo! organic search results to their own in house technology on January 31st, 2005. MSN announced they dumped Yahoo!’s search ad program on May 4th, 2006.
On Page Content: Using descriptive page titles and page content goes a long way to help you rank in MSN. I have seen examples of many domains that ranked for things like state name+ insurance type + insurance on sites that were not very authoritative which only had a few instances of state name and insurance as the anchor text. Adding the word health, life, etc. to the page title made the site relevant for those types of insurance, in spite of the site having few authoritative links and no relevant anchor text for those specific niches.
Crawling: MSN has got better at crawling, but I still think Yahoo! and Google are much better at crawling. It is best to avoid session IDs, sending bots cookies, or using many variables in the URL strings. MSN is nowhere near as comprehensive as Yahoo! or Google at crawling deeply through large sites like eBay.com or Amazon.com. All major search engines have internal relevancy measurement teams. MSN seems to be highly lacking in this department, or they are trying to use the fact that their search results are spammy as a marketing angle. MSN is running many promotional campaigns to try to get people to try out MSN Search, and in many cases some of the searches they are sending people to have bogus spam or pornography type results in them. Based on MSN’s lack of feedback or concern toward the obvious search spam noted above on a popular search marketing community site I think MSN is trying to automate much of their spam detection, but it is not a topic you see people talk about very often.
The Different Types of Search Engines
Although the term “search engine” is often used indiscriminately to describe crawler-based search engines, human-powered directories, and everything in between, they are not all the same. Each type of “search engine” gathers and ranks listings in radically different ways. Most people find what they’re looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. According to InformationWeek, searching for information with search engines was the second most popular Internet activity in the early 2000s, behind only checking e-mail. Because of this, companies develop and implement strategies to make sure people are able to consistently find their sites during a search. These strategies are often included in a much broader Web site or Internet marketing plan. Different companies have different objectives, but the main goal is to obtain good placement in search results.
Crawler (Spider) – Based Engines
Crawler-based search engines, such as Google and Yahoo!, compile their listings automatically. They “crawl” or “spider” the web, and people search through their listings. These listings are what make up the search engine’s index or catalogue. You can think of the index as a massive electronic filing cabinet containing a copy of every web page the spider finds. Because spiders scour the web on a regular basis, any changes you make to a website, or links to or from your own website, may affect your search engine ranking. Although they usually aren’t visible to someone using a Web browser, meta tags are special codes that provide keywords or Web site descriptions to spiders. Keywords and how they are placed, either within actual Web site content or in meta tags, are very important to online marketers. The majority of consumers reach e-commerce sites through search engines, and the right keywords increase the odds a company’s site will be included in search results.
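What a spider sees in a page's meta tags can be shown with a short sketch built on Python's standard html.parser module. The class name MetaTagReader and the sample page are made up for illustration, and real spiders weigh far more than these two tags.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the title plus the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A made-up page: the meta tags are invisible in a browser window.
page = (
    "<html><head><title>Blue Widgets</title>"
    '<meta name="keywords" content="widgets, blue widgets">'
    '<meta name="description" content="Hand-made blue widgets.">'
    "</head><body><p>Hello</p></body></html>"
)

reader = MetaTagReader()
reader.feed(page)
print(reader.title)
print(reader.meta)
```

The visitor to this page only sees "Hello", while the spider also receives the title, keywords and description.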
Companies need to choose the keywords that describe their sites to spider-based search engines carefully, and continually monitor their effectiveness. Search engines often change their criteria for listing different sites, and keywords that cause a site to be listed first in a search one day may not work at all the next. Companies often monitor search engine results to see what keywords cause top listings in categories that are important to them. It is also important to remember that it may take a while for a spidered page to be added to the index. Until that happens, it is not available to those searching with a search engine. Search Engine Optimization (SEO) refers to making changes to a website so that it can attain higher search engine positions for specific keyphrases in organic results. Organic results refer to the regular search engine results displayed by crawler-based search engines; as opposed to sponsored results which are paid advertising.
Because spiders are unable to index pictures or read text that is contained within graphics, relying too heavily on such elements was a risk for online marketers. Home pages containing only a large graphic risked being passed by. A content description language called Extensible Markup Language (XML), similar in some respects to Hypertext Markup Language (HTML), was emerging in the early 2000s. An XML standard known as Synchronized Multimedia Integration Language (SMIL) was expected to allow spiders to recognize multimedia elements on Web sites, such as pictures and streaming video.
Human-Powered Directories
Directories such as Open Directory depend on human editors to compile their listings. Webmasters submit an address, title, and a brief description of their site, and then editors review the submission. Unless you sign up for a paid inclusion program, it may take months for your website to be reviewed. Even then, there’s no guarantee that your website will be accepted. After a website makes it into a directory however, it is generally very difficult to change its search engine ranking. So before you submit to a directory, spend some time working on your titles and descriptions or hire a professional to submit to directories for you.
Link-Based Search Engines
One other kind of search engine provides results based on hypertext links between sites. Rather than basing results on keywords or the preferences of human editors, sites are ranked based on the quality and quantity of other Web sites linked to them. In this case, links serve as referrals. The emergence of this kind of search engine called for companies to develop link-building strategies. By finding out which sites are listed in results for a certain product category in a link-based engine, a company could then contact the sites’ owners—assuming they aren’t competitors—and ask them for a link. This often involves reciprocal linking, where each company agrees to include links to the other’s site.
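The idea that links act as referrals, and that a referral from a well-linked site counts for more, can be sketched as a PageRank-style iteration. This is a simplified illustration rather than any engine's actual formula: the damping value of 0.85, the fixed number of iterations, the handling of pages without outgoing links, and the name link_rank are all assumptions.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Rank pages from (source, target) link pairs by repeated sharing.

    Each page splits its current rank evenly among the pages it links
    to; a page with no outgoing links spreads its rank over all pages.
    """
    pages = sorted({p for pair in links for p in pair})
    if not pages:
        return {}
    out_links = {p: [] for p in pages}
    for src, dst in links:
        out_links[src].append(dst)

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links[p] or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical sites: a and b both link to c, and c links back to a.
ranks = link_rank([("a", "c"), ("b", "c"), ("c", "a")])
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Site c comes out on top because two sites refer to it, and a outranks b because its single referral comes from the highly ranked c, which shows why both the quantity and the quality of links matter.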
Besides focusing on keywords, providing compelling content and monitoring links, online marketers rely on other ways of getting noticed. In late 2000, some used special software programs or third-party search engine specialists to maximize results for them. Search engine specialists handle the tedious, never ending tasks of staying current with the requirements of different search engines and tracking a company’s placement. This trend was expected to take off in the early 2000s, according to research from IDC and Netbooster, which found that 70 percent of site owners had plans to use a specialist by 2002. Additionally, some companies pay for special or enhanced listings in different search engines.
Major types of search engines
Keyword search engines: These are the best choice if you know what you’re looking for and can describe it with some key words or phrases. Google is always a good bet, since it has the largest index; Yahoo Search is the second most popular keyword search engine; Bing may provide results if the other two don’t work; Exalead is an excellent choice and makes a change from the big 3 (Kahaner 2000: 70-81).
Index or Directory based search engines: These search engines arrange data in hierarchies from broad to narrow. They are good if you need an overview of a subject or you’re not entirely sure of what you want. Yahoo Directory provides 14 main categories; Google Directory provides access to 16 main categories; virtual libraries from Pinakes let you drill down for the content or sites you need; The Open Directory Project provides access to 16 main categories.
Multi or Meta search engines: These search engines are useful if you need to run a comprehensive search quickly across a number of different engines, to compare results or to suggest search engines that you may not have tried before. The majority do a Google, Yahoo, MSN, Ask search (GYMA, or GYM search depending), but there are differences.
Examples: Browsys; Joongel 10; Scour GYM; Symbaloo.
Visual results search engines: Rather than a simple textual list of results some search engines will provide content in a visual format. This is great if you want a change, or to view results differently. These engines also appeal to students and children.
Category search engines: Some search engines will create categories for you to narrow or expand your search criteria. This is good if you don’t want to think, or need some help in areas that you don’t know that well.
Blended results: There are some search engines that will try and blend a variety of results onto one page for you – websites, news, video, images and so on. Good for an overall view of a subject area. Unfortunately there are not very many of them! Only Allplus and MSE360!
Using a simple technology (implanted by its parent company Microsoft, which supplies the operating system), MSN could be automatically present on hundreds of millions of PCs. In theory, this idea is available for sale to any entity that, for whatever reason, wishes to have it. In practice, only one entity has the required interest, scale and capacity to implement it: Microsoft. Microsoft’s search engine unit, MSN, is actually an excellent search engine and a substantial site in its own right. Yet MSN is quite often viewed as the internet venture of a company whose core identity is not synonymous with the internet. As a search engine, it is therefore not as iconic or as appreciated as Google or Yahoo. Nonetheless, there can be no doubt that those three (Google, Yahoo, MSN) constitute an elite group of search engine behemoths. So while MSN is not as internet-synonymous as the other two, it still occupies the same stratum.
The problem is that, because MSN is identified more with Microsoft than with the internet, it lags the other two in popularity. One of the worst things that could have happened to MSN has happened: Google and Yahoo have collaborated on a joint online ad deal. How does MSN respond? How does MSN attract internet users to use, favor and view it the way they use, favor and view Google and Yahoo?
MSN is roughly equal to Yahoo in terms of efficiency, substance and aesthetics, and roughly equal to Google in terms of substance and aesthetics. It is highly unlikely that MSN is going to discover or unveil any technology that puts it so far ahead of Google or Yahoo that their users start migrating en masse to MSN. In fact, loyalty, satisfaction and force of habit keep Google’s and Yahoo’s users rooted to those two search engines, and if the users of one of them become dissatisfied, they are more likely to migrate to the other than to a third engine. That is bad news for MSN. However, MSN’s parent company is Microsoft. Via MSN, Microsoft has a foothold on the internet, but by virtue of its operating systems (Windows, Vista, etc.) Microsoft also has non-internet access to millions of PCs. Assuming that Microsoft still plans to roll out millions, if not billions, of units of these operating systems for the foreseeable future, it can exploit its enormous PC access to MSN’s advantage.
Bradley, Phil, “The Advanced Internet Searcher’s Handbook”, Facet Publishing; 3rd Revised edition, USA, 2004.
Brin, Sergey, Page, Lawrence, “The Anatomy of a Large-Scale Hypertextual Web Search Engine” Computer Science Department, Stanford University, Stanford, CA, 2010.
Hock, Randolph, “The Extreme Searcher’s Internet Handbook: A Guide for the Serious Searcher”, CyberAge Books, 2004.
Kahaner, Larry, “Content Matters Most in Search Engine Placement”, InformationWeek, 2000.
Langville, Amy N., “Google’s Page Rank and Beyond: The Science of Search Engine Rankings”, Princeton University Press, Princeton, 2006.
McLuhan, Robert. “Search for a Top Ranking.” Marketing 47 – London department, London, 2000.
Retsky, Maxine. “Cyberstuffing—A Dangerous Strategy.” An international magasine “Marketing News”, January 3, 2000.
Schwartz, Matthew. “Search Engines.” Computerworld, May 8, 2000.
Sherman, Chris. “Search Engine Strategies 2000.” Information Today, October issue 2000.
Sherman, Chris, “Google Power”, McGraw-Hill Osborne, New York, 2005.
Vossen, Gottfried, “Unleashing Web 2.0: From Concepts to Creativity”, Morgan Kaufmann, Princeton, 2007.