Directories vs. Indexes
Spiders and Crawlers
Finding the needle in the haystack: keywords
The real power: Boolean operators
A directory has a hierarchical or tree structure:
animal
cat, dog, gerbil, hamster
Collie, Dachshund, German Shepherd
Toy, Miniature, Full
One possible path down the tree:
animal
cat, dog, gerbil, hamster
Collie, Dachshund, German Shepherd
Toy, Miniature, Full
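A tree like this, and a path down it, can be sketched in Python with nested dictionaries. This is only an illustration: the slide lists the levels but not which parent owns which children, so placing the breeds under "dog" and the sizes under "Dachshund", as well as the path chosen, are assumptions.

```python
# A directory modeled as nested dicts: each key is a category and each
# value holds its subcategories (an empty dict marks a leaf).
# Assumption: the breeds sit under "dog" and the sizes under "Dachshund".
DIRECTORY = {
    "animal": {
        "cat": {},
        "dog": {
            "Collie": {},
            "Dachshund": {"Toy": {}, "Miniature": {}, "Full": {}},
            "German Shepherd": {},
        },
        "gerbil": {},
        "hamster": {},
    }
}


def walk(tree, path):
    """Follow `path` from the root and return the subcategories found there.

    Raises KeyError if a step names a category that does not exist.
    """
    node = tree
    for step in path:
        node = node[step]
    return sorted(node)


# One possible path down the tree (a hypothetical choice of branch).
print(walk(DIRECTORY, ["animal", "dog", "Dachshund"]))
```

Each step narrows the context, which is what makes browsing a directory feel like reading a table of contents.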
A directory has a hierarchical or tree structure,
which looks like this in Yahoo...
A directory has a hierarchical or tree structure, like a table of contents
It is context-based, meaning that "adjacent" information is related
This offers efficient and effective browsing
An index has no inherent structure other than words; hence it is like, well, an index
It has granularity, meaning a detailed breakdown of where words are on the Web, without context or a sense of surroundings
This offers efficient and effective searching
Similar to a library or bookstore, with familiar categories
Arranged by subject or topic
And then subtopic and sub-subtopic...
Uses hyperlinks effectively to move "down" the topics, hence well-suited to purposeful browsing
Context and hyperlinks work together:
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...
Because sites/links are chosen by editors, their scope (breadth and depth) is limited
Editing can introduce bias, personal or corporate
Editing can give unbalanced coverage, over- or underemphasizing topics
Currency requires editorial checking of content, link rot, etc.
Some directories charge for "favorable" listings
The cream of the crop: Yahoo! It is a "closed" directory, meaning that its editors are its own employees
Open Directory Project uses unpaid editors and is used by Google (and formerly AltaVista); it is "open"
About.com is a half-open, half-closed hybrid
An index is a database, like a dictionary or thesaurus that lists the URLs where words and phrases appear instead of their definitions
It is machine-created, not human-built
Like any database, it is structured for efficient machine use, not human use
Hence, it is ideally suited for searching... and speed!
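The "dictionary that lists URLs instead of definitions" idea can be sketched as a small inverted index. This is a minimal illustration, not how any particular search engine stores its data; the page URLs and texts below are made up for the example.

```python
from collections import defaultdict

# Hypothetical pages: URL -> page text (stand-ins for real Web content).
PAGES = {
    "http://example.com/aussies": "Australian shepherd breeder and training",
    "http://example.com/sheep": "A shepherd tends sheep",
    "http://example.com/travel": "Australian travel guide",
}


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


INDEX = build_index(PAGES)
# Looking up a word yields locations, not a definition.
print(sorted(INDEX["shepherd"]))
```

The structure is built for fast machine lookup by word; nothing in it records what a page is about or how pages relate to each other.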
Because all sites/links are included, their scope (breadth and depth) is unlimited
Financial costs can limit scope/content, e.g., frequency of revisiting pages already indexed
Indexing programs offer no quality review
Requires high user proficiency...
In Fall 2001, a search on Australian shepherd would have given more than 75K hits on Google but 6M on AltaVista
Yet Google is a much larger database!
So how could this happen? And why would it not happen now? [answer will come later]
Text-focused, less useful for images, sound
Google is now the frontrunner
But there may be reasons to use others: selective coverage, ease of use, comfort, all of which are driven by past experience, the same as preference for a browser
Despite Google's market share, we will also look at AltaVista because of its historical and technological innovations
Automatic "spiders" (also robots, crawlers) find Web pages by following hyperlinks
They retrieve some portion of each page (title, first lines, full text)
The indexer adds the results to the database and calculates "relevancy"
The query processor responds to search requests
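The spider, indexer, and query processor stages can be sketched end to end over a tiny in-memory "Web". Everything here is hypothetical: the page names, texts, and links are invented, no network access takes place, and the relevancy calculation is left out entirely.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("dog breeds directory", ["b.example", "c.example"]),
    "b.example": ("Australian shepherd training", ["a.example"]),
    "c.example": ("gerbil care", []),
    "d.example": ("unlinked page about dogs", []),
}


def crawl(start):
    """Spider: follow hyperlinks breadth-first from `start`, returning {url: text}."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        fetched[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def index_pages(fetched):
    """Indexer: record, for each word, which URLs contain it."""
    index = {}
    for url, text in fetched.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def query(index, word):
    """Query processor: answer a one-word search from the index alone."""
    return sorted(index.get(word.lower(), set()))


INDEX = index_pages(crawl("a.example"))
print(query(INDEX, "shepherd"))  # reached through a link from a.example
print(query(INDEX, "unlinked"))  # d.example is never reached: nothing links to it
```

The last line shows a practical consequence of link-following: a page that no other page links to never enters the index, so no query can find it.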
In Minerva, you can search fields (title, author, subject, title keywords, subject keywords), but these are like Yahoo! topics: a librarian has chosen them
In Web search engines such as AltaVista and Google, you can search full page content, as represented in the indexed database
This requires a very different skill set...
Choosing keywords is equivalent to starting at the "bottom" of a directory:
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...
Directory tree
Index search string
Pick distinctive, unusual, or unique words
Vary their order: sail boat vs. boat sail
Vary their case: boat vs. Boat vs. BOAT
Look at returned results ("hits") to find additional keywords
Check your spelling!
"Boolean" refers to George Boole, a 19th-century British mathematician who developed much of the logic that underlies computer science
"Operators" are mathematical recipes, e.g., in 2+2=4, "+" is the addition operator.
"Boolean operators" are recipes for logical combinations
AND: both keywords connected by this operator must be present on the Web page for a result ("hit") to be returned
Australian AND shepherd
AND is the default for most search engines
Always type Boolean operators in UPPER CASE
OR: at least one of the keywords connected by this operator must be present on the Web page for a result ("hit") to be returned
boundary OR dispute
Australian OR shepherd
A few years ago, OR was the default for Yahoo and AltaVista; this explains the Google vs. AltaVista "discrepancy" [earlier slide]
AND vs. OR: Australian shepherd
AND: 3,450,000 hits on AltaVista
AND: 6,360,000 hits on Google
AND: 3,450,000 hits on Yahoo
OR: 266,000,000 hits on AltaVista
OR: 269,000,000 hits on Google
OR: 267,000,000 hits on Yahoo
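Why OR returns so many more hits than AND becomes clear when the two operators are viewed as set operations on an index: AND is an intersection, OR is a union. The sketch below uses a made-up index with page IDs standing in for URLs; it is an illustration of the logic, not of any engine's internals.

```python
# Hypothetical index: word -> set of page IDs containing it.
INDEX = {
    "australian": {1, 2, 3},
    "shepherd": {2, 3, 4, 5},
}


def search_and(index, *words):
    """AND: pages containing every word (set intersection)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, *words):
    """OR: pages containing at least one word (set union)."""
    return set().union(*(index.get(w.lower(), set()) for w in words))


both = search_and(INDEX, "Australian", "shepherd")
either = search_or(INDEX, "Australian", "shepherd")
print(sorted(both), sorted(either))
```

The AND result is always a subset of the OR result, so an engine that defaults to OR will report at least as many hits for the same keywords.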
The solution is to search on the exact phrase Australian shepherd, which is done by placing it in double quotes:
"Australian shepherd"
This is now almost universal among search engines.
Australian shepherd
default = AND: 3,450,000 hits on AltaVista
default = AND: 6,360,000 hits on Google
default = AND: 3,450,000 hits on Yahoo
"Australian shepherd"
1,670,000 hits on AltaVista
3,480,000 hits on Google
1,670,000 hits on Yahoo
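The drop in hits for the quoted phrase follows from a stricter test: the words must appear consecutively and in order, not merely somewhere on the page. A minimal sketch of the difference, using three invented pages and a simple whitespace tokenizer (both assumptions for illustration):

```python
# Hypothetical pages keyed by ID.
PAGES = {
    1: "Australian shepherd puppies for sale",
    2: "An Australian farmer hired a shepherd",
    3: "shepherd Australian",
}


def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def matches_all(page_words, query_words):
    """Default AND: every query word occurs somewhere on the page."""
    return all(w in page_words for w in query_words)


def matches_phrase(page_words, query_words):
    """Phrase: the query words occur consecutively and in order."""
    n = len(query_words)
    return any(page_words[i:i + n] == query_words
               for i in range(len(page_words) - n + 1))


def search(phrase, exact):
    """Return IDs of pages matching `phrase`, as a phrase if `exact` is true."""
    q = tokens(phrase)
    test = matches_phrase if exact else matches_all
    return [pid for pid, text in PAGES.items() if test(tokens(text), q)]


print(search("Australian shepherd", exact=False))
print(search("Australian shepherd", exact=True))
```

Pages 2 and 3 contain both words, so they satisfy AND, but only page 1 contains them side by side in the right order.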
NOT: the second keyword connected by this operator must NOT be present on the Web page for a result ("hit") to be returned
boundary NOT dispute
Australian AND NOT shepherd
"Australian shepherd" AND NOT breeder
NEAR: the second keyword connected by this operator must be adjacent to the first on the Web page for a result ("hit") to be returned
boundary NEAR dispute
"boundary dispute" NEAR Canada
"adjacent" usually means within 10 words
Available "only" on AltaVista
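AND NOT and NEAR can both be sketched as tests on the word list of a single page. The 10-word window comes from the definition above; the tokenizer and the sample sentence are assumptions, and this is not a description of how AltaVista actually implemented NEAR.

```python
def tokens(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    return [w.strip('.,;:!?"()') for w in text.lower().split()]


def and_not(text, keep, drop):
    """AND NOT: `keep` must be on the page and `drop` must be absent."""
    words = tokens(text)
    return keep.lower() in words and drop.lower() not in words


def near(text, first, second, window=10):
    """NEAR: some occurrence of `second` lies within `window` words of `first`."""
    words = tokens(text)
    firsts = [i for i, w in enumerate(words) if w == first.lower()]
    seconds = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window for i in firsts for j in seconds)


page = "The boundary between the two farms was never in dispute."
print(and_not(page, "boundary", "dispute"))         # "dispute" is present
print(near(page, "boundary", "dispute"))            # 8 words apart
print(near(page, "boundary", "dispute", window=3))  # too far for a 3-word window
```

NEAR sits between AND and a quoted phrase in strictness: the words need not touch, but they cannot be scattered across the page.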
There are two other search tools that are not logical operators, but they are most often combined with Boolean terms to refine searches: "wildcard" and "nesting."
"wildcard": using a special symbol, usually an asterisk (*), to search for part of a word
"boundary dispute resolution" might miss statements such as "X and Y announced today that they had resolved their long-standing boundary dispute." Try this instead:
"boundary dispute" AND resol*, or better yet:
"boundary dispute" NEAR resol*
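A trailing-asterisk wildcard amounts to a prefix test. The sketch below handles only an asterisk at the end of the pattern, which is an assumption made to keep the example short; real engines differ in where they allow the symbol and how many characters it may stand for.

```python
def wildcard_match(pattern, word):
    """Match `word` against a pattern with an optional trailing asterisk.

    "resol*" matches any word starting with "resol"; without an asterisk
    the whole word must match. Case is ignored.
    """
    pattern, word = pattern.lower(), word.lower()
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


for candidate in ["resolved", "resolution", "resolve", "result"]:
    print(candidate, wildcard_match("resol*", candidate))
```

Choosing where to cut the word matters: "resol*" catches resolve, resolved, and resolution, while a shorter stem such as "res*" would also pull in result, research, and restaurant.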
"nesting": using parentheses to combine various operators and lessen ambiguity
"boundary disputes between the U.S. and Canada": long, full of assumptions...
assumes all words present
assumes U.S., not US or United States
assumes Canada, not Canadian
An alternative:
"boundary dispute" AND Canada AND (US OR "U.S." OR "United States")
Another possibility:
"boundary dispute" AND (Canad* NEAR (US OR "U.S." OR "United States"))
But use caution: nesting is very powerful but an easy place to make mistakes
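Nesting can be pictured as a tree of operators that is evaluated from the inside out. In the sketch below the parentheses are written directly as nested Python tuples rather than parsed from a query string, and only AND, OR, and NOT are handled; the tuple format, the tokenizer, and the sample sentences are all assumptions for illustration.

```python
def words_of(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    return [w.strip('.,;:!?"()') for w in text.lower().split()]


def has_term(page_words, term):
    """True if the word or multi-word phrase `term` occurs on the page."""
    q = words_of(term)
    n = len(q)
    return any(page_words[i:i + n] == q
               for i in range(len(page_words) - n + 1))


def evaluate(expr, page_words):
    """Evaluate a nested query: a string is a term, a tuple is (operator, operands...)."""
    if isinstance(expr, str):
        return has_term(page_words, expr)
    op, *args = expr
    if op == "AND":
        return all(evaluate(a, page_words) for a in args)
    if op == "OR":
        return any(evaluate(a, page_words) for a in args)
    if op == "NOT":
        return not evaluate(args[0], page_words)
    raise ValueError(f"unknown operator: {op}")


# "boundary dispute" AND Canada AND (US OR "U.S." OR "United States")
# Note: case is ignored here, so US would also match the pronoun "us".
QUERY = ("AND", "boundary dispute", "Canada",
         ("OR", "US", "U.S.", "United States"))

page1 = words_of("Canada and the United States settled a boundary dispute.")
page2 = words_of("Canada settled a boundary dispute with Denmark.")
print(evaluate(QUERY, page1), evaluate(QUERY, page2))
```

Moving one parenthesis changes the tree and therefore the answer, which is why nested queries are worth building up and testing one level at a time.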
The classic operators:
AND
OR
NOT (AND NOT)
NEAR
And the additions:
"phrase in double quotes"
wildcard
nesting