Searching the Web
|
|
|
Finding information on the Web |
|
|
|
What it is not: idle browsing |
|
What it is: purposeful searching |
Searching the Web
|
|
|
Web Directories vs. Web Indexes |
|
Spiders and Crawlers |
|
Finding the needle in the haystack: keywords |
Directories vs. Indexes
|
|
|
One possible path down the tree: |
|
animal |
|
cat, dog, gerbil, hamster |
|
Collie, Dachshund, German Shepherd |
|
Toy, Miniature, Full |
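The path down the tree can be sketched in a few lines of Python. This is only an illustration: the nested dictionary is a hypothetical miniature directory, and hanging the Toy/Miniature/Full level under Dachshund is an assumption, since the slide does not say which breed that level belongs to.

```python
# Hypothetical miniature directory: each key is a category,
# each value holds its subcategories (an empty dict means a leaf).
DIRECTORY = {
    "animal": {
        "cat": {},
        "dog": {
            "Collie": {},
            # Assumption: the size level belongs to Dachshund.
            "Dachshund": {"Toy": {}, "Miniature": {}, "Full": {}},
            "German Shepherd": {},
        },
        "gerbil": {},
        "hamster": {},
    }
}


def follow_path(tree, path):
    """Walk down the tree one category at a time.

    Returns the list of choices that were visible at each level.
    """
    choices_seen = []
    node = tree
    for name in path:
        choices_seen.append(sorted(node))
        node = node[name]  # raises KeyError if the category does not exist
    return choices_seen


levels = follow_path(DIRECTORY, ["animal", "dog", "Dachshund", "Miniature"])
for depth, options in enumerate(levels):
    print("  " * depth + ", ".join(options))
```

Each step narrows the choices to the children of the category just picked, which is what makes a directory context-based.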
Directories vs. Indexes
|
|
|
A directory has a hierarchical or tree
structure, |
|
which looks like this in Yahoo
Directory |
|
http://dir.yahoo.com/ |
Directories vs. Indexes
|
|
|
A directory has a hierarchical or tree structure, like a table of contents |
|
It is context-based, meaning that "adjacent" information is related |
|
This offers efficient and effective
browsing |
Directories vs. Indexes
|
|
|
An index has no inherent structure other than words; hence it is like, well, an index |
|
It has granularity, meaning a detailed breakdown of where words are on the Web, without context or a sense of surroundings |
|
This offers efficient and effective
searching |
Directories: Characteristics
|
|
|
Similar to a library or bookstore, with
familiar categories, e.g., pets, history |
Directories: Characteristics
|
|
|
Similar to a library or bookstore, with
familiar categories |
|
Arranged by subject or topic |
|
And then subtopic and sub-subtopic... |
|
Uses hyperlinks effectively to move "down" the topics: use your mouse, not your feet! |
Directories: Characteristics
|
|
|
Similar to a library or bookstore, with
familiar categories |
|
Arranged by subject or topic |
|
And then subtopic and sub-subtopic... |
|
Uses hyperlinks effectively to move "down" the topics, hence well-suited to purposeful browsing |
Directories: Characteristics
|
|
|
Context and hyperlinks work together: |
|
Topic: Animals or pets |
|
Subtopic: Dogs |
|
Sub-subtopic: Australian Shepherds |
|
Target information: Finding a breeder, or training, or cost... |
Directories: Issues
|
|
|
Because sites/links are chosen by editors, their scope (breadth and depth) is limited |
|
Editing can introduce bias, personal or
corporate |
|
Editing can give unbalanced coverage,
over- or underemphasizing topics |
|
Currency requires editorial checking of
content, link rot, etc. |
|
Some directories charge for "favorable" listings |
Directories: Examples
|
|
|
The cream of the crop: Yahoo! It is a "closed" directory, meaning that its editors are its own employees |
|
Open Directory Project uses unpaid editors and is used by Google (and formerly AltaVista); it is "open" |
|
About.com is a half-open, half-closed
hybrid |
Indexes: Characteristics
|
|
|
An index is a database, like a dictionary or thesaurus that lists the URLs where words and phrases appear instead of their definitions |
|
It is machine-created, not human-built |
|
Like any database, it is structured for
efficient machine use, not human use |
|
Hence, it is ideally suited for searching... and speed! |
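A minimal sketch of such an index, assuming a handful of invented pages (the URLs and page texts below are hypothetical, and real search engines store far more than this). It maps each word to the set of URLs containing it, which is why lookups are fast but carry no context.

```python
import re
from collections import defaultdict

# Hypothetical pages: URL -> page text (invented for illustration).
PAGES = {
    "http://example.com/aussies": "Australian Shepherd breeders and training tips",
    "http://example.com/collies": "Collie training classes",
    "http://example.com/hamsters": "Hamster care and cost",
}


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(PAGES)
print(sorted(search(index, "training")))
print(sorted(search(index, "shepherd training")))
```

Adding a second keyword shrinks the result set by intersection; the index itself knows nothing about topics or categories.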
Indexes: Issues
|
|
|
Because all sites/links are included, their scope (breadth and depth) is unlimited |
|
Financial costs can limit
scope/content, e.g., frequency of revisiting pages already indexed |
|
Indexing programs offer no quality
review |
|
Requires high user proficiency |
|
Text-focused, less useful for images,
sound |
Indexes: Examples
|
|
|
Google is now the frontrunner |
|
But there may be reasons to use others: selective coverage, ease of use, comfort, all of which are driven by past experience, the same as preference for a browser |
|
Despite Google's market share, we will also look at AltaVista because of its historical and technological innovations |
Indexes: Spiders and Allies
|
|
|
Automatic "spiders" (also robots, crawlers) find Web pages by following hyperlinks |
|
They retrieve some portion of each page (title, first lines, full text) |
|
Indexer adds the results to the database, calculates "relevancy" |
|
Query processor responds to search
requests |
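The three parts named above (spider, indexer, query processor) can be sketched together. This is a toy under stated assumptions: the "Web" is a hypothetical in-memory dictionary rather than real HTTP fetching, the spider keeps title plus full text, and "relevancy" is simply the raw word count, which is far cruder than what real engines compute.

```python
import re
from collections import Counter, deque

# Hypothetical in-memory "Web": URL -> (title, body text, outgoing links).
WEB = {
    "a.example/home": ("Pets", "dogs cats and hamsters",
                       ["a.example/dogs", "a.example/cats"]),
    "a.example/dogs": ("Dogs", "dogs dogs training for dogs", ["a.example/home"]),
    "a.example/cats": ("Cats", "cats and more cats", ["a.example/dogs"]),
    "a.example/orphan": ("Orphan", "dogs nobody links here", []),
}


def crawl(start):
    """Spider: follow hyperlinks breadth-first, visiting each page once."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        title, body, links = WEB[url]
        fetched[url] = title + " " + body  # keep title plus full text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Indexer: record, for each word, how often it occurs on each page."""
    index = {}
    for url, text in fetched.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for word, count in counts.items():
            index.setdefault(word, {})[url] = count
    return index


def query(index, word):
    """Query processor: rank pages by a toy relevancy score (word count)."""
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


crawled = crawl("a.example/home")
index = build_index(crawled)
print(query(index, "dogs"))
```

Note that the page nobody links to is never found: a spider that only follows hyperlinks cannot reach it, even though it mentions the search word.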
Keywords: An Overview
|
|
|
In Minerva, you can search fields (title, author, subject, title keywords, subject keywords), but these are like Yahoo! topics: a librarian has chosen them |
|
In Web search engines such as AltaVista and Google, you can search full page content, as represented in the indexed database |
|
This requires a very different skill set... |
Keywords: An Overview
|
|
|
Choosing keywords is equivalent to starting at the "bottom" of a directory: |
|
Topic: Animals or pets |
|
Subtopic: Dogs |
|
Sub-subtopic: Australian Shepherds |
|
Target information: Finding a breeder, or training, or cost... |
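One way to picture "starting at the bottom" is to build a search string from the deepest levels of the directory path. The sketch below is illustrative only: the helper name search_string is hypothetical, the target information is reduced to the single keyword "breeder", and wrapping multi-word phrases in double quotes assumes the search engine supports quoted phrase search.

```python
# The directory path from the slide, broadest topic first.
# Assumption: the target information is reduced to one keyword.
PATH = ["Animals or pets", "Dogs", "Australian Shepherds", "breeder"]


def search_string(path, levels=2):
    """Build a query from the deepest `levels` entries of the path.

    Multi-word entries are quoted so they are searched as phrases.
    """
    terms = []
    for entry in path[-levels:]:
        terms.append(f'"{entry}"' if " " in entry else entry)
    return " ".join(terms)


print(search_string(PATH))
print(search_string(PATH, levels=3))
```

The broad topics at the top of the tree are dropped because they add little to a full-text search; the specific terms at the bottom do the work.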
Context vs. Keywords
|
|
|
Topic: Animals or pets |
|
Subtopic: Dogs |
|
Sub-subtopic: Australian Shepherds |
|
Target information: Finding a breeder, or training, or cost... |
|
Directory tree |
|
Index search string |