The Future of
Improving Search

The same, only more of it
Faster spiders
Bigger indexes (and directories)
More Boolean operators (like Lexis-Nexis)

The same, but improved functionality
Better query processors
Better translation engines
Faster delivery of Web pages, e.g., Inktomi

Improving Search: Inktomi
Started at UC-Berkeley in 1996 by an engineering professor and one of his graduate students as a faster spider + query processor
Just another Web site, until they realized they could speed up delivery of pages by caching
Caching is the storing of frequently used information...

Inktomi's secret: Caching
Caching is storing frequently used information:
on your PC
on a Web server
at a search engine
in a whole corporation
in the "server farm" of an ISP (e.g., AOL)
Inktomi's real secret: Turning cache into cash
Now owned by Yahoo!
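
The caching idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Inktomi's design: the class name `PageCache`, the stand-in function `fetch_page`, the example URLs, and the least-recently-used eviction rule are all assumptions made for the example.

```python
from collections import OrderedDict


class PageCache:
    """Tiny least-recently-used cache for fetched pages (illustrative only)."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # slow function that retrieves a page
        self.capacity = capacity      # maximum number of pages kept
        self.pages = OrderedDict()    # url -> page, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.pages:
            self.hits += 1
            self.pages.move_to_end(url)      # mark as most recently used
            return self.pages[url]
        self.misses += 1
        page = self.fetch(url)               # slow path: go to the origin
        self.pages[url] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return page


def fetch_page(url):
    # Hypothetical stand-in for a real network request.
    return f"<html>contents of {url}</html>"


cache = PageCache(fetch_page, capacity=2)
for url in ["a.example", "b.example", "a.example", "c.example", "b.example"]:
    cache.get(url)
print(cache.hits, cache.misses)
```

The same pattern applies at every level listed above (PC, Web server, search engine, corporation, ISP); only the capacity and the cost of a miss change.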

Improving Search for Text
Natural-language queries
This is the AskJeeves model on steroids
Applied more narrowly in certain fields, e.g., finance and investing: http://www.iphrase.com
Understand the user: keywords, phrases, sentences, and extended prose, including e-mails and IM sessions. But these can be incomplete, obscure, or "dirty" (e.g., misspellings)
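
One way to cope with "dirty" input such as misspellings is to map each query word to the closest word in a known vocabulary. The sketch below uses Python's standard `difflib` module; the vocabulary, the `clean_query` helper, and the 0.8 cutoff are assumptions chosen for illustration, not a description of how iPhrase or AskJeeves work.

```python
import difflib

# Hypothetical vocabulary a finance site might index.
VOCABULARY = ["dividend", "mutual", "fund", "portfolio", "retirement", "account"]


def clean_query(text, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each word with its closest vocabulary word, if one is close enough."""
    cleaned = []
    for word in text.lower().split():
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        cleaned.append(matches[0] if matches else word)   # keep unknown words as typed
    return " ".join(cleaned)


print(clean_query("retirment acount"))
```

Words with no sufficiently close match are passed through unchanged, so an obscure but correct term is not silently rewritten.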

Improving Search for Text
Natural-language queries
This is the AskJeeves model on steroids
Applied more narrowly in certain fields, e.g., finance and investing: http://www.iphrase.com
Understand the user
Search on context, not "silo"
Guide the user: dynamically build a page on the fly around search terms, not just a list of hits

Improving Search for Text
Natural-language queries
This is the AskJeeves model on steroids
Applied more narrowly in certain fields, e.g., finance and investing: http://www.iphrase.com
Blue Cross/Blue Shield
TD Waterhouse
Lexis-Nexis (!!!)
Motorola
Purchase by IBM announced 11/1/05

Improving Search for Text
Text mining, also called "context analysis"
But to understand text mining, we need to look at data mining

Improving Search for Text
Digression on data mining
Example: American Express special offers
As computing power and databases improved, AE could compare buying patterns and choose cardholders who might buy, say, a leather calendar
What began as a way to identify a few hundred customers soon found a few tens, then a few...
And finally, circa 1995, a target population of one!

Improving Search for Text
Text mining: analogous to data mining
Compare "word patterns" instead of buying patterns
Examine unstructured data (outside DBs) such as e-mail, corporate portal (Web) pages, help files, Word documents, Excel spreadsheets, etc.
Identify patterns, hence information and knowledge, that the company didn't know it knew!
One of the tools of "knowledge management"
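
As a rough illustration of comparing "word patterns," the Python sketch below counts which pairs of words tend to appear in the same document. The sample documents and the `word_pair_counts` helper are invented for the example; real text-mining products use far richer linguistic analysis than simple co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical snippets standing in for e-mails, help files, and portal pages.
DOCUMENTS = [
    "printer driver crashes after update",
    "update breaks printer driver again",
    "customer asks about printer driver refund",
    "refund policy for annual plan",
]


def word_pair_counts(documents):
    """Count how many documents each unordered pair of words appears in together."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))   # unique words, alphabetical
        counts.update(combinations(words, 2))      # every pair within this document
    return counts


pairs = word_pair_counts(DOCUMENTS)
print(pairs.most_common(3))
```

A pair that keeps recurring across unrelated documents, such as "driver" and "printer" here, is the kind of pattern nobody explicitly recorded but that the collection "knows."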

Improving Search for Text
Text mining the Web ????
The Web is a giant collection of unstructured pages
Could text mining find knowledge that the Web (and directories and indexes) doesn't know it has?
This would be the equivalent of asking, "What can you tell me about this topic that I don't know to ask?"
Would it be done within an index? Or by an outside user with a text mining application program?

Improving Search for Text
Text mining the Web ????
There are no answers at the moment, but there are some prominent companies with software that might be players:
ClearForest ClearTags (www.clearforest.com)
Entrieva's SemioMap (www.entrieva.com)
Inxight's Categorizer (www.inxight.com), which can also work with data...

Improving Search for Data
Structured information: some of the "stuff" that neither a typical search engine nor you, as a typical searcher, is equipped to handle...
Financial reports (public-domain)
Sports statistics
Web-based tools could give access to these kinds of data, access that is currently possible only for users who know highly specialized query languages.
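
To make "highly specialized query languages" concrete, here is a small Python sketch that asks a question of structured sports statistics using SQL through the standard `sqlite3` module. The table, column names, and player records are all invented for illustration; no real data source is implied.

```python
import sqlite3

# Hypothetical sports statistics; the table and column names are assumptions.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE batting (player TEXT, season INTEGER, hits INTEGER, at_bats INTEGER)"
)
connection.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?)",
    [("Player A", 2004, 192, 600), ("Player B", 2004, 150, 580), ("Player C", 2004, 90, 250)],
)

# The kind of question a keyword search box cannot answer:
# "Who batted over .300 in 2004 with at least 500 at-bats?"
rows = connection.execute(
    """
    SELECT player, ROUND(1.0 * hits / at_bats, 3) AS average
    FROM batting
    WHERE season = 2004 AND at_bats >= 500 AND 1.0 * hits / at_bats > 0.300
    ORDER BY average DESC
    """
).fetchall()
print(rows)
```

A Web-based tool of the kind described above would build a query like this behind the scenes, so the searcher never has to write it.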

Improving Search for Images
Remember: current search capability, such as that offered by AltaVista and Google, is still largely based on text associated with the images:
Filenames, e.g., mydog.jpg
"alt" attributes within an image tag
The text on a Web page near where the image is displayed.
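
The first two text signals listed above can be pulled out of a page with Python's standard `html.parser` module. This is a sketch only: the `ImageTextExtractor` class and the sample page fragment are assumptions for illustration, the third signal (nearby text) is left out for brevity, and nothing here describes how AltaVista or Google actually index images.

```python
from html.parser import HTMLParser
from posixpath import basename, splitext
from urllib.parse import urlparse


class ImageTextExtractor(HTMLParser):
    """Collect the filename stem and alt text of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        path = urlparse(attributes.get("src") or "").path
        stem = splitext(basename(path))[0]      # "mydog.jpg" -> "mydog"
        self.images.append({"filename": stem, "alt": attributes.get("alt") or ""})


# Hypothetical page fragment.
PAGE = '<p>Our new puppy</p><img src="/photos/mydog.jpg" alt="Golden retriever puppy">'

extractor = ImageTextExtractor()
extractor.feed(PAGE)
print(extractor.images)
```

Note that the search engine never looks at the pixels: if the file were named img0042.jpg and had no alt text, this approach would know nothing about the picture.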

Improving Search for Images
Wait a minute! Why not just turn loose all this computing power we have? Let some server somewhere decide what the various images are.
Easier said than done, because unfortunately...

Improving Search for Images
A computer cannot distinguish between this:
[image not preserved]
And this:
[image not preserved]
although it is obvious to us.

Improving Search for Images
Technologies to tackle this problem have been around for some time and at least one, from Virage, was available on AltaVista in 1996. It allowed the user to vary and try to match:
Color: general color impression
Composition: spatial arrangement of color
Texture: "patterns," e.g., wood, granite, clouds
Structure: shape of objects in the image
The Virage Image Engine
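
To show what matching on "general color impression" can mean, and why it is not enough, here is a minimal Python sketch that compares coarse color histograms. It is not Virage's algorithm: the histogram-intersection measure, the bucket size, and the tiny synthetic "images" (plain lists of RGB values) are all assumptions made for the example.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Fraction of pixels falling in each coarse RGB bucket."""
    step = 256 // levels
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: count / total for bucket, count in buckets.items()}


def histogram_similarity(pixels_a, pixels_b):
    """Histogram intersection: 1.0 means identical color distributions."""
    hist_a = color_histogram(pixels_a)
    hist_b = color_histogram(pixels_b)
    return sum(min(share, hist_b.get(bucket, 0.0)) for bucket, share in hist_a.items())


ORANGE, GREEN, BLUE = (240, 130, 20), (30, 140, 40), (40, 60, 200)

# Hypothetical 100-pixel "images", flattened to lists of RGB tuples.
orange_on_tablecloth = [ORANGE] * 30 + [GREEN] * 70
basketball_on_grass = [GREEN] * 65 + [ORANGE] * 35
boat_on_sea = [BLUE] * 100

print(histogram_similarity(orange_on_tablecloth, basketball_on_grass))
print(histogram_similarity(orange_on_tablecloth, boat_on_sea))
```

The first two scenes score as nearly identical on color alone, even though one shows fruit and the other sports equipment, which is why composition, texture, and structure are offered as additional controls.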

Improving Search for Images
The problem to be solved is that image recognition is visual, not semantic. Yet the thing we can do best on the Web is word-based searching.
Said differently: We still cannot distinguish an orange on a green tablecloth from a basketball on green grass.
But the problem is under attack...

Improving Search for Images
Three of the current players in the field are:
Pixlogic (www.pixlogic.com) - object-level info
BioImagene (www.bioimagene.com) - content
Visfinity (www.visfinity.com) - image management

Connections and Context
Three alternatives to conventional search:
Alexa (www.alexa.com)
Intelligent agents
Mapuccino (from IBM)