The Invisible Web

What is it?

Who is to blame?

Why/where is it?

When will it be visible?

What is the Invisible Web?

"Pages" (text, images, other files, other info) accessible over the Internet with a Web browser that search engines do not include in their indexes, because either

they are technically inaccessible, or

they are excluded by choice.

Who is to blame?

No one and everyone:

Directories don't care about completeness

Indexes can't keep up with growth of pages

Webmasters may not welcome spiders

We don't want to pay for information

Why/Where is it?

Pages are ordinary, but inaccessible.

In the context of the Bow Tie Theory they are "disconnected."

Hence spiders/crawlers cannot find them.
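The "disconnected" idea can be illustrated with a small sketch. The link graph and page names below are hypothetical, and the crawler is only a plain breadth-first traversal, not any particular search engine's method: pages that no crawled page links to are never discovered.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "about"],
    "news": ["home", "story"],
    "about": [],
    "story": ["home"],
    "orphan": ["home"],      # links out, but nothing links to it
    "island": ["island2"],   # a separate component of the graph
    "island2": ["island"],
}


def crawl(seeds, links):
    """Return the set of pages reachable by following links from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


visible = crawl(["home"], LINKS)
invisible = set(LINKS) - visible
print("visible:", sorted(visible))
print("invisible:", sorted(invisible))
```

Starting from "home", the crawl never reaches "orphan", "island", or "island2", even though each is an ordinary page.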

Why/Where is it?

Pages are ordinary, but excluded by

Webmaster: Robots Exclusion Protocol,

which is implemented by including a file such as www.mysite.com/robots.txt, containing

User-agent: *

Disallow: /
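As a rough sketch of how a well-behaved spider might honor these rules, Python's standard `urllib.robotparser` module can evaluate them. The bot name "ExampleBot", the page URLs, and the second, partial rule set are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide: every robot is excluded from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant for contrast: only one directory is excluded.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked_everywhere = allowed(BLOCK_ALL, "http://www.mysite.com/index.html")
public_page = allowed(BLOCK_PRIVATE, "http://www.mysite.com/index.html")
private_page = allowed(BLOCK_PRIVATE, "http://www.mysite.com/private/page.html")
print(blocked_everywhere, public_page, private_page)
```

The protocol is advisory: the file only asks spiders to stay away, and it is up to each spider to check it.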

Why/Where is it?

Pages are ordinary, but excluded by

Webmaster: Robots Exclusion Protocol

Webmaster: Robots META tag, which is implemented by putting a line like this in the <head> section of the HTML code:

<meta name="robots" content="noindex, nofollow">
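A minimal sketch of how a spider could read this tag, using Python's standard `html.parser`. The class name `RobotsMetaParser` and the sample page are hypothetical; real crawlers may treat the directives differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>Hello</body></html>"
)

parser = RobotsMetaParser()
parser.feed(PAGE)
may_index = "noindex" not in parser.directives
may_follow = "nofollow" not in parser.directives
print(sorted(parser.directives), may_index, may_follow)
```

Unlike robots.txt, the tag is only seen after the page has been fetched, so it saves index space but not server load.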

Why/Where is it?

Pages are ordinary, but excluded by

Webmaster: Robots Exclusion Protocol or META tag, because

Content changes frequently

Extra load on server

Older content is archived/pay-only

Search engine:

some content is "unworthy"

some content is too deep

Why/Where is it?

Pages are ordinary*, but incomprehensible

images (.gif, .jpg files)

audio (.wav files)

video (.mpg, .mov files)

*definition of "ordinary" in this context:

images display in a browser; audio/video files require a widely available plug-in (e.g., MS Media Player)

Why/Where is it?

Pages are ordinary, but ephemeral: a faster version of the "newspaper archive" problem

weather data

stock-market data

flight arrival/departure data

Why/Where is it?

Pages are extraordinary though accessible

PDF (Portable Document Format) files: Adobe Acrobat Reader [has become "ordinary"]

PostScript files (the choice in computer science)

Flash, Shockwave, dynamic graphics

programs (executables, .exe)

compressed files (.zip, .tar)

Technically, these can be indexed; economically, they cannot.

Why/Where is it?

Pages are extraordinary and not accessible: the Really Invisible Web, at least for now.

Pages that (may) require a sign-in, for example

The New York Times (required by site)

eBay, Amazon.com, Travelocity, TowerRecords (required by visitor)

Why/Where is it?

Pages are extraordinary and not accessible: the Really Invisible Web, at least for now.

Pages that (may) require a sign-in

Data (really "databases") that must be reached through forms (text boxes, radio buttons, etc.) in a Web page, for example

towerrecords.com

amazon.com: "The Infinite Regress" problem
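Why forms hide data from a spider can be sketched as follows. The page, the `/search` action, and the class name `LinkAndFormCounter` are hypothetical. The sketch assumes a simple spider that only follows `<a href>` links: it can see that a form exists, but it has no way of knowing what to type into it.

```python
from html.parser import HTMLParser


class LinkAndFormCounter(HTMLParser):
    """Record the links a spider could follow and the forms it cannot fill."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "form":
            self.forms.append(attr.get("action") or "")


# Hypothetical catalog page: one static link and one search form.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="submit" value="Search">
</form>
</body></html>
"""

counter = LinkAndFormCounter()
counter.feed(PAGE)
print("followable links:", counter.links)
print("forms the spider cannot fill in:", counter.forms)
```

Every record behind `/search` exists only as the answer to some query, so none of it appears in the list of followable links.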

When will it be visible?

Well, a few years ago, Google didn't exist.

AltaVista had no images in its index.

Now both offer images as "ordinary."

Surely, PDF, PostScript, and the like are just around the corner, at least in Internet years.

Can databases be far behind?

