Search Engines and Web 2.0 - Report Card

Can search engines access all of the content on the web? The answer is no, but they are certainly trying.

In a recent discussion, Google developers mentioned future attempts at indexing JavaScript content. By this they simply meant they were going to start reading inline event handlers containing URLs, and not much else. Good news for the lazy developer, bad news for web standards.
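
To make "reading inline event handlers containing URLs" a little more concrete, here is a rough sketch in Python of what that kind of shallow parsing could look like. This is purely my own illustration and not how Googlebot actually works: the class name `HandlerURLExtractor`, the helper `extract_handler_urls`, and the regular expression are all assumptions made up for the example.

```python
import re
from html.parser import HTMLParser

# Assumed heuristic: a quoted string inside a handler that starts with
# http://, https:// or / is treated as a URL. A real crawler may differ.
URL_IN_HANDLER = re.compile(r"""['"]((?:https?://|/)[^'"]+)['"]""")


class HandlerURLExtractor(HTMLParser):
    """Collects URL-like string literals found in inline on* attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Inline event handlers are attributes such as onclick or onmouseover.
            if name.startswith("on") and value:
                self.urls.extend(URL_IN_HANDLER.findall(value))


def extract_handler_urls(html):
    """Return every URL-like literal found in inline event handlers."""
    parser = HandlerURLExtractor()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    sample = """<a href="#" onclick="window.location='/products.html'">Products</a>"""
    print(extract_handler_urls(sample))
```

Note that nothing is executed here: the sketch only pattern-matches text inside the attribute, so a URL that is assembled at runtime by a script would never be found. That is the sense of "not much else".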

What about Flash? Or DOM-generated content? AJAX, video, audio and the other modern-day techniques used to manipulate content, CSS trickery included? Well, let’s stop wondering and start documenting instances of each of these.

I would like to start an experiment in indexing. Starting with the techniques I use every day in development, I would like to see whether they are indexed, what happens when they are, and finally whether the indexed solutions are accessible and/or standards compliant.

This experiment is ongoing. I might start new charts every so often, or just update the old ones. What I would like is to have this information on hand as a reference point for all of those who constantly ask, “Does <insert method here> get indexed?”

Charts evaluating indexing, accessibility and web standards status are:
Google Chart for CSS, Javascript, DOM and Flash.
Pages that need to be indexed for this experiment:
Search Engine Experiment - Pages to be Indexed.

Please feel free to comment, correct me or contribute ideas if you feel I have missed anything in the charts below. Google will be our first target; once I have set a baseline, I will move on to other search engines and see how they handle content differently.

Comments

Lindsay Evans says: April 12, 2007 @ 11:30 pm

Interesting idea, your chart mostly confirms what I’ve *felt* to be true of how search engines index content, but it’s nice to see research behind it.

Reminds me that I should re-run my whole ‘how search engines handle structural elements’ experiment again, I was meaning to do it once every 6 months, but I’m lazy :)

It would be good to see if there is any difference between inline, in-page and linked CSS. From what I remember, Googlebot doesn’t grab external style sheets, so there may be different results (I haven’t checked this recently, so I may be wrong).

Lucas Ng says: April 23, 2007 @ 11:48 pm

hidden-text-visibility-hidden.html has been indexed by Google :)

From what I know, Google won’t ‘flag’ a site for dirty CSS tricks until you trip enough thresholds/filters. For example, if you do things in BULK or you are in close proximity to a high PageRank (high authority) link neighbourhood, you would be much more likely to get flagged.

Lindsay, Googlebot has been known to grab external style sheets, but it’s not a sure thing. Most likely, once again, a site needs to trip certain thresholds or filters to ‘make’ Googlebot go after an external CSS file (this is a well-documented case of Googlebot grabbing CSS: http://ekstreme.com/thingsofsorts/seosem/googlebot-requested-a-css-file).

Google also grabs .js files, but from what has been observed, it does only very rudimentary parsing.

Scott, thanks for the Flash experiments, an area I’m not very experienced with at all. At least now I can show our creative team a single table of what works and doesn’t :)

Standardzilla says: April 24, 2007 @ 12:21 am

@lucas - thanks for keeping an eye on that one, I will get to updating those tables soon.

I have lots more to add to those tables where I figure I already know the results, but like you say, I think it’s good just to see the results in a table, side by side, to compare notes.