DITA Coursework Blog: DITA Session 04 Information Retrieval (Notes from Information Architecture: Blueprints for the Web, Wodtke and Govella 2009)

The session on Information Retrieval focused on employing and evaluating techniques for retrieving unstructured information (as opposed to structured information held in databases).

The book Information Architecture: Blueprints for the Web (Wodtke & Govella 2009) provides some useful concepts, which I have added here. The book is on short loan, so I will use the references collected here in my assignment, particularly where the book discusses information retrieval in the context of websites.

The key concept is to bridge the gap between the writer (author) and the user (searcher) of information.

In relation to designing good websites, chapter 2 asks "Who is this site for" and advises the designer to "discover who the target user is".

Metadata: "Metadata is the basis for all organisation systems... It is the brick of the IA house and can be arranged into a wide variety of retrieval systems, depending on what you need. Information can come in many forms - an article, an e-book, a photograph, or a catalog. Some information isn't made of words - for example a Flash movie, a sound in MP3 format, or a photograph. When there are very few words inherent in the information, as with photos and music, metadata helps find it."

Three types of metadata:

Intrinsic. Metadata about the thing's composition. Is it an MS Word document, a JPEG, a 20kb file, or a zip file?
Administrative. Metadata about the way the thing will be handled. Is it a temporary thing, or does it need to be archived? Has it been approved for publication?
Descriptive. Metadata about the nature of the thing. This is the most important thing for [our] purposes and the most commonly used on the Web. Is it fiction or fact? Is it an article? What's the subject? What are related subjects?
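
To make the three types concrete for myself, here is a minimal Python sketch of how one item's metadata might be recorded and then used to find a photograph that contains no words of its own. The field names, the sample items and the find_by_subject helper are my own illustrative assumptions; none of them come from the book.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metadata:
    # Intrinsic: the thing's composition
    file_format: str
    size_kb: int
    # Administrative: how the thing will be handled
    approved_for_publication: bool = False
    archive: bool = False
    # Descriptive: the nature of the thing
    content_type: str = ""
    subjects: List[str] = field(default_factory=list)


def find_by_subject(items: Dict[str, Metadata], query: str) -> List[str]:
    """Return names of approved items whose subjects cover every query word."""
    terms = query.lower().split()
    matches = []
    for name, meta in items.items():
        if not meta.approved_for_publication:
            continue  # administrative metadata decides what may be shown
        subjects = [s.lower() for s in meta.subjects]
        if all(any(term in s for s in subjects) for term in terms):
            matches.append(name)
    return matches


# Hypothetical items, for illustration only.
items = {
    "sunset.jpg": Metadata("JPEG", 20, True, True, "photograph", ["sunset", "beach"]),
    "draft.doc": Metadata("MS Word", 45, False, False, "article", ["sunset photography"]),
}

print(find_by_subject(items, "beach sunset"))
```

Here a two-word query finds the photograph purely through its descriptive metadata, while the unapproved draft is filtered out by its administrative metadata.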

Google Zeitgeist: "A recent look at the most popular searches at Yahoo! and Google showed that 80% of searches were one and two word queries. Those one or two words have to somehow be enough to turn up the object the user is looking for, and effective metadata is a good way to stretch that word faaaaarther"

Beyond the inner workings of search, there is a lot to be done in what is known as the "presentation layer" after recall and precision have hopefully been sorted out:

Search must be fast: "Major search engines make sure they have enough computing power to work through their index and get results swiftly. And brag about it: results 1-10 of about 62,700,000 (0.09 seconds)."

Results must load quickly

Results must be scannable. Eye-tracking studies (page 105) were done on three major search engines in 2006 by Enquiro (see http://www.enquiroresearch.com/). Google had the smallest and tightest "Golden Triangle" - users were able to quickly select a result worth clicking without having to scan more of the page.

Link to Enquiro's research, which shows Google's "Golden Triangle":

http://www.enquiro.com/enquiro-develops-googles-golden-triangle.php

My next post will present the results of the DITA Session 04 practical on IR, using quantitative evaluation of precision, i.e. relevant documents retrieved / total documents retrieved. By entering 20 different searches, using natural language and Boolean operators, I will compare the precision of the top 5 results returned by Google and Bing and draw some conclusions.
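
The precision measure above can be sketched in a few lines of Python. This is only a rough illustration of the arithmetic, assuming each of the top 5 results has already been judged relevant or not by hand; the function names, the engine labels and the judgement values are made up for the example and are not results from the practical.

```python
from typing import Sequence


def precision_at_k(judgements: Sequence[bool], k: int = 5) -> float:
    """Precision over the top k results: relevant retrieved / total retrieved."""
    top_k = list(judgements[:k])
    if not top_k:
        raise ValueError("no results to evaluate")
    return sum(top_k) / len(top_k)


def mean_precision(per_query: Sequence[Sequence[bool]], k: int = 5) -> float:
    """Average precision-at-k over a set of queries (e.g. the 20 searches)."""
    scores = [precision_at_k(judgements, k) for judgements in per_query]
    return sum(scores) / len(scores)


# Hypothetical relevance judgements (True = judged relevant), illustration only.
engine_a = [
    [True, True, False, True, False],   # query 1: 3 of 5 relevant
    [True, False, False, False, True],  # query 2: 2 of 5 relevant
]
engine_b = [
    [True, False, False, True, False],  # query 1: 2 of 5 relevant
    [False, False, True, False, False], # query 2: 1 of 5 relevant
]

print(f"Engine A mean precision at 5: {mean_precision(engine_a):.2f}")
print(f"Engine B mean precision at 5: {mean_precision(engine_b):.2f}")
```

Dividing by the number of results actually returned, rather than always by 5, keeps the measure faithful to "total documents retrieved" when a query returns fewer than 5 results.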

Further, new developments in search engine technology, such as Google Suggest, Yahoo! Search Assist, vertical search (searching a sub-set of web pages) and best bets, are topics I wish to cover in other posts.

One limitation described by Wodtke and Govella of using large search engines like Google and Yahoo! as plug-in search for individual sites is that "[they] are never going to know your business the way you know your business. They don't have your log files, your user testing results, your internal metadata, and your good old-fashioned know how. Plus [with] the rise of open source search tools such as Lucene, ht://dig, SWISH-E, Solr, Ferret, and many more, as well as customisable search services such as Yahoo! BOSS, the basic search problem is much easier to get right with fewer resources." (Wodtke and Govella 2009)

For the next post I will try to summarise the learning outcomes of DITA Session 04, present the results from the practical on evaluation using precision, and attempt to compare the types of search used here on unstructured information (e.g. the Web) with those of Session 03 on searching structured information held in databases using SQL.

Monday, 18 October 2010
