DITA Coursework Blog: DITA Session 04 - Information Retrieval and the Concept of Relevance

For the purposes of this post, some terminology: as Andy Macfarlane stated in the DITA session 04 lecture, 'retrieval' and 'search' are interchangeable or synonymous terms, as are the phrases "text retrieval" and "document representation" (Chu 2006).

The main learning outcomes from this session, and the topics of this post, are to demonstrate an understanding of:

The three IR points of view
The concept of information needs
Information processes: indexing, searching and query modification
Evaluation in IR and the role of relevance assessment in this process
The difference between database systems and IR

From the user's perspective, the purpose of information seeking, very broadly speaking, is simply the attempt to satisfy a need for information or to fill a gap in knowledge. The conventional architect uses 'wayfinding strategies' to provide information to people passing through, say, an airport: where to go, where not to go, perhaps even using braille for the visually impaired, all of which satisfies the need to know where something is or where to go (paraphrased from Wodtke and Govella 2009). Web sites use similar navigational techniques, in the form of clickable hyperlinks, allowing users to browse in the hope of fulfilling this information need. But what if you aren't quite sure what you are looking for and are in a hurry, or know exactly what you want but not where to find it? Given the vast amount of information now held on web servers, this may well be the case, and you need to search for it. "Of all search systems, none has [the] testing, usage and investment [that] web-wide search tools have..." (Morville and Rosenfeld 2007).

Information needs can be divided into three broad groups, or 'a taxonomy of web search' (Broder 2002). The need is translated into natural language, and a request for that information, or 'query', is made by typing it into a search engine (Macfarlane Lecture Notes 2010):

Navigational queries: To find a specific home page of an organisation
Transactional queries: Searching to find a service such as buying a cheap flight or buying a book from an on-line book shop
Informational queries: To search to fulfill a lack of knowledge on a subject, such as how to deter slugs from coming into your vegetable garden

Navigational queries can be referred to as 'known item retrieval'. In terms of the user's need, the query should return, at the top of the results, the web page or pages that point directly to the home page or organisation they needed to find; thus a high level of 'precision' is required of the search engine.

A definition of precision taken from http://searchenginewatch.com/2156001 : "The degree in which a search engine lists documents matching a query. The more matching documents that are listed, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%." (http://searchenginewatch.com accessed 22-10-2010)
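As a rough sketch of the arithmetic in that definition, precision is just the share of the listed documents that really match. The function name and its parameters are my own for illustration, not something taken from Search Engine Watch; the figures are the ones in the quote.

```python
def precision(matching_listed: int, total_listed: int) -> float:
    """Share of the listed documents that actually match the query."""
    if total_listed == 0:
        # Nothing was listed, so there is nothing to be precise about.
        return 0.0
    return matching_listed / total_listed


# Figures from the quoted example: 80 documents listed, 20 truly matching.
print(f"precision: {precision(20, 80):.0%}")  # precision: 25%
```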

Another possibility is that users may be researching information for references to use in an assignment, as I am for the LIS module on Music and the Information Chain. I enter queries such as:

"Music AND Information Chain" or "The information communication chain applied to Music"

As I am researching a topic I know little about, I need to find as many 'relevant' documents as possible, so I would like a high level of recall.

Definition from http://searchenginewatch.com/2156001 for recall:

"Related to precision, this is the degree in which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may only find 80 of them. It would then list these 80 and have a recall of 80%." (http://searchenginewatch.com accessed 22-10-2010)

This is one side of the author-user relationship mentioned in the last post. The author(s) of the information, or 'source', would presumably want someone to look at, use or interact with the content they have created and hosted on a web server and, as discussed in earlier posts, that content takes the form of a digital representation of information.

The third point of view, marrying the source to the user, is the 'system' view of IR, which sits in the middle: the nuts and bolts of the IR process manifested as software, i.e. the search engine. Morville and Rosenfeld (2007) try to be as succinct as possible when describing the anatomy of search: "there is a lot going on here."

Search engines use a process known as indexing, where chosen terms deemed to be 'relevant' to an individual document are extracted (or abstracted) and stored in a keyword file. Here the term 'relevant' is very subjective, but essentially "the indexer has to choose from related terms, narrower terms, or broader terms, listed in the thesaurus for representation purposes" (Chu 2003). In addition, the index could also include words with the same meaning (as stated in the introduction to this post, terms are often interchangeable), so synonyms are likely to improve relevance. Other indexing mechanisms include:

Phonetic tools: ...they can expand a query on "Smith" to include "Smyth"
Stemming tools: Allowing the user to enter a term (e.g. "lodge") and retrieve documents that contain variant terms with the same stem (e.g. "lodging", "lodger")
Natural language processing tools: These can examine the syntactic nature of a query - for example, is it a "how to" question or a "who is" question? - and use that knowledge to narrow retrieval

Morville and Rosenfeld (2007)
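To make the stemming idea in the list above a little more concrete, here is a minimal sketch in Python. It is only a toy: the suffix list is hand-picked by me to suit the "lodge" example and is an assumption, not how any real search engine stems words (production systems tend to use a proper algorithm such as Porter's).

```python
# Hypothetical, hand-picked suffixes; just enough for the "lodge" example.
SUFFIXES = ("ing", "er", "e")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of 3+ letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, document_words: list[str]) -> list[str]:
    """Return the words whose stem equals the stem of the query term."""
    stem = naive_stem(query)
    return [w for w in document_words if naive_stem(w) == stem]


print(matches("lodge", ["lodging", "lodger", "garden", "lodge"]))
# ['lodging', 'lodger', 'lodge']
```

A query on "lodge" is reduced to the stem "lodg", so the variants are retrieved while an unrelated word such as "garden" is not.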

The origins of IR evaluation lie in tests carried out in 1953 at Cranfield in the UK and by the Armed Services Technical Information Agency in the USA. These evaluated the Uniterm system devised by Mortimer Taube, whereby documents were indexed using single terms taken from titles and abstracts (Ellis 1996).

The first step in indexing is deciding what to index. Fields such as author, title and date are potentially useful for IR. The 'index' or 'keyword file' lists each selected keyword, the number of documents containing that word, and a link from each record to a "postings file" or "inverted file". The inverted file (also described in the post on relational databases) contains fields including a unique document identifier, document statistics such as word frequency in a document, position information for words in a document, and the physical location of that document. These are all recorded sequentially and used by the search engine to retrieve documents when matching keywords are entered.
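The structure described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the three tiny documents are made up, words are split on whitespace only, and the physical location of each document is left out.

```python
from collections import defaultdict


def build_index(documents: dict[int, str]):
    """Build a keyword file and a postings (inverted) file.

    postings[term][doc_id] -> list of word positions in that document
    keyword_file[term]     -> number of documents containing the term
    """
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            postings[word].setdefault(doc_id, []).append(position)
    keyword_file = {term: len(docs) for term, docs in postings.items()}
    return keyword_file, dict(postings)


# Made-up documents keyed by a unique document identifier.
docs = {
    1: "music and the information chain",
    2: "the information need",
    3: "music music everywhere",
}
keyword_file, postings = build_index(docs)

print(keyword_file["music"])  # 2
print(postings["music"])      # {1: [0], 3: [0, 1]}
```

The word frequency in a document is simply the length of its position list, so "music" occurs twice in document 3.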

Searching can use natural language alone, natural language combined with Boolean operators (AND, OR, NOT), or other approaches such as defining a radius from a certain postcode or 'from' and 'to' dates for a chronological search, but essentially all are referred to as queries or query building.
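The Boolean operators map neatly onto set operations, which the following sketch tries to show. The documents and the helper function are hypothetical, and a real engine would look terms up in its index rather than scan every document as this does.

```python
def docs_containing(term: str, documents: dict[int, str]) -> set[int]:
    """Return the ids of the documents that contain the term as a whole word."""
    term = term.lower()
    return {
        doc_id
        for doc_id, text in documents.items()
        if term in text.lower().split()
    }


# Made-up documents keyed by document id.
docs = {
    1: "music and the information chain",
    2: "the information need",
    3: "music in the garden",
}

music = docs_containing("music", docs)
information = docs_containing("information", docs)

print(sorted(music & information))  # music AND information -> [1]
print(sorted(music | information))  # music OR information  -> [1, 2, 3]
print(sorted(music - information))  # music NOT information -> [3]
```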

This highlights one key concept of IR, 'language': "[Natural] language is the language people speak and write. In natural language, no effort is made in IR to limit or define vocabulary, syntax, semantics, and inter-relationships among terms." Additionally, when natural language is combined with 'advanced search options', such as specifying metadata parameters like file type or searching within a specific subset of information, we are query building to enable the most relevant documents to be returned.

In the DITA Session 04 practical, assuming the role of the user or "information seeker", I was given 20 queries and asked to evaluate the relevance of the results returned, using both natural language requests and natural language combined with Boolean logic operators (Macfarlane 2010 lecture notes). The exercise was twofold: to try to classify the different types of information seeking need or 'query types' and, more importantly, to evaluate the results produced by two search engines, www.google.co.uk and www.bing.com. The practical lesson was invaluable for putting the theories of information seeking behavior into practice, as the whole concept of IR is based upon the user's needs and behavior.

This is referred to as 'relevance assessment', and I shall post the queries and results of the assessment in the next post. The methodology is based on the top 5 results shown for each query (so the denominator is 5 when making the calculation), allowing a percentage of either precision or recall to be determined depending on the type of query. Whether high precision or high recall is needed is entirely subjective, depending on the type of query and the needs of the user.
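The calculation over the top 5 results can be sketched as below for the precision case. The relevance judgements are invented purely for illustration and are not my actual results from the practical; the function name is also my own.

```python
def precision_at_k(judgements: list[bool], k: int = 5) -> float:
    """Fraction of the top k results judged relevant.

    The denominator is always k, matching the fixed denominator of 5
    used in the assessment.
    """
    top = judgements[:k]
    return sum(top) / k


# Hypothetical judgements for the first five results of one query:
# True means I judged the result relevant, False means I did not.
example = [True, True, False, True, False]
print(precision_at_k(example))  # 0.6
```

Three relevant results out of five gives 0.6, or 60%.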

Compared with the queries in session 03, where SQL was used to return results from structured data in a database, the results from the relevance assessment are only subjectively relevant to the user, whereas results from data retrieval are 'clear and unambiguous'. The same SQL query will always return the same results, regardless of the user's need for that information.

The learning outcomes from DITA sessions 02, 03 and 04, expanded on in the blog posts so far, will form the basis for the first part of my assignment. The focus will be on the two approaches, data retrieval and information retrieval, and how and why each is used appropriately to professionally represent digital information in the context of Web 1.0 technology.


Friday, 22 October 2010
