Tuesday, 26 October 2010

Web 1.0: The Problem of Information Retrieval

The problem of Information Retrieval

“Users needs must be the driver for the concepts, principles and development of new technology employed” (Morville and Rosenfeld 2007). This implies that users and their behavior have driven the many ways in which digital information is created, represented and retrieved on the Web. These representations include documents marked up in languages such as HTML (see a simple example at http://www.student.city.ac.uk/~abjb823/DITA02/Index.html), JPEG images, PDF files, and even on-line databases, all referred to as 'content'. Accessing content and retrieving information should be a quick and painless exercise.

According to Chu (2003), "the essential problem in information representation and retrieval remains how to obtain the right information for the right user at the right time, despite the existence of other variables (e.g. user characteristics, database coverage)". Information retrieval also has economic implications. In the context of a company employee: "What does it cost if every employee in your company spends an extra five minutes per day struggling to find answers on your intranet?" (Morville and Rosenfeld 2007)
Users have two distinct ways to 'find answers': retrieving information stored as data in relational databases (referred to as the 'relational model', Chu 2003) and retrieving information from web pages stored on the World Wide Web (discussed further in my post: http://chrisbrookditablog.blogspot.com/2010/10/dita-session-2-internet-and-world-wide.html ). Comparing the two technologies lets us critically evaluate how well each allows the user to 'obtain the right information at the right time' (Chu 2003).

The Relational Model

In the relational data model, the types of things that are relevant in the real world are referred to as 'entities' and are structured into two-dimensional tables (e.g. students at City University would be one entity, with one student per row and name, address, etc. stored in columns; post-graduate schemes would be another entity). Each record is then indexed using an arbitrary, unique numerical 'key'.
Indexing matters because each record needs to be uniquely identifiable. The term 'primary key' refers to the unique identifier for a given record in a given table. Storing one table's key alongside the records of another allows tables of "related" information to be linked, and records to be retrieved from more than one table to suit the needs of the user (e.g. students and post-graduate courses are related entities: typically each student is enrolled on one course while a course can have many students, a many-to-one relationship). The relational database and the relational data model provide a much more robust way to store and retrieve data than the other formats used on the web (Chu 2003).
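
To make the idea of keys and linked tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are my own illustrative assumptions, not taken from Chu (2003) or from any real City University system.

```python
import sqlite3

# In-memory database; every table, column and row below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course (
        course_id INTEGER PRIMARY KEY,   -- arbitrary unique numeric key
        title     TEXT NOT NULL
    );
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,  -- arbitrary unique numeric key
        name       TEXT NOT NULL,
        address    TEXT,
        course_id  INTEGER REFERENCES course(course_id)  -- link to the course table
    );
""")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [(1, "Information Science"), (2, "Library Science")],
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(101, "A. Student", "London", 1), (102, "B. Student", "Leeds", 2)],
)


def student_with_course(student_id):
    """Retrieve one record built from both tables by following the key."""
    return conn.execute(
        """
        SELECT s.name, c.title
        FROM student AS s
        JOIN course AS c ON s.course_id = c.course_id
        WHERE s.student_id = ?
        """,
        (student_id,),
    ).fetchone()


print(student_with_course(101))
```

The join only works because each row carries a key that identifies it uniquely, which is the point made above about primary keys.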

"Traditional Database Management Systems consist of two parts: sequential files and inverted files. The sequential file is the database source, containing information organised into records in the field-record-database structure. Called the sequential file because records are ordered in the sequence they take when they were entered into the database. The inverted file (or index file) provides access to the sequential file, according to given search queries. Called the inverted file, because the order in which the information is presented (i.e. access point first, locators second) is the reverse of that in the sequential files (i.e. locators first, access point second.)" (Chu 2003)

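The two file types in the quote above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the records, field names and helper functions are hypothetical, and a real DBMS stores these structures on disk rather than in lists and dictionaries.

```python
from collections import defaultdict

# Sequential file: records in the order they were entered (locator first).
# The field names and values are invented for this example.
sequential_file = [
    (1, {"surname": "Smith", "course": "Information Science"}),
    (2, {"surname": "Jones", "course": "Library Science"}),
    (3, {"surname": "Smith", "course": "Library Science"}),
]


def build_inverted_file(records, field):
    """Inverted file: access point (a field value) first, locators second."""
    inverted = defaultdict(list)
    for locator, record in records:
        inverted[record[field]].append(locator)
    return dict(inverted)


def lookup(inverted, access_point):
    """Use the inverted file to fetch matching records from the sequential file."""
    by_locator = dict(sequential_file)
    return [by_locator[locator] for locator in inverted.get(access_point, [])]


surname_index = build_inverted_file(sequential_file, "surname")
print(surname_index)
print(lookup(surname_index, "Smith"))
```

A query on "Smith" goes to the inverted file first and only then touches the two matching records, instead of scanning every record in entry order.
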
Creating, updating and retrieving data held in a relational database management system is done through queries, most commonly written in SQL (Structured Query Language). "SQL was one of the first languages for Edgar F. Codd's relational model in his influential 1970 paper, 'A Relational Model of Data for Large Shared Data Banks' and became the most widely used language for relational databases." (http://en.wikipedia.org/wiki/SQL Accessed 17-10-2010)

Using sequential tables and Database Management System (DBMS) interfaces to query the information keeps entities independent of one another, allowing users to retrieve information stored in one or more tables. In terms of maintenance, records in those tables can be updated independently without compromising data integrity. The key to retrieving information successfully from a relational database is ensuring that the user understands the relationships between tables.
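
As a hedged illustration of independent updates and data integrity, the sketch below again uses Python's sqlite3 module. The schema, the sample rows and the helper functions (rename_course, course_of, try_enrol) are assumptions made for this example only. Note that SQLite enforces foreign keys only when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE course (
        course_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        course_id  INTEGER NOT NULL REFERENCES course(course_id)
    );
    INSERT INTO course VALUES (1, 'Information Science');
    INSERT INTO student VALUES (101, 'A. Student', 1);
""")


def rename_course(course_id, new_title):
    """Update one record in one table; no student row needs to change."""
    conn.execute(
        "UPDATE course SET title = ? WHERE course_id = ?", (new_title, course_id)
    )


def course_of(student_id):
    """Look up a student's course title through the key that links the tables."""
    row = conn.execute(
        """
        SELECT c.title
        FROM student AS s
        JOIN course AS c ON s.course_id = c.course_id
        WHERE s.student_id = ?
        """,
        (student_id,),
    ).fetchone()
    return row[0] if row else None


def try_enrol(student_id, name, course_id):
    """Return False if the DBMS rejects the row (e.g. an unknown course)."""
    try:
        conn.execute(
            "INSERT INTO student VALUES (?, ?, ?)", (student_id, name, course_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False


rename_course(1, "Information Science and Technology")
print(course_of(101))                      # the student row itself was never edited
print(try_enrol(102, "B. Student", 99))    # course 99 does not exist
```

The course title is stored once, so one update is seen by every query that joins to it, and the DBMS refuses a student record that points at a course that is not there.
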
"Today, the relational model is the dominant data model as well as the foundation for the leading DBMS products, which include IBM’s DB2 family, Informix, Oracle, Sybase, Microsoft’s Access and SQLServer, as well as FoxBase and Paradox. RDBMS represent close to a multibillion-dollar industry alone." (http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010)
The advantages to a relational database (over other representations such as documents stored as web pages), according to wiki.answers.com are:
• structural independence
• simplicity
• ad-hoc query capability
• easy to design
• security control
• non procedural access language

(http://wiki.answers.com/Q/What_are_the_benefits_of_relational_data_model Accessed 18-10-2010)

Critically evaluating the technology of databases must also include the disadvantages or shortfalls of the relational model. A description by http://www.aspfree.com:
"There are limitations to the relational database management system. First, relational databases do not have enough storage area to handle data such as images, digital and audio/video. The system was originally created to handle the integration of media, traditional fielded data, and templates. Another limitation of the relational database is its inadequacy to operate with languages outside of SQL. After its original development, languages such as C++ and JavaScript were formed. However, relational databases do not work efficiently with these languages. A third limitation is the requirement that information must be in tables where relationships between entities are defined by values."
(http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010)

Using databases to store information is only practical where the data can be structured in tables and entities can be represented with discrete values. Field values must be consistent across the data set.

The World Wide Web

The millions of web pages accessible through an Internet connection can be referred to as ‘unstructured information’ (Macfarlane lecture notes 2010). In contrast to the relational model, information is not organized within two-dimensional tables. Rather than using keys to link entities together in relationships, web pages are addressed using uniform resource locators (URLs) and displayed using a web browser. Information retrieval (in this context) specifically describes retrieval of this unstructured information, and the “key concept is to bridge the gap between the writer (author) and the user (searcher) of information.” (Wodtke & Govella 2009)

Information seeking on the web is predominantly done using natural language. According to Google Zeitgeist: "A recent look at the most popular searches at Yahoo! and Google showed that 80% of searches were one and two word queries. Those one or two words have to somehow be enough to turn up the object the user is looking for..." (Wodtke & Govella 2009)
Search engines use a process known as indexing, in which terms deemed 'relevant' to an individual document are extracted (or abstracted) and stored in a keyword file. Here the term 'relevant' is very subjective, but essentially "the indexer has to choose from related terms, narrower terms, or broader terms, listed in the thesaurus for representation purposes." (Chu 2003). In addition, the index can include words with the same meaning (terms are often interchangeable), so synonyms are likely to improve relevance. Other indexing mechanisms include:

• Phonetic tools: Can expand a query on "Smith" to include "Smyth"
• Stemming tools: Allow the user to enter a term (e.g. "lodge") and retrieve documents that contain variant terms with the same stem (e.g. "lodging", "lodger")
• Natural language processing tools: Can examine the syntactic nature of a query (for example, is it a "how to" question or a "who is" question?) and use that knowledge to narrow retrieval
(Morville and Rosenfeld 2007)
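
The phonetic and stemming tools in the list above can be sketched roughly as follows. Morville and Rosenfeld do not say which algorithms real search tools use, so both functions are my own assumptions: soundex follows the well-known Soundex coding scheme, and naive_stem is a deliberately crude suffix stripper (far simpler than a production stemmer) that only knows three suffixes.

```python
# Letter groups used by the Soundex phonetic code.
SOUNDEX_CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return a four-character phonetic code: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            previous = digit
    return (code + "000")[:4]


SUFFIXES = ("ing", "er", "e")  # hypothetical, tiny suffix list for illustration


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(soundex("Smith"), soundex("Smyth"))
print([naive_stem(w) for w in ("lodge", "lodging", "lodger")])
```

Terms that share a phonetic code or a stem can then be indexed under the same entry, so a query on one form retrieves documents containing the others.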

The first step in indexing is deciding what to index. 'Index' or 'keyword' files list each selected keyword in the text (excluding stop words such as a, an, and, the, etc.), the number of documents containing that word, and a link from each record to a "postings file" or "inverted file". The inverted file (also described above under the relational model) contains fields including a unique document identifier, the frequency of the keyword and the physical location of that document on the disk.
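
A minimal sketch of a keyword file and postings file, assuming a tiny made-up document collection and stop word list. For brevity it leaves out the physical disk location mentioned above and keeps only the document identifier and keyword frequency.

```python
import re
from collections import Counter, defaultdict

# Illustrative stop word list and documents; both are invented for this example.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

documents = {
    1: "The student searches the database",
    2: "A database stores records and a student retrieves records",
}


def build_index(docs):
    """Return (keyword_file, postings_file) for a dict of {doc_id: text}."""
    postings = defaultdict(list)  # keyword -> [(doc_id, frequency), ...]
    for doc_id in sorted(docs):
        words = re.findall(r"[a-z]+", docs[doc_id].lower())
        counts = Counter(word for word in words if word not in STOP_WORDS)
        for word, frequency in counts.items():
            postings[word].append((doc_id, frequency))
    # Keyword file: each keyword with the number of documents containing it.
    keyword_file = {word: len(entries) for word, entries in postings.items()}
    return keyword_file, dict(postings)


keyword_file, postings_file = build_index(documents)
print(keyword_file["database"])
print(postings_file["records"])
```

Here "database" appears in both documents, while "records" appears twice in document 2 only, and stop words never reach the index at all.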

Searching can use natural language alone or in conjunction with the Boolean operators AND, OR and NOT (other mechanisms, e.g. radius from a certain postcode, can sometimes be used); all of these are referred to as queries or query building. The purpose of building the query is to display either the single most 'relevant' document or many relevant documents, depending on the information need.
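
Boolean operators map naturally onto set operations over an index. The sketch below assumes a hypothetical index that maps each keyword to the set of document identifiers containing it; the keywords and identifiers are invented.

```python
# Hypothetical keyword index: keyword -> set of document identifiers.
index = {
    "music": {1, 2, 4},
    "information": {2, 3, 4},
    "chain": {4, 5},
}


def docs_for(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return index.get(term.lower(), set())


def boolean_and(left, right):
    return left & right  # documents in both sets


def boolean_or(left, right):
    return left | right  # documents in either set


def boolean_not(left, right):
    return left - right  # documents in the first set but not the second


print(boolean_and(docs_for("Music"), docs_for("Information")))
print(boolean_or(docs_for("Music"), docs_for("Chain")))
print(boolean_not(docs_for("Music"), docs_for("Chain")))
```

AND narrows the result set, which tends to favour finding a few closely matching documents, while OR widens it, which tends to favour finding as many matching documents as possible.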

In 'known item' or 'fact retrieval' searching (Macfarlane 2010), as with SQL queries, the user knows from knowledge gained elsewhere that the answer exists, but needs to supplement it with further information. Searching for your bank's home page so you can make an on-line bill payment is 'known item' retrieval, requiring high 'precision'.

A definition of precision taken from http://searchenginewatch.com/2156001 : "The degree in which a search engine lists documents matching a query. The more matching documents that are listed, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%."(http://searchenginewatch.com Accessed 22-10-2010)

Another possibility is researching information for an assignment on a topic one knows little about, as I am for the LIS module on Music and the Information Chain. I enter queries such as "Music AND Information Chain" in the hope of exploring the depth of the subject. I need to find as many 'relevant' documents as possible, which requires a high level of 'recall'. The process of searching is of course rather more complex, as users may search iteratively or, as Morville and Rosenfeld (2007) describe, follow "The berry picking model, where users start with an information need and move iteratively through an information system along potentially complex paths, picking bits of information ("berries") along the way."

Definition from http://searchenginewatch.com/2156001 for recall:

"Related to precision, this is the degree in which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may only find 80 of them. It would then list these 80 and have a recall of 80%." (http://searchenginewatch.com accessed 22-10-2010)
The process of calculating precision and recall is referred to as 'Relevance Assessment'. Whether high precision or high recall is needed is entirely subjective and depends on the type of query and the needs of the user. A relevance assessment test, using 20 queries of varying types on two search engines (Google and Bing), was carried out during a lab session; the results are shown in my post 'Information Retrieval - Relevance Assessment, Bing vs Google'.
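
The two definitions quoted above reduce to simple ratios. The sketch below uses the figures from the searchenginewatch.com examples; the function and parameter names are my own, and returning 0.0 when the denominator is zero is a convention I have assumed, not part of either definition.

```python
def precision(relevant_retrieved, total_retrieved):
    """Share of the listed documents that actually match the query."""
    if total_retrieved == 0:
        return 0.0  # assumed convention for an empty result list
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Share of all matching documents in the collection that were found."""
    if total_relevant == 0:
        return 0.0  # assumed convention when nothing in the collection matches
    return relevant_retrieved / total_relevant


# 80 documents listed, only 20 of them contain the search words.
print(f"precision: {precision(20, 80):.0%}")
# 100 matching documents exist, the engine finds 80 of them.
print(f"recall: {recall(80, 100):.0%}")
```

A known-item search cares mostly about the first ratio, whereas exploratory research on an unfamiliar topic cares mostly about the second.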

Conclusions

The wholly subjective question of 'what is relevant' highlights the key difference between information retrieval on the web and in a database. In a database, the same query will return the same results regardless of the user's information seeking behavior; the results are unambiguous.

Web search allows much more flexibility and gives access to information using natural language. With the limitations of rigid structure removed, information seeking can become a learning exercise: the more searches users carry out, the more knowledge they gain about potential information sources for future searches. "Of all search systems, none has more testing usage and investment than Web-wide search tools have..." (Morville and Rosenfeld 2007). That said, a search engine business has the power to return results in whatever way its designers see fit, in line with its commercial interests. Using databases and SQL removes this 'unknown' quantity; however, that is perhaps a discussion for elsewhere.

References

Chu, H. (2010). Information representation and retrieval in the digital age (2nd ed). Medford, N.J.: Information Today
Macfarlane, A. (2010). DITA Course lecture notes
Morville, P., & Rosenfeld, L. (2006). Information architecture for the world wide web (3rd ed). Beijing; Farnham: O'Reilly
Wodtke, C., & Govella, A. (2009). Information Architecture: Blueprints for the web. (2nd ed.) Berkeley, Calif: New Riders
http://en.wikipedia.org/wiki/SQL, Accessed 17-10-2010
http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/, Accessed 18-10-2010
http://searchenginewatch.com Accessed 22-10-2010
