Sunday, 31 October 2010

Information Retrieval - Relevance Assessment, Bing vs Google

During the practical labs in DITA session 04, I carried out an evaluation of information retrieval, focusing on the precision of the result sets returned for a set of queries by two search engines. Precision is measured by:

Precision = relevant documents retrieved / total documents retrieved

Precision measures the quality of the results returned by a search engine. With web searches, 'relevant' is entirely subjective, so the number of relevant documents retrieved was a personal interpretation of what I deemed most relevant for each query.

Methodology

Twenty queries were used to evaluate the relevance of the results returned, first using natural language searches and then combining natural language with Boolean operators and quotes (AND, OR, ""), on www.google.co.uk and then www.bing.com.

The top 5 results were deemed to represent the total documents retrieved for the purpose of this evaluation.
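To make the calculation concrete, here is a minimal sketch in Python of precision over a top-5 result list; the relevance judgements are invented for illustration, not taken from the lab results.

# Precision over the top 5 results for one query.
# The relevance judgements below are invented for illustration only.
def precision_at_5(judgements):
    """Relevant documents retrieved / total documents retrieved (here, 5)."""
    return sum(judgements) / len(judgements)

# Hypothetical judgements for the top 5 results of one query:
judgements = [True, True, True, False, True]
print(f"Precision: {precision_at_5(judgements):.0%}")   # -> 80%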

The exercise was twofold: to classify the different types of information-seeking needs or 'query types', and, more importantly, to evaluate the results produced by the two search engines. IR is based upon the user's needs and behaviour.

The 3 main types of information need, as defined by Broder (2002) are:

Navigational queries: finding the home page of an organisation
Transactional queries: searching for a service in order to make a transaction, e.g. www.ebay.com to purchase an item
Informational queries: satisfying a need for information, for example, how do I get rid of slugs in my garden?

The 20 queries used were:

1. Find the Website of Oxford University (Navigational)
2. Find the website of the 10th running of the International Society of Music Information Retrieval (Navigational)
3. Find the website of the organisation which represents Library Schools in the UK (Navigational)
4. Where can I buy bookcases in the UK? (Transactional)
5. Where can I buy sofas in the London area? (Transactional)
6. Find sites where you can buy car insurance (Transactional)
7. Find sites where you can compare car insurance (Transactional)
8. Find the cheapest flight to Montevideo, departing next week, returning first week of January (Transactional)
9. Find the cheapest holiday to the Costa del Sol in Spain for the month of July (Transactional)
10. Who is the president of Uruguay? Can you find a biography of them? (Informational)
11. What are Ukuleles? (Informational)
12. Who was captain swing and what role did he play in early 19th Century English history? (Informational)
13. Try and find videos which provide some training on how to use search engines (as many search engines as possible). (Informational)
14. What are 'Jerusalem artichokes' and how do you cook them? Combining AND and the * wildcard could help return relevant results for both questions (Informational)
15. What are the origins of the Korean War (1950 to 1953)? (Informational)
16. How do hot air balloons work? NB The Boolean query also returned two videos on how hot air balloons work (Informational)
17. What were the Putney Debates, and what was their impact? (Informational)
18. Who were the Levellers and what role did they play in the English Civil War? (Informational)
19. Why did Hitler order the invasion of the Soviet Union in 1941? (Informational)
20. Find an image of a happy person. (Navigational)

Overall, the average precision across all query types was:

Using Natural Language: Google:70% Bing:75%
Adding Boolean operators: Google:71% Bing:78%

Analysing the precision of the three types of query, using Broder's taxonomy:

Google (no. of queries)    Natural language    With Boolean operators
Navigational (4)           65.00%              46.67%
Transactional (6)          63.33%              70.00%
Informational (10)         76.00%              78.00%

Bing (no. of queries)      Natural language    With Boolean operators
Navigational (4)           65.00%              70.00%
Transactional (6)          63.33%              63.33%
Informational (10)         86.00%              90.00%

Conclusions
The evaluation showed that Bing returned higher average precision overall. In most cases the use of Boolean operators in queries increased precision, with the exception of navigational queries using Google. The way I used the Boolean operators was informed by the success of the natural language results, in an attempt to improve precision.

One notable exception was query no. 13, finding videos of search engine tutorials. The information need was to find tutorials covering as many search engines as possible, but the natural language query returned videos on search engine strategies, which I judged irrelevant. I therefore amended the query to:

Search engine tutorials NOT "search engine strategies"

in an attempt to omit results about search engine strategies and increase precision for the use of search engines. The query did not appear to omit search engine strategies, and indeed produced lower precision. I am unsure why this was the case.

When using SQL queries on structured data stored in a relational database, the queries fall squarely into the 'informational need' category, but the information required will be discrete. In other words, the ambiguity of using search engines for informational searches on the web leads to a much deeper question: "how relevant is relevant?"

One thing I would like to mention about Bing, perhaps a much lesser-used search engine compared to the ubiquitous Google, is its use of pop-ups that show "more of this page", allowing more judgment to be made on the relevance of a result. Google tends to provide a longer abstract, which is truncated and makes it difficult to assess the relevance of a result without clicking through to the page itself.

Tuesday, 26 October 2010

Web 1.0 The Problem of Information Retrieval


“Users' needs must be the driver for the concepts, principles and development of new technology employed” (Morville and Rosenfeld 2007). This implies that users and their behaviour have driven the plethora of ways in which digital information is created, represented and retrieved from the Web. The various representations include documents marked up with languages such as HTML (see a simple example at http://www.student.city.ac.uk/~abjb823/DITA02/Index.html), JPEG images, PDF files, and even on-line databases, all referred to as 'content'. Accessing content and retrieving information should be a quick and painless exercise.

"The essential problem in information representation and retrieval (according to Chu 2003), remains how to obtain the right information for the right user at the right time, despite the existence of other variables (e.g. user characteristics database coverage)" and further the economic implications for information retrieval. In the context of a company employee, "What does it cost if every employee in your company spends an extra five minutes per day struggling to find answers on your intranet? (Morveille and Rosenfield 2007)
Focusing on two distinct ways users have ways to 'find answers', namely through retrieving information stored as data in relational databases (referred to as the 'relational model' Chu 2003) and information retrieved from web pages stored on the World Wide Web (Discussed further in my post: http://chrisbrookditablog.blogspot.com/2010/10/dita-session-2-internet-and-world-wide.html ), we can critically evaluate the two technologies and how the user can 'obtain the right information at the right time' (Chu 2003)

The Relational Model

In the relational data model, types of things relevant in the real world are referred to as 'entities' and are structured into two-dimensional tables (e.g. students at City University would be one entity, with one student per row and name, address etc. stored in columns; post-graduate schemes would be another entity). Each record is then indexed using an arbitrary, unique numerical 'key'.
The importance of indexing comes from the fact that records need to be identified as unique. The term primary key refers to the unique identifier for a given record in a selected table. This allows tables of "related" information to be linked, and records to be retrieved from more than one table to suit the needs of the user (e.g. students and post-graduate schemes are related entities, a one-to-one relationship). The relational database and relational data model provide a much more robust way to store and retrieve data than other formats stored on the web. (Chu 2003)
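As a rough illustration of these ideas (the table and field names are my own, not taken from the lab material), two related tables can be linked on a key and queried together; a minimal sketch using Python's built-in sqlite3 module:

import sqlite3

# A small sketch of the relational model: two related tables linked by a key.
# Table and field names are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scheme (
        scheme_id INTEGER PRIMARY KEY,   -- unique key for each scheme record
        title     TEXT
    );
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,  -- unique key for each student record
        name       TEXT,
        scheme_id  INTEGER REFERENCES scheme(scheme_id)  -- link to the related entity
    );
    INSERT INTO scheme  VALUES (1, 'Library Science'), (2, 'Information Science');
    INSERT INTO student VALUES (101, 'A. Student', 1), (102, 'B. Student', 2);
""")

# Retrieve records from more than one table by matching the keys.
query = """
    SELECT student.name, scheme.title
    FROM student JOIN scheme ON student.scheme_id = scheme.scheme_id
    WHERE scheme.title = 'Library Science'
"""
for row in con.execute(query):
    print(row)   # ('A. Student', 'Library Science')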

"Traditional Database Management Systems consist of two parts: sequential files and inverted files. The sequential file is the database source, containing information organised into records in the field-record-database structure. Called the sequential file because records are ordered in the sequence they take when they were entered into the database. The inverted file (or index file) provides access to the sequential file, according to given search queries. Called the inverted file, because the order in which the information is presented (i.e. access point first, locator's second) is the reverse of that in the sequential files (i.e locator's first, access point second.)" (Chu 2003)

Creating, updating and retrieving data held in a relational database management system is done through queries, most commonly in SQL or Structured Query Language. "SQL was one of the first languages for Edgar F. Codd's relational model in his influential 1970 paper, 'A Relational Model of Data for Large Shared Data Banks', and became the most widely used language for relational databases." (http://en.wikipedia.org/wiki/SQL Accessed 17-10-2010)

Using sequential tables and Database Management System (DBMS) interfaces to interact with the information through queries gives independence to entities, allowing users to retrieve information stored in one or more tables. Maintenance-wise, records in those tables can be independently updated without compromising data integrity. The key to the success of retrieving information from a relational database is ensuring the relationships between tables are understood by the user.
"Today, the relational model is the dominant data model as well as the foundation for the leading DBMS products, which include IBM’s DB2 family, Informix, Oracle, Sybase, Microsoft’s Access and SQLServer, as well as FoxBase and Paradox. RDBMS represent close to a multibillion-dollar industry alone." (http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010)
The advantages of a relational database (over other representations such as documents stored as web pages), according to wiki.answers.com, are:
• structural independence
• simplicity
• ad-hoc query capability
• easy to design
• security control
• non procedural access language

(http://wiki.answers.com/Q/What_are_the_benefits_of_relational_data_model Accessed 18-10-2010)

Critically evaluating database technology must also include the disadvantages or shortfalls of the relational model. A description from http://www.aspfree.com:
"There are limitations to the relational database management system. First, relational databases do not have enough storage area to handle data such as images, digital and audio/video. The system was originally created to handle the integration of media, traditional fielded data, and templates. Another limitation of the relational database is its inadequacy to operate with languages outside of SQL. After its original development, languages such as C++ and JavaScript were formed. However, relational databases do not work efficiently with these languages. A third limitation is the requirement that information must be in tables where relationships between entities are defined by values."
(http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010)

The use of databases to store information is only practical where data and information can be structured in tables and entities can be represented with discrete values. There must be consistency across the data set in terms of field values.

The World Wide Web

The millions of web pages accessible through an Internet connection can be referred to as ‘unstructured information’ (Macfarlane lecture notes 2010). Distinctly different from the relational model, this information is not organised within two-dimensional tables. Rather than using keys to link entities together in relationships, web pages are identified by uniform resource locators (URLs) and displayed using a web browser. Information retrieval (in this context) specifically describes retrieval of this unstructured information, and the “key concept is to bridge the gap between the writer (author) and the user (searcher) of information.” (Wodtke & Govella 2009)

Information seeking on the web is predominantly carried out using natural language. Citing Google Zeitgeist: "A recent look at the most popular searches at Yahoo! and Google showed that 80% of searches were one and two word queries. Those one or two words have to somehow be enough to turn up the object the user is looking for..." (Wodtke & Govella 2009)
Search engines utilise a process known as indexing, where chosen terms deemed 'relevant' to an individual document are extracted (or abstracted) and stored in a keyword file. Here the term 'relevant' is very subjective, but essentially "the indexer has to choose from related terms, narrower terms, or broader terms, listed in the thesaurus for representation purposes" (Chu 2003). In addition, the index can also include words with the same meaning (terms are often interchangeable), so synonyms are likely to improve relevance. Other indexing mechanisms include:

• Phonetic tools: Can expand a query on "Smith" to include "Smyth"
• Stemming tools: Allowing the user to enter a term (e.g. lodge) and retrieve documents that contain variant terms with the same stem (e.g. "lodging", "lodger")
• Natural language processing tools: Can examine the syntactic nature of a query - for example "how to" question or a "who is" question? - and use that knowledge to narrow retrieval
(Morville and Rosenfeld 2007)
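A toy sketch of the kind of query expansion these tools perform; the synonym and stem lists below are invented for illustration rather than drawn from a real thesaurus or stemmer.

# Toy query expansion: the term lists are invented for illustration.
PHONETIC = {"smith": {"smyth"}}                # phonetic-style expansion
STEMS    = {"lodge": {"lodging", "lodger"}}    # stemming-style expansion

def expand(term):
    """Return the query term plus any phonetic variants and stem variants."""
    term = term.lower()
    return {term} | PHONETIC.get(term, set()) | STEMS.get(term, set())

print(expand("lodge"))   # {'lodge', 'lodging', 'lodger'}
print(expand("Smith"))   # {'smith', 'smyth'}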

The first step in indexing is deciding what to index. 'Index' or 'keyword' files list each selected keyword in the text (excluding stop words such as: a, an, and, the, etc.), the number of documents containing that word, and a link from each record to a "postings file" or "inverted file". The inverted file (also described in the post on relational databases) contains fields including a unique document identifier, the frequency of the keyword and the physical location of that document on the disk.

Searching can use natural language only, or natural language in conjunction with the Boolean operators AND, OR and NOT (other mechanisms, e.g. radius from a certain postcode, can sometimes be utilised), but all are referred to as queries or query building. The purpose of building the query is to help return either the single most 'relevant' document or a large set of relevant documents, depending on the information need.
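A minimal sketch of the keyword index and postings described above, with a Boolean AND/OR query over it; the documents are invented, and a real engine would also store frequencies and positions.

# Tiny inverted file plus Boolean querying; the documents are invented examples.
docs = {
    1: "how hot air balloons work",
    2: "cheap flights to montevideo",
    3: "hot air balloon rides and cheap flights",
}
STOP_WORDS = {"a", "an", "and", "the", "to", "how"}

# Build the index: term -> set of document identifiers (the postings).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        if term not in STOP_WORDS:
            index.setdefault(term, set()).add(doc_id)

def search(query, operator="AND"):
    """Return documents matching all (AND) or any (OR) of the query terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings) if operator == "AND" else set.union(*postings)

print(search("hot air"))               # {1, 3}
print(search("cheap balloons", "OR"))  # {1, 2, 3}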

In 'known item' or 'fact retrieval' (Macfarlane 2010), similarly to SQL queries, the user knows the answer exists, inherent in knowledge gained elsewhere, but it needs to be supplemented with further information. Searching for your bank's home page so you can make an on-line bill payment is 'known item' retrieval, requiring high 'precision'.

A definition of precision taken from http://searchenginewatch.com/2156001 : "The degree in which a search engine lists documents matching a query. The more matching documents that are listed, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%."(http://searchenginewatch.com Accessed 22-10-2010)

Another possibility is researching information for an assignment on a topic one knows little about, as I am for the LIS module on Music and the Information Chain. I enter queries such as "Music AND Information Chain" in the hope of exploring the depth of the subject. I need to find as many documents that are 'relevant' as possible, thus requiring a high level of 'recall'. The process of searching is of course rather more complex, as users may search iteratively or, as Morville and Rosenfeld (2007) describe it, follow "the berry picking model, where users start with an information need and move iteratively through an information system along potentially complex paths, picking bits of information ('berries') along the way."

Definition from http://searchenginewatch.com/2156001 for recall:

"Related to precision, this is the degree in which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may only find 80 of them. It would then list these 80 and have a recall of 80%." (http://searchenginewatch.com accessed 22-10-2010)
The process of calculating precision and recall is referred to as 'relevance assessment'. The need for high precision or high recall is entirely subjective, based on the type of query and the needs of the user. An evaluation using relevance assessment, with 20 queries of varying types and two search engines (Google and Bing), was carried out during a lab session; the results are shown in my post Information Retrieval - Relevance Assessment, Bing vs Google.

Conclusions

The wholly subjective question of 'what is relevant' highlights the key difference between information retrieval on the web and from a database. In a database, the same query will provide the same results regardless of the user's information-seeking behaviour; the results are unambiguous.

Web search allows much more flexibility and access to information using natural language; with the limitations of a rigid structure removed, information seeking can become a learning exercise. The more searches users carry out, the more knowledge is gained about potential information sources for future searches. "Of all search systems, none has more testing usage and investment than Web-wide search tools have..." (Morville and Rosenfeld 2007). That said, the commercial interests of a search engine business have the power to return results in whatever way its designers see fit. Using databases and SQL removes this 'unknown' quantity; however, that is perhaps a discussion for elsewhere.

References

Chu, H. (2010). Information representation and retrieval in the digital age (2nd ed.). Medford, NJ.
Macfarlane, A. (2010). DITA course lecture notes.
Morville, P., & Rosenfeld, L. (2006). Information architecture for the World Wide Web (3rd ed.). Beijing; Farnham: O'Reilly.
Wodtke, C., & Govella, A. (2009). Information architecture: Blueprints for the Web (2nd ed.). Berkeley, CA: New Riders.
http://en.wikipedia.org/wiki/SQL, Accessed 17-10-2010
http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/, Accessed 18-10-2010
http://searchenginewatch.com Accessed 22-10-2010

Friday, 22 October 2010

DITA Session 04 - Information Retrieval and the Concept of Relevance

For the purposes of this post, some terminology: as Andy Macfarlane stated in the DITA session 04 lecture, 'retrieval' and 'search' are interchangeable or synonymous terms, as are the phrases "text retrieval" and "document representation" (Chu 2006).

The main learning outcomes from this session, and the topic of this post are to demonstrate an understanding of:
  • The three IR points of view
  • The concept of information needs
  • Information processes: indexing, searching and query modification
  • Evaluation in IR and the role of relevance assessment in this process
  • The difference between database systems and IR

From the user perspective, the purpose of information seeking, very broadly speaking, is simply the attempt to satisfy a need for information, or to fill a gap in knowledge. The conventional architect uses 'wayfinding strategies' to provide information to people passing through, say, an airport: where to go, where not to go, perhaps even using braille for the visually impaired; all of which satisfies the need to know where something is or where to go (paraphrasing Wodtke and Govella 2009). Web sites use similar navigational techniques, in the form of clickable hyperlinks, allowing users to browse and, hopefully, fulfil this information need. But what if you aren't quite sure what you are looking for and are looking for it in a hurry, or know exactly what you want but not where to find it? Given the vast amount of information now held on web servers this may well be the case, and you need to search for it. "Of all search systems, none has more testing usage and investment than web-wide search tools have..." (Morville and Rosenfeld 2007)

Information needs can be grouped into three broad groups, or 'a taxonomy for web search' (Broder 2002). The need is translated into natural language and a request for that information is made by typing it into a search engine as a 'query' (Macfarlane Lecture Notes 2010):
  • Navigational queries: To find a specific home page of an organisation
  • Transactional queries: Searching to find a service such as buying a cheap flight or buying a book from an on-line book shop
  • Informational queries: To search to fulfil a lack of knowledge on a subject, such as how to deter slugs from coming into your vegetable garden
Navigational queries can be referred to as 'known item retrieval'. In terms of the user's need, the query should return the most relevant web page or pages at the top, pointing directly to the home page or organisation they needed to find; thus a high level of 'precision' is required of the search engine.

A definition of precision taken from http://searchenginewatch.com/2156001 : "The degree in which a search engine lists documents matching a query. The more matching documents that are listed, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%." (http://searchenginewatch.com accessed 22-10-2010)

Another possibility is that users may be researching for references to use in an assignment, as I am for the LIS module on Music and the Information Chain. I enter queries such as:

"Music AND Information Chain" or "The information communication chain applied to Music'

As I am researching a topic I know little about, I need to find as many documents that are 'relevant' as possible; thus I would like a high level of recall.

Definition from http://searchenginewatch.com/2156001 for recall:

"Related to precision, this is the degree in which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may only find 80 of them. It would then list these 80 and have a recall of 80%." (http://searchenginewatch.com accessed 22-10-2010)

This is one side of the author-user relationship mentioned in the last post. The author(s) of the information, or source, would presumably want someone to look at, use or interact with the content they have created and hosted on a web server; as discussed in earlier posts, the content created takes the form of a digital representation of information.

The third point of view, marrying the source to the user, is the 'system' view of IR, which sits in the middle: the nuts and bolts of the IR process manifested as software, i.e. the search engine. Morville and Rosenfeld (2007) try to be as succinct as possible when describing the anatomy of search: "there is a lot going on here"!

Search engines utilise a process known as indexing, where chosen terms deemed 'relevant' to an individual document are extracted (or abstracted) and stored in a keyword file. Here the term 'relevant' is very subjective, but essentially "the indexer has to choose from related terms, narrower terms, or broader terms, listed in the thesaurus for representation purposes" (Chu 2003). In addition, the index can also include words with the same meaning (as stated in the introduction to this post, terms are often interchangeable), so synonyms are likely to improve relevance. Other indexing mechanisms include:
  • Phonetic tools: ...they can expand a query on "Smith" to include "Smyth"
  • Stemming tools: Allowing the user to enter a term (e.g. lodge) and retrieve documents that contain variant terms with the same stem (e.g. "lodging", "lodger")
  • Natural language processing tools: These can examine the syntactic nature of a query - for example "how to" question or a "who is" question? - and use that knowledge to narrow retrieval
Morville and Rosenfeld (2007)

The origins of information retrieval and search evaluation lie in work carried out in 1953 at Cranfield in the UK and by the Armed Services Technical Information Agency in the USA. The evaluation of IR used the Uniterm system devised by Mortimer Taube, whereby documents were indexed by single terms taken from titles and abstracts. (Ellis 1996)

The first step in indexing is deciding what to index; fields such as author, title and date are potentially useful for IR. The 'index' or 'keyword' files list each selected keyword, the number of documents containing that word, and a link from each record to a "postings file" or "inverted file". The inverted file (also described in the post on relational databases) contains fields including a unique document identifier, document statistics such as word frequency in a document, position information for words in a document, and the physical location of that document. These are all sequentially recorded and used by a search engine to help retrieve documents when matching keywords are entered.

Searching can use natural language alone, natural language and Boolean operators (AND, OR, NOT) together, or other approaches such as defining a radius from a certain postcode, or from and to dates for chronological search; essentially all are referred to as queries or query building.

This highlights one key concept of IR: language. Natural language is the language people speak and write; in natural language, no effort is made in IR to limit or define vocabulary, syntax, semantics, or the inter-relationships among terms. Additionally, when combined with 'advanced search options' such as specifying metadata parameters like file type, or searching within a specific subset of information, we are query building to enable the most relevant documents to be returned.

In the DITA Session 04 practical, assuming the role of the user or "information seeker", I was given 20 queries with which to evaluate the relevance of the results returned, using both natural language requests and natural language combined with Boolean logic operators (Macfarlane 2010 lecture notes). The exercise was twofold: to classify the different types of information-seeking needs or 'query types' and, more importantly, to evaluate the results produced by two search engines, www.google.co.uk and www.bing.com. The practical lesson was invaluable for putting the theories of information-seeking behaviour into practice, as the whole concept of IR is based upon the user's needs and behaviour.

This is referred to as 'relevance assessment' and I shall post the queries and results of the assessment in the next post. The methodology for relevance assessment is based on the top 5 results shown for each set of results returned (thus the denominator is 5 when making the calculation), allowing a percentage of either precision or recall to be determined, depending on the type of query. The need for high precision or high recall is entirely subjective, based on the type of query and the needs of the user.

The results from the relevance assessment, compared with the queries done in session 03 (where SQL queries returned results from structured data in a database), are subjectively relevant to the user, whereas results using data retrieval from the database are 'clear and unambiguous': the same query will provide the same results regardless of the user's need for that information.

The learning outcomes from DITA sessions 02, 03 and 04, expanded on in the blog posts so far, will form the basis for the first part of my assignment. The focus will be on the two approaches - data retrieval and information retrieval - and how and why each is used appropriately to professionally represent digital information in the context of Web 1.0 technology.

Monday, 18 October 2010

DITA Session 04 Information Retrieval (Notes from IA Blueprints for the web Wodtke and Austin Govella 2009)

In the session on Information Retrieval the main focus was employing and evaluating various techniques for the retrieval of unstructured information (as opposed to structured information in databases).

The book Information Architecture: Blueprints for the Web (Wodtke & Govella 2009) provides some useful concepts, which I have added here. The book is on short loan, so I will use the references collected here in my assignment, particularly where the book refers to information retrieval in the context of web sites.

Key concept is to bridge the gap between the writer (author) and the user (searcher) of information.

In relation to designing good websites, chapter 2: "Who is this site for"; "discover who the target user is"

Metadata: "Metadata is the basis for all organisation systems....It is the brick of the IA house and can be arranged into a wide variety of retrieval systems. depending on what you need. Information can come in many forms - an article, a e-book, a photograph, or a catalog. Some information isn't made of words - for example a Flash movie, a sound in MP3 format, or a photograph. When there are very few words inherent in the information, as with photos and music, metadata helps find it."

Three types of metadata:

  • Intrinsic. Metadata about the thing's composition. Is it an MS Word document, a JPEG, a 20kb file, or a zip file?
  • Administrative. Metadata about the way the thing will be handled. Is it a temporary thing, or does it need to be archived? Has it been approved for publication?
  • Descriptive. Metadata about the nature of the thing. This is the most important thing for [our] purposes and the most commonly used on the Web. Is it fiction or fact? Is it an article? What's the subject? What are related subjects?  
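A rough sketch of how these three kinds of metadata might be recorded for a single item; the fields are my own illustration, not a standard schema.

# Illustrative only: not a standard metadata schema.
mp3_metadata = {
    "intrinsic": {              # about the thing's composition
        "format": "MP3",
        "size_kb": 4200,
    },
    "administrative": {         # about how the thing will be handled
        "archive": True,
        "approved_for_publication": False,
    },
    "descriptive": {            # about the nature of the thing
        "title": "Ukulele study no. 1",
        "subjects": ["music", "ukulele"],
    },
}
print(mp3_metadata["descriptive"]["subjects"])   # ['music', 'ukulele']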
Google Zeitgeist: "A recent look at the most popular searches at Yahoo! and Google showed that 80% of searches were one and two word queries. Those one or two words have to somehow be enough to turn up the object the user is looking for, and effective metadata is a good way to stretch that word faaaaarther"

"Beyond he inner workings for search, there is a lot to be done in what is know as the "presentation layer" after recall and precision have hopefully been sorted out.

Search must be fast: "Major search engines make sure they have enough computing power to work through their index and get results swiftly. And brag about it: results 1-10 of about 62,700,000 (0.09 seconds)."

Results must load quickly

Results must be scannable. Eye-tracking studies (page 105) were done on three major search engines in 2006 by Enquiro (see http://www.enquiroresearch.com/). Google had the smallest and tightest "Golden Triangle" - users were able to quickly select a result worth clicking without having to scan more of the page.

Link to Enquiro's research, which shows Google's "Golden Triangle":

http://www.enquiro.com/enquiro-develops-googles-golden-triangle.php

My next post will present the results of the DITA Session 04 practical on IR, using a quantitative evaluation of precision, i.e. relevant documents retrieved / total documents retrieved. By entering 20 searches, using natural language and Boolean operators, I will compare the precision of the top 5 results returned by Google and Bing and draw some conclusions.

Further, new developments in search engine technology such as Google Suggest, Yahoo! Search Assist, vertical search (searching a sub-set of web pages) and best bets would be another topic I wish to cover in other posts.

One limitation described by Wodtke and Govella (2009), when using large search engines like Google and Yahoo! as plug-ins for search on individual sites, is that "[they] are never going to know your business the way you know your business. They don't have your log files, your user testing results, your internal metadata, and your good old-fashioned know-how." Plus, with the rise of open source search tools such as Lucene, ht://dig, SWISH-E, Solr, Ferret, and many more, as well as customisable search services such as Yahoo! BOSS, the basic search problem is much easier to get right with fewer resources. (Wodtke and Govella 2009)

For the next post I will try to summarise the learning outcomes of DITA session 04, present the results from the practical on evaluation using precision, and attempt to make some comparisons between the types of search used here on unstructured information (e.g. the Web) and those of session 03 on searching structured information held in databases using SQL.

DITA Session 03 More on querying information stored in databases

Further reading into the technologies employed to structure and query information in databases is essential to enable a critical knowledge of the nature and constraints of digital information.
SQL is described by http://www.itl.nist.gov/div897/ctg/dm/sql_info.html:

SQL is a popular relational database language first standardized in 1986 by the American National Standards Institute (ANSI). Since then, it has been formally adopted as an International Standard by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). It has also been adopted as a Federal Information Processing Standard (FIPS) for the U.S. federal government. (http://www.itl.nist.gov/div897/ctg/dm/sql_info.html accessed 18-10-2010)

My learning outcome from this session was that SQL must be used in order to query databases, and that it requires knowledge of exact syntax and structure in order to return exactly what the user wants. This implies that information must be entered into the tables of a database with no ambiguity, and similarly that queries must be constructed exactly to return the required data.

Benefits of RDBMSs
In trying to research the benefits of the relational data model, I asked the question using wiki.answers.com (another type of query, covered under information retrieval in the next blog post).

Answer:
  • structural independence
  • simplicity
  • ad-hoc query capability
  • easy to design
  • security control
  • non procedural access language 

 (http://wiki.answers.com/Q/What_are_the_benefits_of_relational_data_model Accessed 18-10-2010)


Database Management Systems such as Microsoft Access use a front end to enable a user with little or no knowledge of SQL to construct queries using wizards and query builders, which add features such as data validation to ensure data is entered correctly or queries are constructed correctly. This provides simplicity, easy design and a non-procedural access language.

Other RDBMSs used in commercially sensitive environments must also have the capability to hold data securely, allowing selective access to certain levels of information or restricting permissions for certain users. This ensures the security of the data by only allowing 'power users' to edit information and 'administrators' to control permissions and database maintenance tasks.

Other advantages over the file approach: "Relational databases offer more robust reporting with report generators that filter and display selected fields. Relational databases offer the capability to build your own reporting modules. Most relational databases also offer the capability to import and export data from other software." (http://www.databasedev.co.uk/flatfile-vs-rdbms.html Accessed 18-10-2010)

Limitations of RDBMSs

There are several limitations to the relational model for databases; a very useful resource (http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010) provides a good few examples:


"There are limitations to the relational database management system. First, relational databases do not have enough storage area to handle data such as images, digital and audio/video. The system was originally created to handle the integration of media, traditional fielded data, and templates. Another limitation of the relational database is its inadequacy to operate with languages outside of SQL. After its original development, languages such as C++ and JavaScript were formed. However, relational databases do not work efficiently with these languages. A third limitation is the requirement that information must be in tables where relationships between entities are defined by values."

I understand from my work in the practical and my professional work in document management that the relationships between entities are the key to a successful database.

"Today, the relational model is the dominant data model as well as the foundation for the leading DBMS products, which include IBM’s DB2 family, Informix, Oracle, Sybase, Microsoft’s Access and SQLServer, as well as FoxBase and Paradox. RDBMS represent close to a multibillion-dollar industry alone." (http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/ Accessed 18-10-2010)

Saturday, 16 October 2010

DITA Session 03 Structuring and querying information stored in databases 11-10-2010

Moving on from the principles of digital data and information stored on file servers and web servers and linked using URLs: the limitations of storing files in folders become apparent with large amounts of data and information. Without some form of relationship (other than links to individual documents, e.g. web pages) in a filing structure, it becomes very easy for information across a large organisation, such as a university, to become out of date and inconsistent across departments. The information is duplicated, but can easily become inconsistent in both content and format; e.g. a finance department may set up one file or filing structure to store employee information and HR another, whereas with the relational data model they could share the same information (for instance employee details), maintaining data integrity.

The relational database and relational data model provide a much more robust way to store and retrieve data and information - which includes both data and documents, for example electronic journals in PDF format - in two-dimensional tables.

Traditional Database Management Systems "consist of two parts: sequential files and inverted files. The sequential file is the database source, containing information organised into records in the field-record-database structure. It is called the sequential file because records are ordered in the sequence they take when they were entered into the database. The inverted file (or index file) provides access to the sequential file, according to given search queries. It is called the inverted file because the order in which the information is presented (i.e. access point first, locators second) is the reverse of that in the sequential files (i.e. locators first, access point second)." (Chu 2003)

The importance of indexing comes from the fact that records need to be identified as unique; the term primary key refers to the unique identifier for a given record in a selected table. This allows tables of "related" information to be linked, and records to be retrieved from more than one table to suit the needs of the user.

When planning a database structure, the fields used in tables must be split into "entities". An entity refers to a 'thing' that has its own set of attributes or fields; thus each entity must represent a single thing (Butterworth and Macfarlane lecture notes 2010). Providing there is a common field in two or more tables - the 'primary key' - the user has the ability to query and return results from more than one table by matching these keys. Relationships, or "joins", are defined between two or more tables. The idea is to minimise redundant data, and to allow records to be added, deleted or amended in one table without affecting any other. The process of ensuring that a table contains only fields that represent one entity is called normalisation. The WhatIs.com definition goes on to say that "normalisation may have the effect of duplicating data within the database and often results in the creation of additional tables. While normalization tends to increase the duplication of data, it does not introduce redundancy, which is unnecessary duplication." (http://searchsqlserver.techtarget.com/definition/normalization Accessed 17-10-2010)
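A rough before-and-after sketch of normalisation using Python's sqlite3 module (the table and field names are invented): the repeated department details in a flat table are split out into their own entity and replaced by a key.

import sqlite3

# Illustrative normalisation; table and field names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- Unnormalised: department details are repeated on every employee row,
    -- so a change to one department must be made in many places.
    CREATE TABLE employee_flat (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        dept_name   TEXT,
        dept_phone  TEXT
    );

    -- Normalised: each entity has its own table; the employee row holds
    -- only the key of the related department record.
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT,
        phone   TEXT
    );
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        dept_id     INTEGER REFERENCES department(dept_id)
    );
""")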

Creating, updating and querying a relational database management system is done through queries, most commonly in SQL or Structured Query Language. "SQL was one of the first languages for Edgar F. Codd's relational model in his influential 1970 paper, 'A Relational Model of Data for Large Shared Data Banks', and became the most widely used language for relational databases." (http://en.wikipedia.org/wiki/SQL Accessed 17-10-2010)

The next post will show some SQL examples, using simple statements such as CREATE TABLE and SELECT to select specific records to display, and the WHERE clause to filter only the data the user is interested in from one or more tables.



Friday, 15 October 2010

DITA The Evolution of the Web

To introduce the concepts of Web 1.0 I found this website: http://www.marcuscake.com/wp-content/uploads/economic_development_web1_to_web4.gif (Accessed 15-10-2010). In some ways the evolution of the web - past, present and future - is encapsulated in that figure.
The first assignment for DITA is an introduction to digital data and information and the inception of the internet and WWW. Marcus Cake describes the early stages of the internet as they relate to the economic shift in developed countries, moving from the primary and secondary sectors (mining, agriculture, manufacturing) to the tertiary sector of professional services.

"In the 1990s, the internet provided organisations with a means to distribute basic information to their communities of interest and undertake basic transactions. New organisational structures were not possible. The complementary technologies were not fully evolved and global internet usage by the global community approximately 10%."


Although Marcus Cake is perhaps a financial information scientist (I am not sure of the motivation of his site), this provides excellent insight into some key concepts of the Web 1.0 "channel", which I will build upon in the wider sense of the digital representation of information. Further, the diagram above, charting the evolution towards Web 2.0, Web 3.0 and Web 4.0, will be the subject of further blog posts. It serves as a good piece of background to the Web's current and future direction.

DITA More Cascading Style Sheets

I have been experimenting and reading more on Cascading Style Sheets (CSS), using http://www.w3schools.com/css/css_intro.asp (Accessed 13-10-2010), which has an excellent tutorial.

W3Schools also states that:
"When tags like <font>, and color attributes were added to the HTML 3.2 specification, it started a nightmare for web developers. Development of large web sites, where fonts and color information were added to every single page, became a long and expensive process.
To solve this problem, the World Wide Web Consortium (W3C) created CSS.
In HTML 4.0, all formatting could be removed from the HTML document, and stored in a separate CSS file." http://www.w3schools.com/css/css_intro.asp (Accessed 13-10-2010)

This allows much more flexibility in web page design, as user-defined selectors called "id" and "class" can be given defined styles and used in the HTML code to apply style to any element.

Monday, 11 October 2010

DITA Session 02 More World Wide Web 11-10-2010

Further study of HTML tags which define styles, or attributes; for example, this code taken from the W3Schools website (Accessed 11-10-2010):

<html>

<body style="background-color:yellow">
<h2 style="background-color:red">This is a heading</h2>
<p style="background-color:green">This is a paragraph.</p>
</body>

</html>

If this were to be applied across more than one page, or to a column in a table, it would need to be written in every document. A Cascading Style Sheet is written in a similar form, or syntax, to HTML and defines styles such as font colour, font weight, column widths, background colours etc. in one file or several files. These are then referenced via a URL in the web page. This allows the style sheet to be referenced on all pages of a website, making styles consistent and making it easier to maintain or update a whole site. The web browser interprets the CSS file, saves it locally, and then applies the presentation rules it contains to web documents.

The three components of style sheet syntax: 1. selector, 2. property, 3. value

e.g. define level 1 and 2 headings to be shown centered and coloured blue:

H1, H2 {text-align: center;  color: blue}

Many "selectors" can be listed (comma separated) before the declarations of style.

I am using my website as a tool for trying out HTML and applying style sheets. The outcomes of the practical work carried out in session 02, including using HTML tags and CSS, can be seen by following the link to my personal website on the City University web server: http://www.student.city.ac.uk/~abjb823/DITA02/Index.html

Aside from the practical exercises in the course notes, I will use two online HTML tutorials to help with HTML and style sheets, W3Schools and Jalfrezi; links to these references are:
http://www.w3schools.com/html/
http://vzone.virgin.net/sizzling.jalfrezi/iniframe.htm


My next post will be an attempt to summarise the concept of the document-centred view of information in the context of using the technologies of the Internet, web browsers, URLs and associated protocols, HTML and CSS to represent digital information. I shall provide links to the practical work on my website. This will form background for the first part of the Blog Post 1 assessment on Web 1.0.

Friday, 8 October 2010

DITA Session 02 The Internet and World Wide Web 03-10-2010

Session 02 Aims and Outcomes were:
  • Briefly describe both the internet and WWW, the concepts which define them and the differences between them
  • Describe and where necessary interpret the Hypertext, Data Mark-up and HTML concepts
  • Generate a simple HTML document and publish to City University Web Server
Definitions and history of the 'Internet': a worldwide network of networks allowing computers remote from each other to share information. Wikipedia: "The Internet is a global system of interconnected computer networks that use the standard Internet Protocol Suite (TCP/IP) to serve billions of users worldwide." (Accessed 08-10-2010) It is based on a 1960s design for electronic communications proposed by the American military. It now has a profound effect on how we use and manage information and has resulted in the World Wide Web.

A further, more detailed definition: "The Internet, sometimes called simply 'the Net,' is a worldwide system of computer networks - a network of networks in which users at any one computer can, if they have permission, get information from any other computer (and sometimes talk directly to users at other computers). It was conceived by the Advanced Research Projects Agency (ARPA) of the U.S. government in 1969 and was first known as the ARPANet. The original aim was to create a network that would allow users of a research computer at one university to be able to 'talk to' research computers at other universities. A side benefit of ARPANet's design was that, because messages could be routed or rerouted in more than one direction, the network could continue to function even if parts of it were destroyed in the event of a military attack or other disaster." (http://searchwindevelopment.techtarget.com/definition/Internet Accessed 08-10-2010)

"TCP/IP is actually a suite of protocols. TCP or Transmission Control Protocol ensures your data will get to and from its destination. IP or Internet protocol ensures the proper fastest route will be taken."

(Read more: http://www.brighthub.com/computing/windows-platform/articles/20609.aspx#ixzz11nsusiGy Accessed 08-10-2010)

Basically, the internet is the road the WWW travels on: the infrastructure. The Web's invention is credited to Tim Berners-Lee, an MIT professor who, "On 25 December 1990, with the help of Robert Cailliau and a young student at CERN, [he] implemented the first successful communication between an HTTP client and server via the Internet." (Wikipedia accessed 08-10-2010)

He is director of the World Wide Web Consortium (W3C), overseeing the continued development of the Web.

[Image of Sir Tim Berners-Lee. Source: http://www.google.co.uk/images - N.B. further reading on copyright!]



World Wide Web: "invented as an internet service that allowed hyperlinked academic documents (web pages to be viewed remotely.  "The most widely used part of the Internet is the World Wide Web (often abbreviated "WWW" or called "the Web"). Its outstanding feature is hypertext, a method of instant cross-referencing." (http://searchwindevelopment.techtarget.com/definition/Internet Accessed 08-10-2010) basically the WWW is one tool, using web browsers as a user interface to display web pages, that utilitizes the internet. Other examples of services using the internet email, ftp and telnet (for accessing applications on remote servers as if they were on your own computer).


The WWW works on a client-server architecture, where web servers (powerful, high-speed computers) constantly listen for messages requesting web pages from client computers, which are armed with a web browser to display the information returned. The client requests a page according to its Uniform Resource Locator (URL), which breaks down into: protocol (http, ftp, etc.)://server type (www).domain name (city).domain (ac.uk)/directory (conditions)/sub-directory (conditionsofuse)/file type (html), etc.
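Python's standard library can show the same breakdown; the URL below is just an example built from the parts mentioned above.

from urllib.parse import urlparse

# Breaking an example URL into its component parts.
parts = urlparse("http://www.city.ac.uk/conditions/conditionsofuse.html")
print(parts.scheme)   # protocol: 'http'
print(parts.netloc)   # server type and domain name: 'www.city.ac.uk'
print(parts.path)     # directory and file: '/conditions/conditionsofuse.html'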

Navigation between web pages is possible through the use of HTML (HyperText Markup Language), a markup system that also allows linking between documents (e.g. PDFs). Links are embedded into web pages using markup elements that contain URLs. Users are unaware of where documents are stored, i.e. on which web servers, but will be presented with documents via a suitable browser which interprets the HTML. NB: pages should be saved with the file extension .htm or .html.
We call this the "document-centred view of information" (Macfarlane lecture notes 2010).
Other markup languages such as XML allow web services to be created (paraphrased).

Text, images and multimedia files are just some of the elements making up web pages, i.e. the 'content', with added styles and semantics (i.e. the meaning of the data) applied using tags.

E.g. anchor tags <a> </a> denote a hyperlink to a URL. <a href="http://www.bbc.co.uk/">BBC Website</a> shows the text "BBC Website" as clickable and underlined, and will open the BBC home page on click.

Attributes can also be added to tags; e.g. when adding an image with <img src="pic.jpeg" width="300" height="140">, we can specify the size of the image displayed.

Adding images and files from remote locations using URLs allows the "global document centered view of information" (Macfarlane lecture notes 2010), with implications for copyright law, which will be a topic for supplemental reading and appear in the next post; e.g. am I infringing copyright law by displaying the image of Sir Tim Berners-Lee above on this blog?

Practical work in creating and modifying web pages, adding images using HTML, and uploading to the university web server made up the exercises carried out in the practical. I shall supplement this with work adding content and using HTML tags on my personal website.

I will explore Cascading Style Sheets (CSS) and the syntax used to create rules for the display and look of the web pages in my website.

A useful how-to guide on HTML, including CSS, is provided by W3Schools:
http://www.w3schools.com/html/html_intro.asp

Saturday, 2 October 2010

DITA Session 01 Introduction to computing 27-10-2010 Supplemental

Following on from DITA session 01 and the set-up of the blog, I undertook further reading on the basics of information architecture, the use of new technology, information needs and "information seeking behaviours" (Morville and Rosenfeld 2006).

What is information architecture (Morville and Rosenfeld 2006)?
  • The structural design of shared information environments
  • The combination of organisation, labelling, search and navigation systems within websites and intranets
  • The art and science of shaping information products and experiences to support usability and findability
  • An emerging discipline and community of practice focused on bringing principles of design and architecture to the digital landscape
Information overload in modern society has led to the emergence of a discipline concerned with structuring, organising and labelling information so that end users are able to manage and find what they are concerned with; to be distinguished as a discipline from data or knowledge management, it can be classed as both an art and a science (Morville and Rosenfeld 2006). Furthermore, it sits in something of a grey area with other related disciplines, all perhaps involved in some way with designing, structuring, labelling, using and finding information, e.g. graphic design, interaction design, usability engineering, experience design, software development, enterprise architecture, content management and knowledge management (Morville and Rosenfeld 2006).

Two quite comprehensive definitions of what information architecture is:

http://searchstorage.techtarget.com/sDefinition/0,,sid5_gci509934,00.html
(Moon, John. Definition of information architecture; searchstorage.techtarget.com 09 Nov 2003 accessed 02-10-2010)

Users' needs must be the driver for the concepts, principles and development of new technology employed by information architects. The very justification for the discipline must come from the reasons why information architects are employable. Morville and Rosenfeld write: "What does it cost if every employee in your company spends an extra five minutes per day struggling to find answers on your intranet? What is the cost of frustrating your customers with a poorly organised website?" The value to education and brand, set against the cost of construction and maintenance [of the website or intranet] and the cost of training its users, are all opportunities and constraints for information architecture work.
Users need to find, access, use and interpret information in a myriad of ways. The formats of information found on the internet range from basic text (ASCII format), figures and data through to marked-up documents (using markup languages such as HTML, XML etc.), PDF files and databases, all referred to as 'content'.

Users, content, context (Morville and Rosenfeld 2006):
"Good information architecture design is informed by all three areas." These are the well-known three circles of information architecture. (Morville and Rosenfeld 2006, p.25)

An article found at http://semanticstudios.com/publications/semantics/000149.php (accessed 02-10-2010) shows the three circles of Information Architecture 3.0, including the concept of 'community' rather than the 'users' defined by Morville and Rosenfeld.


Information seeking behaviours: the motivation of the community or users and its needs.
"The too-simple information model. Modelling users' needs and behaviours forces us to ask useful questions about what kind of information the user wants, how much information is enough, and how the user actually interacts with the architecture." (Morville and Rosenfeld 2006)

"Searching, browsing and are all methods for finding, and are the basic building blocks of information seeking behaviour"  (Moreville and Rosenberg 2006)

Technologies are employed to facilitate the storage, management, search, retrieval, organisation and interpretation of information found on the World Wide Web. Morville and Rosenfeld (2006) state that "much IA work is centred on making large scale applications work as advertised" and go on to make reference to the specialisations information architects find themselves centring on: content management systems, search engines and portals. (Morville and Rosenfeld 2006)

The next blog post will look at what the WWW and the internet are, and will be based on the DITA session 02 aims and outcomes, including an introduction to "technologies that enable us to share digital information between computers at remote locations" (Andy Macfarlane 2010, course aims on Moodle).

  • Morville, P., & Rosenfeld, L. (2006). Information architecture for the World Wide Web (3rd ed.). Sebastopol, CA: O'Reilly