Sunday, 12 December 2010

Web 2.0 and Web 3.0 - The Semantic Web


Introduction
Before I began the DITA course and heard the term Web 2.0, it implied to me that a new version of the web had been released on a specific date. Since reading Tim O’Reilly’s second attempt at a definition of Web 2.0 (http://radar.oreilly.com/2006/12/web-20-compact-definition-tryi.html Accessed 26-11-2010) and writing my article: http://chrisbrookditablog.blogspot.com/2010/11/dita-session-05-introducing-web-20-and.html (Accessed 13-11-2010), I’ve discovered that Web 2.0 is a genre of websites and technologies which open up the web for creating, publishing, sharing, recycling, re-using and re-arranging information. Instigated by increases in geographical broadband coverage and bandwidth, and by reduced costs, the web is now being exploited by ‘a set of social, architectural, and design patterns’ (websites) which, say Governor et al (2009), is resulting in a mass migration of business to the internet as a platform.
A definition of Web 3.0 is even more ambiguous, so I would rather refer to ‘The Semantic Web’. A common thread running through the digital architectures and technologies being developed under the labels of Web 2.0 and the Semantic Web, as in my previous blog entry, is the ‘unstructured vs structured’ representation of information.


Publishing using Wikis
Governor et al (2009) describe ‘concepts’ which are features of many [Web 2.0] applications and services found today. They are defined as patterns such as ‘Participation – Collaboration’, ‘Collaborative Tagging’, ‘Software as a Service’, ‘Mashup’, and ‘Structured Information’.
Wikipedia (http://en.wikipedia.org Accessed 13-11-2010) exemplifies a Web 2.0 website, and has become a household name as an on-line encyclopedia. Although new articles were not being published as often in 2010 (http://www.time.com/time/magazine/article/0,9171,1924492,00.html#ixzz17AB03UyU Accessed 02-12-2010), editors and readers, who also have the ‘power to edit’, collaborate on tagging and discussing articles, adding further sub-categories and helping editors delete pages no longer of relevance. Here I see the network effect: the more users collaborate, the more the quality and accuracy of articles improve. “People all over the world who are interested in a certain topic can collaborate asynchronously to create a living, breathing work.” (Governor et al 2009, p. 51) However, in my opinion Wikipedia articles are still pieces of unstructured information, i.e. not arranged and governed by the strict rules of a relational database.

Wikipedia utilizes XHTML to publish content; an example of the Wikipedia Main Page XHTML source code is here: http://en.wikipedia.org/wiki/File:Wikipedia_main_page_XHTML_source.png
XHTML is recommended by W3C as a mark-up language based on HTML and compatible with XML (http://chrisbrookditablog.blogspot.com/2010/11/extensible-markup-language-xml.html Accessed 12-12-2010):
 “XHTML consists of all the elements in HTML 4.01, combined with the strict syntax of XML. Today's market consists of different browser technologies, some browsers run on computers, and some browsers run on mobile phones or other small devices. The last-mentioned do not have the resources or power to interpret a "bad" markup language.”
This technology gave Wikipedia the ability to become even more accessible in 2008 with the launch of the mobile version of the site. Optimised for mobile browsers, non-essential elements of the page, including sidebars and headers, are stripped out, and sub-sections of articles are collapsible. Cross-device compatibility is one of the key components of Web 2.0 applications.
Bespoke ‘wikis’ offer a way to publish, disseminate and collaborate within a bounded set of users. In my professional experience I helped publish a user manual for an Electronic Document and Records Management System (EDRMS) using wiki technology while working as an information specialist on the design team of a new metro project. The purpose was to educate the design team in how to use the EDRMS for the project, and to disseminate new procedures as and when needed. It would have been difficult and time-consuming to produce a traditional manual covering all the differing levels of expertise and needs of the users. For example, a Project Manager needing to search for a single PDF document does not need to know the features a CAD operator needs, e.g. creating metadata and referencing sets of model files to create quality-assurance-compliant CAD drawings.
A ‘bare bones’ manual was created which relied on the participation of the design team members to ‘flesh out’ content. All had the common interest of making the project a success, and initially users’ contributions added a high degree of value by harnessing their collective knowledge. However, after the initial setting-up, contributions to the wiki dried up. “The shift from a top-down editorial approach to a bottom-up approach is a painful reversal for people who expect only expert advice when they look up something.” (Governor et al 2009, p. 51)
 
Identifying technologies for use in the Information Sciences

Wikipedia is a published set of linked resources, organized visually through a template; through the collaboration of the web community it has built up a vast network of articles and subjects. Articles are arranged by subject and interlinked through hyperlinks; categories are organized into a hierarchy (interestingly, Web 3.0 is a category under the subject Web 2.0). If data relating to subjects and categories were marked up with XML rather than just XHTML, with metadata applied to give context, and were referenced using unique identifiers that explicitly described meaning, Wikipedia could become a much more powerful resource for information scientists. The W3C has recommended the technologies and standards needed to facilitate the idea of semantically describing information as ‘subjects’, ‘objects’ and relationships or ‘predicates’ through the Resource Description Framework (RDF), expressed as RDF triples.
“URIs can identify anything as a resource, the subject of an RDF statement can be a resource, and predicates in RDF statements are always resources. Because URIs uniquely identify resources (things in the real world) they are considered ‘strong identifiers’. There is no ambiguity about what they represent, and they always represent the same thing, regardless of the context we find them in.” (Segaran et al 2009, p. 66) Thus, by organizing the important facts that have been added, in a rather ad-hoc fashion, to subjects in Wikipedia using RDF statements containing URIrefs built from standard terms (an example of which is provided by the Dublin Core Metadata Initiative, http://dublincore.org/ Accessed 02-12-2010), the information becomes highly re-usable.
RDF allows properties to be invented independently, regardless of the subject domain. It can be converted to XML, and the simple triple structure makes it easy to identify a resource (Chowdhury 2007, p. 203). Graphically representing RDF statements is a powerful way to visualize the interlinking of objects: simple triples can be aggregated into complex webs of relationships using simple nodes and connectors. I have drawn a simple RDF graph of my music blog, describing the triples for three objects that ‘belong’ to my music blog: http://chrisbrookditablog.blogspot.com/2010/11/semantic-web-technologies-resource.html (Accessed 26-11-2010)
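As a rough illustration of how such triples might look in code, here is a minimal sketch in plain Python (no RDF library); the URIs and the `belongsTo` predicate are hypothetical stand-ins for my music blog and one of its posts:

```python
# Model RDF statements as (subject, predicate, object) tuples.
# All URIs below are hypothetical, for illustration only.
BLOG = "http://example.org/musicblog"
DC = "http://purl.org/dc/elements/1.1/"

triples = {
    (BLOG, DC + "creator", "Chris"),
    (BLOG + "#post1", DC + "title", "Album review"),
    (BLOG + "#post1", "http://example.org/vocab#belongsTo", BLOG),
}

# Because subjects and predicates are URIs (strong identifiers),
# statements about the same resource can be collected unambiguously.
post_facts = {p: o for (s, p, o) in triples if s == BLOG + "#post1"}
```

Each tuple here corresponds to one arc in an RDF graph, with the shared URI acting as the node that connects them.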
Wikipedia exemplifies the Web 2.0 principle of collaborative tagging, which could be described as a user-driven taxonomy of the world’s knowledge. “Web 2.0 is an informal flexible way of integrating disparate web services. It requires less dependence on shared vocabularies and provides workable rather than totally perfect solutions.” (Burke 2009) For information scientists, this gives rise to ambiguity in describing web resources. The Semantic Web standards and technologies being championed by W3C and its partners attempt to remove this ambiguity.
By representing and organizing data using standard metadata built upon an overarching ontology of semantic meaning, real-world ontology is applied to web resources in the same manner as librarians apply it to traditional printed resources using MARC 21. Using XML and RDF to represent resources with standard metadata such as the Dublin Core and URIrefs creates a library catalogue potentially extending to the whole World Wide Web.
Ontologies are developed within subject domains to model real-life objects. They apply a set of rules as to how these objects can relate to one another. “An ontology provides a precise vocabulary with which knowledge can be represented, how [objects] can be grouped, and what relationships connect them together.” (Segaran et al 2009, p. 127). This would further help to structure the data held in Wikipedia, and as it is written in XHTML it is directly compatible with XML, making it machine-readable and machine-understandable. If applied correctly, an ontology could be created based on a taxonomy provided by resources on Wikipedia. The W3C standard OWL (Web Ontology Language, http://chrisbrookditablog.blogspot.com/2010/12/semantic-web-technologies-owl.html Accessed 12-12-2010), a mark-up technology for describing ontologies, allows machines to understand the relationships and hierarchy of subjects and objects.
Information specialists currently have a great vested interest in utilising these stacked technologies to interpret data in new ways. Simple RDF graphs can be joined through new relationships which, since they are built on simple ontological rules, allow complex reasoning to be performed using many more variables than we might normally consider. Thus URIs act like primary keys in a relational database.
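That primary-key analogy can be sketched with two hypothetical datasets that both use the same URI for a person; the URIs and predicates are invented for illustration:

```python
# Two independently published triple sets that share a URI can be
# joined exactly as tables are joined on a primary key.
PERSON = "http://example.org/id/john"

census = {(PERSON, "ex:age", "29")}
housing = {(PERSON, "ex:tenure", "council rented")}

merged = census | housing          # an RDF graph merge is just set union
facts = {p: o for (s, p, o) in merged if s == PERSON}
# facts now draws on both datasets, keyed by the shared URI
```

No schema negotiation is needed: because both publishers used the same strong identifier, the union of the two graphs is already a join.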
Projects exist to create semantic wikis, from either pre-defined or user-created ‘folksonomies’, where a wiki stores “some portion of its data in a way that can be queried elsewhere. Typical uses of such data include querying it within the wiki (sometimes using standard query languages like SPARQL); aggregating it in displays like tables, maps and calendars; exporting it via formats like RDF, OWL or CSV; and reasoning with it, to calculate new facts from the given facts.” (http://en.wikipedia.org/wiki/Semantic_Wikipedia Accessed 12-12-2010)
Freebase (http://www.freebase.com/home Accessed 02-12-2010) is an example of a semantic wiki, where articles can be built automatically from multiple sources.
Utilising Web 2.0 and Semantic Web Technologies
Exploring the possibilities of utilizing semantically structured data, such as that held in Freebase or in the Government’s proposed publication of its datasets in RDF format, offers the possibility of querying and mashing together information and data in new ways. “Companies and businesses often need to gather data from a range of sources; XML can serve as a uniform data exchange format, and thus can facilitate such gathering, processing, re-use and distribution of data across various applications.” (Chowdhury 2007, p. 164)
The UK Department for Communities and Local Government, in conjunction with Local Authorities’ planning departments, developed and rolled out a standard on-line planning application form called 1APP (http://www.planningportal.gov.uk/PpApplications/genpub/en/Ecabinet. Accessed 04-12-2010). From 6 April 2008, the standard on-line form allowed Local Authorities in England and Wales to receive planning applications digitally. Through the application of an XML schema, details of planning applications could be captured digitally, such as applicant details, type of development, number of housing units, floorspace of commercial development, and a range of other information that can be uploaded directly into bespoke back-office planning systems.
This greatly improved efficiency over the old paper-based system, eliminating data entry and scanning. Local Authorities tend to use large and complex databases for dealing with planning applications, and to develop and roll out a web-based system for document and event handling would be far too costly. However, the use of the online XML schema has the power to make the data collected re-usable in other applications. Data could be read and fed into other departments’ systems and used to calculate statistics on housing and commercial development, and to identify trends to appraise the success of land-use planning policy.
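A sketch of how such schema-conformant XML might be consumed downstream, using Python’s standard library; the element names here are invented for illustration and are not the real 1APP schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment of a planning application submission.
submission = """<application>
  <developmentType>residential</developmentType>
  <housingUnits>24</housingUnits>
</application>"""

root = ET.fromstring(submission)
units = int(root.findtext("housingUnits"))
dev_type = root.findtext("developmentType")
# The parsed values can now feed back-office systems or statistics,
# with no re-keying or scanning of paper forms.
```

Because every Local Authority receives the same element names and types, the same small parser works across all of them.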
The flexible nature of semantically described data leaves information professionals able to look for new overlaps of information, taking high-level ontological rules and attempting to realise the same relationships in multiple datasets, for example the thousands of datasets held by government departments written in RDF. Ontological rules can be applied to census data, crime data, housing tenure, population projections, ethnic breakdowns and socio-economic classifications, which all describe real-world objects. Ordnance Survey have also created an ontology to describe geographical entities using OWL-DL, making geographical locations and defined areas explicit to machines through RDF (http://www.ordnancesurvey.co.uk/oswebsite/ontology/ Accessed 01-12-2010).
This has implications for information professionals working in policy research, for instance being able to write spatial queries to explore questions not normally possible: what is the prevalent socio-economic class of 25-34 year olds of Somali origin who live in the top 10% of areas for crime against the person and live in council-rented dwellings? Here, multiple data sets are queried, including the ability to geocode pieces of data. Instead of copying data sets into GIS systems to query, the work is all done over the web, so the information specialist is hopefully assured that she is using the most current data.
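The kind of query described above amounts to matching triple patterns with variables, which is what SPARQL does; here is a toy sketch of that idea in plain Python, over invented data and predicates:

```python
# Tiny triple store plus a pattern matcher: "?" marks a variable,
# as in SPARQL. All identifiers are hypothetical.
data = {
    ("ex:area1", "ex:crimeDecile", "1"),
    ("ex:john", "ex:livesIn", "ex:area1"),
    ("ex:john", "ex:tenure", "council rented"),
}

def match(pattern, triple):
    """A pattern element matches if it is a variable or equal."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

# Analogous to: SELECT ?who WHERE { ?who ex:livesIn ex:area1 }
residents = sorted(t[0] for t in data
                   if match(("?who", "ex:livesIn", "ex:area1"), t))
```

A real SPARQL engine chains many such patterns together, binding the variables across datasets published by different agencies.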
Linked data and the government’s drive to create the ‘Open Data Movement’ are essentially making data available as a commodity that can be taken and manipulated by private enterprise to create more opportunity for information systems development (http://data.gov.uk/ Accessed 10-12-2010). Data from planning applications, linked using RDF graphs to all manner of data published by the ONS, can provide local government with information which would be invaluable to service planning.
For example, in the future land-use planning must consider environmental and societal changes such as rising sea levels and overpopulation, and guide the development of the infrastructure to support this.
Planning applications give data on the number of houses to be built. The data is also geocoded, so it can easily be mapped using OS geography data published in RDF format. The Environment Agency publishes flood risk zones, again geocoded, and population projections give official population statistics. Complex statistical queries could be built up from the data represented by RDF statements to determine where development must be directed in the future to mitigate flooding, and where services will need to be located to cope with the environmental changes we will see. Policy making thus becomes far more efficient. Using RDF graphs and SPARQL to quickly observe relationships in the real world, without the need to manually bring together disparate datasets in a GIS system, the web becomes a platform for government to formulate policy based on evidence provided by linked data sets.
Conclusion
Web 2.0 applications make publishing accessible with minimal effort and rely on user-defined tags as metadata. The Semantic Web uses marked-up sections of machine-readable data found in databases or documents on the web, and describes them through real-world semantic models or ontologies.
“The Semantic Web is about two things. It is about common formats for integration and combination of data drawn from diverse sources, where on the original Web mainly concentrated on the interchange of documents. It is also about language for recording how the data relates to real world objects. That allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not by wires but by being about the same thing." (http://www.w3.org/2001/sw/ Accessed 28-11-2010)
The Semantic Web movement will hopefully lead to more merging of web-based data. Discoveries of new information from existing information will become possible by looking for overlaps between new and existing data and could lead to new advances in science, medicine and improve our general understanding of the real world. 
References
http://chrisbrookditablog.blogspot.com/2010/12/web-20-and-web-30-semantic-web.html
(URL for this blog post)
Burke, M. ‘The Semantic Web and the Digital Library’, Aslib Proceedings 61 (3), 2009
Chowdhury, G. G., Chowdhury, S. Organizing Information: From the Shelf to the Web, 2007, Facet
Governor, J., Hinchcliffe, D., Nickull, D. Web 2.0 Architectures, 2009, O’Reilly
Segaran, T., Evans, C., Taylor, J. Programming the Semantic Web, 2009, O’Reilly
http://www.w3.org/2001/sw/ Accessed 28-11-2010
Wikipedia.org Accessed 04-12-2010
http://www.freebase.com/home Accessed 02-12-2010
http://data.gov.uk/ Accessed 10-12-2010
http://dublincore.org/ (Dublin Core Metadata Initiative) Accessed 02-12-2010


Sunday, 5 December 2010

The Semantic Web Technologies - OWL

OWL is a language for processing web information.

What is OWL?

  • OWL stands for Web Ontology Language
  • OWL is built on top of RDF
  • OWL is for processing information on the web
  • OWL was designed to be interpreted by computers
  • OWL was not designed for being read by people
  • OWL is written in XML
  • OWL has three sublanguages
  • OWL is a W3C standard

What is Ontology?

Ontology is about the exact description of things and their relationships.
For the web, ontology is about the exact description of web information and relationships between web information.

Why OWL?

OWL is a part of the "Semantic Web Vision" - a future where:
  • Web information has exact meaning
  • Web information can be processed by computers
  • Computers can integrate information from the web

OWL was Designed for Processing Information

OWL was designed to provide a common way to process the content of web information (instead of displaying it).
OWL was designed to be read by computer applications (instead of humans).

OWL is Different from RDF

OWL and RDF are much the same thing, but OWL is a stronger language with greater machine interpretability than RDF.
OWL comes with a larger vocabulary and stronger syntax than RDF.

OWL Sublanguages

OWL has three sublanguages:
  • OWL Lite
  • OWL DL (includes OWL Lite)
  • OWL Full (includes OWL DL)

OWL is Written in XML

By using XML, OWL information can easily be exchanged between different types of computers using different types of operating system and application languages.

OWL is a Web Standard

OWL became a W3C (World Wide Web Consortium) Recommendation in February 2004.
A W3C Recommendation is understood by the industry and the web community as a web standard. A W3C Recommendation is a stable specification developed by a W3C Working Group and reviewed by the W3C Membership.

http://www.w3schools.com/rdf/rdf_owl.asp Accessed 05-12-2010

In computer science and information science, an ontology is a formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts. It is used to reason about the entities within that domain, and may be used to describe the domain.
In theory, an ontology is a "formal, explicit specification of a shared conceptualisation". An ontology provides a shared vocabulary, which can be used to model a domain — that is, the type of objects and/or concepts that exist, and their properties and relations.
Ontologies are used in artificial intelligence, the Semantic Web, systems engineering, software engineering, biomedical informatics, library science, enterprise bookmarking, and information architecture as a form of knowledge representation about the world or some part of it. The creation of domain ontologies is also fundamental to the definition and use of an enterprise architecture framework.

http://en.wikipedia.org/wiki/Ontology_%28information_science%29  Accessed 05-12-2010

The Semantic Web Technologies - RDFS

RDF describes resources with classes, properties, and values.

In addition, RDF also needs a way to define application-specific classes and properties. Application-specific classes and properties must be defined using extensions to RDF.
One such extension is RDF Schema.

RDF Schema does not provide actual application-specific classes and properties. (That is dealt with by OWL, covered in the next post.)

Instead RDF Schema provides the framework to describe application-specific classes and properties.
Classes in RDF Schema are much like classes in object-oriented programming languages. This allows resources to be defined as instances of classes, and as subclasses of classes.

E.g.
<?xml version="1.0"?>

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xml:base="http://www.animals.fake/animals#">

<rdf:Description rdf:ID="animal">
  <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdf:Description>

<rdf:Description rdf:ID="horse">
  <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  <rdfs:subClassOf rdf:resource="#animal"/>
</rdf:Description>

</rdf:RDF>

In the example above, the resource "horse" is a subclass of the class "animal".
 http://www.w3schools.com/rdf/rdf_schema.asp
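The entailment behind that example can be sketched in a few lines of Python; the instance name is invented for illustration:

```python
# rdfs:subClassOf entailment: if horse is a subclass of animal,
# anything typed as a horse is also an animal.
subclass_of = {"horse": "animal"}

def inferred_classes(cls):
    """Return cls plus every superclass reachable via subClassOf."""
    classes = []
    while cls is not None:
        classes.append(cls)
        cls = subclass_of.get(cls)
    return classes

# A hypothetical instance typed as a horse picks up both classes:
dobbin_types = inferred_classes("horse")
```

This is the simplest form of reasoning a machine can perform once the hierarchy is stated explicitly rather than implied by prose.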

The purpose of RDFS is thus to express the categories and their hierarchy (taxonomy) in a syntax that allows the elements to be grouped into classes using a standard vocabulary.
In terms of distinguishing a taxonomy from an ontology, we can use a definition from Chowdhury (2007):



Machines not only need to read the correct pieces of data, which through RDF statements can be identified using a framework of common metadata standards; each subject domain also has its own structure of categories and sub-categories. This is an example of a taxonomy (much like the Wikipedia example, where subjects have various categories in a parent-child relationship). “Taxonomy could take the form of a web directory such as Yahoo, [or a] subject heading list (e.g. the Library of Congress subject heading list)” (Chowdhury 2007). RDFS defines a syntax for the taxonomy, defining the parent-child relationships that exist in that hierarchy (Pidcock 2003, quoted in Chowdhury).


The RDFS vocabulary builds on the limited vocabulary of RDF.

Classes

  • rdfs:Resource is the class of everything. All things described by RDF are resources.
  • rdfs:Class declares a resource as a class for other resources.
A typical example of an rdfs:Class is foaf:Person in the Friend of a Friend (FOAF) vocabulary. An instance of foaf:Person is a resource that is linked to the class foaf:Person using the rdf:type property, as in the following formal expression of the natural language sentence 'John is a Person':
ex:John rdf:type foaf:Person
The definition of rdfs:Class is recursive: rdfs:Class is the rdfs:Class of any rdfs:Class.
The other classes described by the RDF and RDFS specifications are:
  • rdfs:Literal – literal values such as strings and integers. Property values such as textual strings are examples of RDF literals. Literals may be plain or typed.
  • rdfs:Datatype – the class of datatypes. rdfs:Datatype is both an instance of and a subclass of rdfs:Class. Each instance of rdfs:Datatype is a subclass of rdfs:Literal.
  • rdf:XMLLiteral – the class of XML literal values. rdf:XMLLiteral is an instance of rdfs:Datatype (and thus a subclass of rdfs:Literal).
  • rdf:Property – the class of properties.

Properties

Properties are instances of the class rdf:Property and describe a relation between subject resources and object resources. When used as such a property is a predicate (see also RDF: reification).
  • rdfs:domain of an rdf:predicate declares the class of the subject in a triple whose second component is the predicate.
  • rdfs:range of an rdf:predicate declares the class or datatype of the object in a triple whose second component is the predicate.
For example, the following declarations are used to express that the property ex:employer relates a subject, which is of type foaf:Person, to an object, which is of type foaf:Organization:
ex:employer rdfs:domain foaf:Person
ex:employer rdfs:range foaf:Organization
Given the previous two declarations, the following triple requires that ex:John is necessarily a foaf:Person, and ex:CompanyX is necessarily a foaf:Organization:
ex:John ex:employer ex:CompanyX
  • rdf:type is a property used to state that a resource is an instance of a class.
  • rdfs:subClassOf is used to declare hierarchies of classes.
For example, the following declares that 'Every Person is an Agent':
foaf:Person rdfs:subClassOf foaf:Agent
Hierarchies of classes support inheritance of a property domain and range (see definitions in next section) from a class to its subclasses.
  • rdfs:subPropertyOf is an instance of rdf:Property that is used to state that all resources related by one property are also related by another.
  • rdfs:label is an instance of rdf:Property that may be used to provide a human-readable version of a resource's name.
  • rdfs:comment is an instance of rdf:Property that may be used to provide a human-readable description of a resource.
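A reasoner's use of rdfs:domain and rdfs:range, as in the ex:employer declarations above, can be sketched like this in plain Python:

```python
# Given rdfs:domain and rdfs:range declarations for a property,
# infer rdf:type triples from a single statement.
domain = {"ex:employer": "foaf:Person"}
range_ = {"ex:employer": "foaf:Organization"}

def entailed_types(triple):
    """Return the rdf:type triples entailed by domain/range rules."""
    s, p, o = triple
    inferred = set()
    if p in domain:
        inferred.add((s, "rdf:type", domain[p]))
    if p in range_:
        inferred.add((o, "rdf:type", range_[p]))
    return inferred

types = entailed_types(("ex:John", "ex:employer", "ex:CompanyX"))
```

From the one employer statement, the machine concludes that ex:John is a foaf:Person and ex:CompanyX a foaf:Organization, without either fact being stated directly.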

Utility Properties

  • rdfs:seeAlso is an instance of rdf:Property that is used to indicate a resource that might provide additional information about the subject resource.
  • rdfs:isDefinedBy is an instance of rdf:Property that is used to indicate a resource defining the subject resource. This property may be used to indicate an RDF vocabulary in which a resource is described.

See also

  • SPARQL Query Language for RDF
http://en.wikipedia.org/wiki/RDF_Schema