WEB Languages – XML, SGML for Libraries and Information Centres

Introduction

XML and SGML are markup languages that point toward the future of the Web. XML is, in a sense, the next step after HTML, while SGML is the older parent standard whose power XML brings closer to the Web; together they represent the continued growth of the Web toward new applications, languages and formats. As is generally known, HTML stands for HyperText Markup Language. Less well known, XML stands for Extensible Markup Language and SGML stands for Standard Generalized Markup Language. To understand any of these languages, one must first understand the concept of markup.

Markup is everything in a document that is not content. The term originally referred to the handwritten notations that a designer would add to typewritten text in the days before computers. These notations told the typesetter how to lay out the copy, what typeface to use, and anything else about the appearance or structure that the publisher or author wanted for the document.

The Standard Generalized Markup Language, or SGML, is an international standard (ISO 8879) first published in 1986. SGML prescribes a standard format for embedding descriptive markup within a document and was created to give computer-based markup a common standard. More importantly, and crucial to its real value and power, SGML also specifies a standard method for describing the structure of a document. For the original World Wide Web, SGML proved too large and unwieldy, so HTML, a much simpler markup language defined as an application of SGML, was used for the markup commonly seen on the Web today. As the Web continued to expand, an intermediary language was needed to bridge the functionality of HTML and the potential of SGML. That language is XML.

What is XML?

The Extensible Markup Language (XML) is a universal format for structured documents and data on the Web. The XML 1.0 W3C Recommendation describes it as "an extremely simple dialect [or 'subset'] of SGML", the goal of which "is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML," for which reason "XML has been designed for ease of implementation, and for interoperability with both SGML and HTML."[1] In a sense, XML is an intermediary language that bridges HTML and SGML.

XML was initially developed by a W3C Generic SGML Editorial Review Board, formed under the auspices of the W3 Consortium in 1996 and chaired by Jon Bosak of Sun Microsystems. While both industry and academic parties were involved in XML's creation, it was not meant to be a commercial product but, again, a universal standard for the Web.

Who Uses XML?

The original rhetoric around XML centred on business, science and engineering applications. For example, "XML is primarily intended to meet the requirements of large-scale Web content providers for industry-specific markup, vendor-neutral data exchange, media-independent publishing, one-on-one marketing, workflow management in collaborative authoring environments, and the processing of Web documents by intelligent clients. It is also expected to find use in certain metadata applications."[2] As the Web grew, though, hackers and other interested parties began to take up XML, so that a growing and dedicated community now makes use of it. While it is still not as widely used as HTML, it continues to gain popularity and may well become a de facto standard for the Web.

Characteristics of XML

XML is based on the same parent technology as HTML. XML, though, was specifically designed to manage information at the scale the Internet now requires and to accomplish tasks that HTML cannot handle. For librarians and information scientists, XML is an incredibly powerful system for managing information and is very important for structuring online documentation services. On the differences between HTML, XML and SGML, Elizabeth Castro points out, "HTML lets everyone do some things, XML lets some people do practically anything and SGML lets a few people do everything."[3] The thing to note about this quote is that as one moves from HTML to XML to SGML, the languages become more powerful but the learning curve also becomes much steeper. This is not to say that XML is only for computer scientists, but one should know some HTML before learning XML.

The Power of XML

HTML tags are mostly formatting oriented, and different browsers do not display the various implementations of HTML consistently. HTML tags concentrate on formatting concerns and give little specific information about the content of a web page. XML, rather than being just a language for creating web pages, is a language for creating other languages. The power of XML lies in the fact that you may use it to design your own custom markup language and then use that language to mark up your documents. Strictly defined, XML is a grammatical system for constructing custom markup languages. Librarians might want to use XML to create a language for describing genealogical, mathematical, chemical or business online catalogues and large collections of online documents. Library managers need to be technologically aware: a technologically informed librarian might one day have to oversee a particular type of online catalogue for an engineering, science or business library. XML gives these libraries a grammatical system for creating a customized markup language.

XML Coding

In XML, the tags describe the contents they enclose. The tag structure is similar to HTML, with tags, attributes and values. What follows is a specific example of XML code. If one were a wildlife conservationist running a large online library or database, one might want an EndangeredSpeciesML for an endangered species online catalogue. This would be written in ESML, the Endangered Species Markup Language. The following excerpt shows actual XML code (see appendix 1 for the full XML page code printout).[4]

<endangered_species>
  <animal>
    <name language="English">Tiger</name>
    <name language="Latin">panthera tigris</name>
    <threats>
      <threat>poachers</threat>
      <threat>habitat destruction</threat>
      <threat>trade in tiger bones for traditional Chinese medicine (TCM)</threat>
    </threats>
    <weight>500 pounds</weight>
    <length>3 yards from nose to tail</length>
    <source sectionid="120" newspaperid="21" />
    <subspecies>
      <name language="English">Balian</name>
      <name language="Latin">P.t. balica</name>
      <region>Bali</region>
      <population year="1937">0</population>
    </subspecies>
    <subspecies>
      <name language="English">Javan</name>
      <name language="Latin">P.t. sondaica</name>
      <region>Java</region>
      <!-- remaining entries omitted; see appendix 1 -->
    </subspecies>
  </animal>
</endangered_species>

Note that the tags describe the content they enclose: <endangered_species>, <animal>, <threat>, <weight>, <subspecies>. Also, XML is not as lenient as HTML. It demands careful attention to upper- and lowercase letters, quotation marks, closing tags and other details that HTML lets slide.
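
The strictness described above can be seen with any XML parser. The following is a minimal sketch in Python, using only the standard library, that assumes a shortened version of the ESML fragment. The variable and function names are illustrative and do not come from the original example.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed version of the ESML fragment.
WELL_FORMED = """<endangered_species>
  <animal>
    <name language="English">Tiger</name>
    <threats>
      <threat>poachers</threat>
    </threats>
  </animal>
</endangered_species>"""

# The closing tag differs in case from the opening tag: <name> ... </Name>.
MISMATCHED_CASE = '<animal><name language="English">Tiger</Name></animal>'

# The attribute value has no quotation marks, which HTML browsers often tolerate.
UNQUOTED_ATTRIBUTE = '<animal><name language=English>Tiger</name></animal>'


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for label, text in [
    ("well-formed", WELL_FORMED),
    ("mismatched case", MISMATCHED_CASE),
    ("unquoted attribute", UNQUOTED_ATTRIBUTE),
]:
    print(label, "->", is_well_formed(text))
```

Only the first fragment is accepted. The other two contain the kind of slip a typical HTML browser would quietly ignore.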

XML's Helpers

There are several sister technologies or subcomponents of XML that harness XML's power. The XML Schema language is used to define the custom markup language that you create with XML. XSLT, or Extensible Stylesheet Language Transformations, lets you extract and transform the information on a page into any shape you need. XSLT is used to present metadata summaries and full versions of the same document and is commonly used to convert XML to HTML. XPath is a system for identifying different parts of the document. Finally, XLink and XPointer are future technologies not yet fully implemented by current browsers.
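
To give a feel for how XPath identifies parts of a document, here is a small sketch in Python. It relies on the limited XPath subset supported by the standard library's ElementTree module, not a full XPath engine, and it uses a shortened copy of the ESML example. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

ESML = """<endangered_species>
  <animal>
    <name language="English">Tiger</name>
    <name language="Latin">panthera tigris</name>
    <threats>
      <threat>poachers</threat>
      <threat>habitat destruction</threat>
    </threats>
    <subspecies>
      <name language="English">Balian</name>
      <region>Bali</region>
      <population year="1937">0</population>
    </subspecies>
    <subspecies>
      <name language="English">Javan</name>
      <region>Java</region>
    </subspecies>
  </animal>
</endangered_species>"""

root = ET.fromstring(ESML)

# A path of element names, much like a path through folders.
threats = [t.text for t in root.findall("./animal/threats/threat")]

# A predicate in square brackets selects by attribute value.
latin_name = root.find("./animal/name[@language='Latin']").text

# A double slash searches at any depth below the starting point.
regions = [r.text for r in root.findall(".//subspecies/region")]

print(threats)     # ['poachers', 'habitat destruction']
print(latin_name)  # panthera tigris
print(regions)     # ['Bali', 'Java']
```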

This problem of implementation is a good place to discuss XML in the real world of the WWW. Practically, the problem with XML is that only the latest versions of the two major browsers support the core of XML, and no browser supports XLink and XPointer. An intermediary solution that many developers have adopted is to use XSLT to convert XML to HTML, or to write their code in XHTML. XHTML is HTML written according to XML's stricter rules. Critics of this method, though, say that it defeats the purpose of having a separate language like XML.
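
The idea of converting XML to HTML for browsers can be sketched as follows. This is not XSLT: Python's standard library has no XSLT processor, so the transformation is written by hand as a stand-in for what a stylesheet would do. The function name and the choice of HTML output are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

ESML = """<endangered_species>
  <animal>
    <name language="English">Tiger</name>
    <name language="Latin">panthera tigris</name>
    <threats>
      <threat>poachers</threat>
      <threat>habitat destruction</threat>
    </threats>
  </animal>
</endangered_species>"""


def animal_to_html(animal: ET.Element) -> str:
    """Turn one <animal> element into a heading and a bulleted list of threats."""
    english = animal.find("name[@language='English']").text
    latin = animal.find("name[@language='Latin']").text
    items = "".join(
        f"<li>{escape(threat.text)}</li>"
        for threat in animal.findall("threats/threat")
    )
    return f"<h1>{escape(english)} (<i>{escape(latin)}</i>)</h1><ul>{items}</ul>"


root = ET.fromstring(ESML)
html_page = "".join(animal_to_html(animal) for animal in root.findall("animal"))
print(html_page)
```

The descriptive tags (<name>, <threat>) are replaced by formatting tags (<h1>, <li>), which is exactly the information that is lost when XML is flattened into HTML.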

In terms of tools for writing XML, any standard text editor can be used, such as SimpleText on Macintosh or Notepad or WordPad on Windows. The popular web editor Dreamweaver supports XML coding, and there is also an XMLwordpad for Windows.

If one does want to begin creating his or her own XML-based personal markup language, one needs to begin with a DTD (Document Type Definition) or an XML Schema. Returning to the wildlife conservationist example, the DTD or schema for ESML, the Endangered Species Markup Language, would be a set of grammatical rules that define the elements of the markup language (e.g. <animal>, <subspecies>, <population>, <threats>) and how they may be combined.
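
To make the idea of "grammatical rules" concrete, here is a rough sketch in Python. It is a hand-written stand-in for what a DTD validator does, not a real DTD or schema processor, and the grammar table is a hypothetical one inferred from the ESML example. A real DTD would also constrain the order and number of children and the allowed attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical ESML grammar: each element maps to the children it may contain.
# In DTD syntax a rule such as the third one might read:
#   <!ELEMENT threats (threat+)>
ALLOWED_CHILDREN = {
    "endangered_species": {"animal"},
    "animal": {"name", "threats", "weight", "length", "source", "subspecies"},
    "threats": {"threat"},
    "subspecies": {"name", "region", "population"},
    "name": set(),
    "threat": set(),
    "weight": set(),
    "length": set(),
    "source": set(),
    "region": set(),
    "population": set(),
}


def grammar_errors(element: ET.Element) -> list:
    """Return a list of messages for children that the grammar does not allow."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    errors = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            errors.extend(grammar_errors(child))
    return errors


valid = ET.fromstring(
    "<endangered_species><animal><name>Tiger</name>"
    "<threats><threat>poachers</threat></threats></animal></endangered_species>"
)
invalid = ET.fromstring(
    "<endangered_species><animal><threat>poachers</threat></animal>"
    "</endangered_species>"
)

print(grammar_errors(valid))    # []
print(grammar_errors(invalid))  # ['<threat> is not allowed inside <animal>']
```

Both documents are well-formed XML, but only the first follows the rules of the markup language.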

Future Directions

Future directions of the Web involve the further development of XML. They currently centre on implementations of XLink and XPointer and a constant stream of new developments. Steve DeRose, a computer engineer at Brown University, points out that many future applications of XML will be linked with online libraries and electronic documents.[5] DeRose, a co-editor of the XLink specification, was also an electronic book and online libraries pioneer. Early on, he co-founded Electronic Book Technologies, and designed and built DynaText, DynaWeb, and other products widely used for electronic document delivery and electronic libraries. DeRose comments, "I work with electronic documents: those whose primary existence is their computer form rather than paper. Currently, this involves mainly XML".[6]

XLink and XPointer are two promising future directions for XML and the Web. XPointer allows you to link to any part of a document that you specify.

XLink provides multi-directional links. Traditionally, HTML provides only one-way links; using the browser's 'Go Back' button was the only way to return to the location of the original link. With XLink, users can return to the original location via a corresponding link at the first link's destination. XLink also provides links to multiple destinations, so users may be able to choose between different destinations from a single link. Other benefits include embedding content from a linked document inline.
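
Multi-directional and multi-destination links can be pictured with an XLink "extended" link. The sketch below is an assumption-laden illustration in Python: the element names (<tiger_links>, <page>, <go>) and file names are hypothetical, while the xlink attributes and namespace come from the XLink specification. The code only reads the link description; it does not make a browser follow it.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical extended link: one species page and two subspecies pages,
# with arcs declared in both directions.
DOC = f"""<tiger_links xmlns:xlink="{XLINK}" xlink:type="extended">
  <page xlink:type="locator" xlink:href="tiger.xml" xlink:label="species"/>
  <page xlink:type="locator" xlink:href="bali.xml" xlink:label="subspecies"/>
  <page xlink:type="locator" xlink:href="java.xml" xlink:label="subspecies"/>
  <go xlink:type="arc" xlink:from="species" xlink:to="subspecies"/>
  <go xlink:type="arc" xlink:from="subspecies" xlink:to="species"/>
</tiger_links>"""


def attr(name: str) -> str:
    """Build the namespace-qualified attribute name ElementTree expects."""
    return f"{{{XLINK}}}{name}"


def traversals(link: ET.Element) -> list:
    """List every (source, destination) pair that the link's arcs describe."""
    by_label = {}
    for element in link:
        if element.get(attr("type")) == "locator":
            label = element.get(attr("label"))
            by_label.setdefault(label, []).append(element.get(attr("href")))
    pairs = []
    for element in link:
        if element.get(attr("type")) == "arc":
            for source in by_label.get(element.get(attr("from")), []):
                for destination in by_label.get(element.get(attr("to")), []):
                    pairs.append((source, destination))
    return pairs


pairs = traversals(ET.fromstring(DOC))
for source, destination in pairs:
    print(source, "->", destination)
```

The first arc gives the species page a choice of two destinations, and the second arc supplies the return path from each subspecies page, without either target document containing a link itself.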

Finally, XML designers have a vision of combining the WWW and databases through the concept of 'linkbases'. Like databases, these linkbases may be used for organizing link locations. Currently, HTML links rely upon fixed machine and file system addresses to find information. XLL (the Extensible Linking Language, the earlier working name for what became XLink and XPointer) provides the framework for link databases to store these addresses. When kept up to date, link databases will free HTML publishers from maintaining frequently changing link locations. All in all, these future developments of web languages open up a myriad of new possibilities for the Web and the future of Internet communications.



[1] Official Body of the World Wide Web on the Current and Futures of languages. http://www.w3.org/W3C. July 3, 2002.

[2] XML and SGML Introductions. Oasis. http://www.oasis-open.org/cover/. July 3, 2002.

[3] Castro, Elizabeth. XML for the World Wide Web. Berkeley: Peachpit Press, 2001. p. 11.

[4] This code is adapted from the exercises found in Castro, Elizabeth. XML for the World Wide Web: Visual QuickStart Guide. Berkeley: Peachpit Press, 2001. pp. 5-32.

[5] DeRose, Steve. Personal Home Page. http://www.stg.brown.edu/~sjd/index.html. July 3, 2002.

[6] DeRose, Steve. Ibid., 2002.