Archive for July, 2009
More on Motivation for Investment in Implementation of a MODS Ontology
Posted by admin in Library World, Linked Data, Standard on July 31st, 2009
In my previous post, Another Step Toward Lifting Library Metadata into the Cloud, I offered a partially developed MODS ontology and an approach intended to assist others who might have an interest in critiquing, extending, modifying, or simply commenting on such an ontology. However, one of the first comments on my posting brought to my attention another library-oriented ontology, Bibliontology, and the poster later essentially asked, “Why a MODS ontology?” Without directly answering that question, I would like to reiterate that my
Primary Motive is to:
Promote further discussion on identification and eventually implementation of a community acceptable strategy for migrating existing MARC based metadata into a form more universally accessible to consumers and producers of Linked Data.
So what are some of the paths we might take for migrating MARC to RDF, presuming that RDF is the preferred way to express Linked Data?
Well, there is already significant support for converting MARC to MODS (see the “Tools & Utilities” section of the Library of Congress Document: MARCXML) and a Draft RDA to MODS mapping. So MODS is at least one natural intermediary candidate to target for promotion into the Linked Data cloud. Furthermore, establishing a community acceptable “standard” approach for expressing MODS in RDF formally benefits from the existence of a MODS ontology, which will define a common vocabulary for inserting MODS triples into triplestores, and which can aid in the formulation of “natural” (hopefully relatively simple and easy to understand) SPARQL queries.
In fairness, it should also be noted that:
- There is also already a way to go straight from MARC to a limited Dublin Core based RDF implementation
- a Google search on MARC + Ontology produces many results
- Significant work has been done on MarcOnt
- And a Google search on MODS + ontology does produce some results (now including some references to my own postings).
Still, after some searching, it has not been clear to me that a full MODS ontology yet exists. By that I mean one that fully captures all the details of the MODS XML schema.
The above is a bit of a global perspective on why others (especially other MARC producers from the library community) might be interested in a MODS ontology. The following is more of a local perspective on the point of developing a MODS ontology.
In my work context we are required to produce MODS. I won’t go into total detail about all the reasons for this, but essentially they include:
* UC San Diego and the UCSD Libraries, where I work, are part of the larger University of California system.
* The libraries of the UC system are centrally served by the California Digital Library (CDL)
* CDL provides a shared Digital Preservation Repository (DPR) service
* DPR requires deposited content to be accompanied by METS
* Deposited METS files must conform to predefined “profiles”
* Acceptable candidates for the “descriptive metadata” component of the METS profiles are primarily either MODS or Dublin Core
* Our local cataloging staff has already invested significant work in full MARC cataloging of hundreds of thousands of objects that we want to send to the DPR, and they don’t want to “dumb down” to Dublin Core. MODS therefore becomes the preferred choice of metadata expression for incorporation in METS.
* Then, because there are automated mechanisms for generating MODS from our existing MARC encoded data, producing MODS is something that we’ve known how to manage and have been able to implement on a mass scale for several years now. So, to state it simply, like it or not, we already have lots of MODS data that we need to work with.
Just to tell a little more of our local story:
A couple of years ago we started to work on building a digital library. We looked at open source products like DSpace, Fedora, and others, but one of the limitations we encountered was lack of support for metadata with the richness of MARC or MODS. Also, we were not aware of any obvious established, extensible relational database schemas for dealing with the complexity of MARC or MODS and XML database products didn’t seem to perform well at the time. So, we took a leap and began exploring ways to encode our MARC data in RDF for access from a triplestore via SPARQL. Since we already had MODS, it was natural for us to try and find a way to express that in RDF. Starting only with the notion that “subject” and “predicate” should always have URIs (URLs), and because we:
- Couldn’t find an already existing MODS ontology
- Were not sophisticated enough to create our own
- And (in a way unfortunately) didn’t really think just to reference the existing Library of Congress MODS XML schema
we created individual files for each of the predicates we needed. We used the same CDL defined ARK based file naming convention for these predicate URLs as we did for our actual digital content files and then created a mapping between the ARKs and MODS vocabulary elements.
Thus, for example, the URL for “mods:title” for us is:
http://libraries.ucsd.edu/ark:/20775/bb72705143
where the “http://libraries.ucsd.edu/ark:/20775/” prefix component is constant for all other MODS predicates.
Note: Unfortunately, our system is not available to the public, so this link will not generally work for everyone.
Armed with this approach, we encoded data for several hundred thousand MARC –> MODS records to RDF and loaded on the order of 15 million triples into AllegroGraph, which, thanks to vendor licensing terms, we were allowed to use at no cost as long as we were working with less than 50 million triples.
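As a rough illustration of the approach described above, the following Python sketch turns a flat list of MODS element values into N-Triples lines that use ARK-based predicate URLs. Only the “mods:title” ARK comes from this post; the record ARK, the “mods:abstract” ARK, and the function names are hypothetical placeholders, and this is not a description of how our production system is actually written.

```python
# Sketch only: map MODS element names to ARK-based predicate URLs
# and emit one N-Triples line per element value.
ARK_PREFIX = "http://libraries.ucsd.edu/ark:/20775/"

# Only the mods:title identifier is taken from the post.
PREDICATE_ARKS = {
    "mods:title": "bb72705143",
    "mods:abstract": "bb00000000",  # hypothetical placeholder ARK
}


def escape_literal(value):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_ark, fields):
    """Return one N-Triples line per (element, value) pair of a record."""
    subject = "<" + ARK_PREFIX + record_ark + ">"
    lines = []
    for element, value in fields:
        predicate = "<" + ARK_PREFIX + PREDICATE_ARKS[element] + ">"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    example = [("mods:title", 'A "Sample" Title'), ("mods:abstract", "Two\nlines")]
    # "bb99999999" is a hypothetical record ARK used only for illustration.
    for line in record_to_ntriples("bb99999999", example):
        print(line)
```

Because every predicate is an opaque ARK, a query against such data has to go through the ARK-to-MODS mapping first, which is part of why a shared, readable MODS vocabulary is attractive.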
The following are some examples of what the user interface for our system displays: (download and zoom in on the files for closer viewing, if you like)
RDF triples
JSON view of data
JSON manifestation of data for processing by client-side JavaScript
RDF XML file
RDF graph
All this works for us and we are able to do SPARQL queries on the results. We have had thoughts about sharing more of our work with others, but have been painfully aware that we are missing anything like a candidate for a community shareable MODS ontology that would enable others to generate RDF for their MODS records in a way that would potentially allow us all (i.e. those starting with MARC data) to make our catalog records available as consistently encoded linkable MODS data.
We have wanted to fill that gap by beginning to encourage some community discussion about a MODS ontology that we could eventually migrate our own data and software towards.
So, again with that end in mind, my previous posting offers a partially complete MODS ontology candidate along with a visual aid assisted methodology to help in the validation of that ontology as it is assembled in a sequential layer-like fashion from increasingly large subsets of the complete body of statements which define the full ontology.
Another Step Toward Lifting Library Metadata into the Cloud
Posted by admin in Library World, Linked Data, Standard on July 22nd, 2009
This post is the beginning of what will eventually be a longer and more complete entry describing my efforts toward, and reasons for, attempting to create a MODS ontology.
First, a simple statement of the reason for attempting to create a MODS ontology.
Long-term goal: To help open the door for the vast quantity of the world’s MARC formatted librarian created metadata to migrate into the Linked Open Data space.
I would hope that the existence of an established and accepted ontology for MODS could enable the conversion of MARC to MODS to RDF in a way that would facilitate ingestion of that data into triplestores and support “natural” SPARQL query formation. Unlike the XML schema representation of MODS, an OWL-based expression of MODS could also provide an ontological base for asserting equivalence (owl:sameAs) relationships to other ontologies, and other set-theoretic assertions about MODS elements. These additional benefits of a MODS ontology could, hopefully, help to establish richer potential for library metadata to be integrated and queried (via SPARQL) with other Linked Open Data sets.
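To make the idea of equivalence assertions a little more concrete, here is a minimal Python sketch that writes such statements as N-Triples. The “http://example.org/mods#” namespace is a hypothetical stand-in, not the namespace of the candidate ontology, and the particular pairings with Dublin Core terms are illustrative only, not a claim that the elements are truly equivalent. The OWL and Dublin Core URIs are the real ones. Note that owl:sameAs relates individuals; for properties, owl:equivalentProperty is generally the more appropriate assertion, so that is what the sketch uses.

```python
# Sketch only: emit equivalence assertions between hypothetical MODS
# ontology properties and Dublin Core terms as N-Triples lines.
OWL = "http://www.w3.org/2002/07/owl#"
DCTERMS = "http://purl.org/dc/terms/"
MODS = "http://example.org/mods#"  # hypothetical namespace, for illustration

# Illustrative pairings; real mappings would need community review.
EQUIVALENT_PROPERTIES = [
    (MODS + "title", DCTERMS + "title"),
    (MODS + "abstract", DCTERMS + "abstract"),
]


def equivalence_triples(pairs):
    """Return one owl:equivalentProperty statement per (left, right) pair."""
    return [f"<{left}> <{OWL}equivalentProperty> <{right}> ." for left, right in pairs]


if __name__ == "__main__":
    for line in equivalence_triples(EQUIVALENT_PROPERTIES):
        print(line)
```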
Additionally, see my follow-up post, More on Motivation for Investment in Implementation of a MODS Ontology
As for my efforts toward developing such an ontology, the sequence of images below is intended to serve as a visual aid for comprehending the increasing complexity of structure involved in implementing an RDF/OWL ontology based representation of the MODS XML schema discussed at: http://www.loc.gov/standards/mods/ and defined in detail at: http://www.loc.gov/standards/mods/v3/mods-3-3.xsd
Each succeeding image builds on the previous one in the sequence by adding representations of a new set of statements about the MODS ontology to the previous level. The sets of statements chosen for addition at each new level of the progression are selected so as to keep the overall structure of the representation “tree-like” for as long as possible, only introducing visually complicating overlapping (one-to-many) relationships in later levels, after most generally “tree-preserving” visualizations have been exhausted.
The purpose of matching the increasingly complex visualization sequence to an increasingly complete set of RDF statements, from which each visualization was derived, is to assist reviewers in understanding and verifying the accuracy and completeness of the final set of statements that comprise the full ontology being offered for consideration.
Thus, each image level corresponds to an increasingly large subset of the entire set of RDF statements about the MODS ontology and has been produced by the open source graph visualization program Cytoscape (available for free download from the main Cytoscape site). A table of RDF (subject/predicate/object) statements is associated with each image, and each image was produced by importing that set of statements into Cytoscape. The full spreadsheet is also available as: Excel File and Google Doc.
Furthermore, associated with each image and data table level is a corresponding already imported Cytoscape (.cys) file which may be downloaded, viewed and manipulated directly within Cytoscape.
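For readers who would like to build their own statement tables, the following Python sketch writes a list of (subject, predicate, object) statements as a tab-delimited table of the kind Cytoscape can import as a network, with the columns designated as source, interaction, and target at import time. The three sample rows and the “hasElement” predicate name are invented for illustration and are not copied from the actual spreadsheet.

```python
# Sketch only: serialize subject/predicate/object statements as a
# tab-delimited table suitable for a table-based network import.
import csv
import io

# Invented sample rows; the real tables are linked from each level below.
STATEMENTS = [
    ("ModsCollection", "rdfs:subClassOf", "owl:Thing"),
    ("Mods", "rdfs:subClassOf", "ModsCollection"),
    ("modsGroup", "hasElement", "titleInfo"),  # hypothetical predicate name
]


def statements_to_tsv(statements):
    """Return the statements as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(statements)
    return buffer.getvalue()


if __name__ == "__main__":
    print(statements_to_tsv(STATEMENTS), end="")
```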
Click on any of the images for larger versions.

Level 1
This graph clearly demonstrates the connection of the 20 top-level MODS elements to the “modsGroup” center, which is in turn connected to the most general “owl:Thing –> ModsCollection –> Mods” hierarchy.
level.01_statement_table
level.01.cys

Level 2
Adds another level of class structure to top-level mods elements that require it.
level.02_statement_table
level.02.cys

Level 3
Adds unique literal (owl:DatatypeProperty) values to appropriate locations in the MODS ontology structure.
level.03_statement_table
level.03.cys

Level 4
Adds remaining components that preserve the pure tree structure of the graph, including repeated groups of predicates, such as Xlink and LanguageGroup, plus enumeration classes, enumeration values, and a few subclasses of owl:Thing.
level.04_statement_table
level.04.cys

Level 5
Adds only eight new statements that are brought in specially now because they begin to significantly alter the pure tree-like structure of the previous graphs. In other words, they introduce new branches to already connected nodes, and thus begin to introduce noticeable new complexities in the graph.
level5_statement_table
level.05.cys
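The distinction drawn at Level 5, between statements that keep the graph tree-like and statements that add a branch between nodes that are already connected, can also be checked mechanically. The Python sketch below is one possible way to do that with a small union-find structure, treating each statement as an undirected edge. It is not the procedure I actually used to pick the eight statements, and the sample statements in it are invented.

```python
# Sketch only: report statements that link two nodes which are already
# connected by earlier statements, i.e. statements that break a pure tree.
def tree_breaking_statements(statements):
    """Return the statements whose subject and object were already connected."""
    parent = {}

    def find(node):
        # Follow parent links to the representative of the node's component.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    breaking = []
    for subject, predicate, obj in statements:
        root_subject, root_object = find(subject), find(obj)
        if root_subject == root_object:
            breaking.append((subject, predicate, obj))
        else:
            parent[root_subject] = root_object
    return breaking


if __name__ == "__main__":
    # Invented sample: the last statement joins two already connected nodes.
    sample = [
        ("A", "p", "B"),
        ("A", "p", "C"),
        ("B", "p", "D"),
        ("C", "q", "D"),
    ]
    print(tree_breaking_statements(sample))
```

Statements reported by such a check are natural candidates to hold back for the later, more visually complex levels.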

Level 6
Adds numerous repeated subClass relationships, identifying which items are Date, LanguageGroup, Xlink and Enumeration. Adding these new nodes and branches significantly complicates the visual appearance of the graph, since these many-to-one relationships introduce crossing branches. However, bringing them in this late has allowed us to preserve some visual clarity of structure, and thus to better understand and verify the completeness of the build-up of the ontology to this point.
level.06_statement_table
level.06.cys
Note a couple other less useful, but possibly interesting patterns that can be derived from the master spreadsheet. These are provided mostly to stimulate the imagination regarding the kinds of patterns which might be observed with this technique:
In addition to the as yet incomplete upload of data and graphs above, the following is little more than an outline of areas that remain to be completed in the rest of this article.
Key issues to be considered and perhaps debated:
- Why is the existing MODS XML schema http://www.loc.gov/standards/mods/v3/mods-3-3.xsd not enough?
- MODS predicate naming and labeling
- Handling of repeated names used in multiple contexts
- Use of other vocabularies besides, or in addition to OWL (OWL 2, SKOS, etc.)
- Need for a demonstration SPARQL endpoint
- Necessity for completeness of the ontology
- Naming of class/structure components (owl:ObjectProperty) may be less significant than naming of literal properties (owl:DatatypeProperty)
My currently most complete OWL MODS ontology is downloadable from here.
An Example MODS Record expressed in the candidate ontology
Not yet available
Tools Used:
- TopBraid Composer
- Cytoscape
- Protege
- GraphViz
- Altova SemanticWorks
- Microsoft Excel
Basic list of owl:DatatypeProperty (literal) predicates defined in this MODS ontology:
Text outline of MODS structure
References and Resources :
- 2009 Semantic Technology Conference – Digital Library Session
- XML Schema Tutorial
- MODS
- MARC
- Linked Data
- TopBraid Composer User Group
- Can Bibliographic Data Be Put Directly Onto the Semantic Web?
- Expressing Dublin Core metadata using the Resource Description Framework (RDF)
- MARC Must Die
- What I’ve Changed My Mind About – (Roy Tennant on “MARC Must Die”)
- A Bibliographic Metadata Infrastructure for the 21st Century
- Cataloging Futures – RDF
- Karen Coyle’s RDA presentation at Code4Lib
- R&D: RDA in RDF or: Can Resource Description become Rigorous Data?
- Karen Coyle Answers Martha Yee’s questions
- RDF and RDA: declaring and modeling library metadata
- LinkedData (Code4Lib)
- Simile RDFizers/marcmods2rdf
- MARC or MODS to RDF (with links to examples)
- MODS2RDF transform
- Welkin – graph-based RDF visualizer
- RDF Gravity (RDF Graph Visualization Tool)
UCSD implementation and utilization of MODS in XDRE/DAMS/PAS
Acknowledgements:
I would particularly like to thank Gokhan Soydan from TopQuadrant for his assistance in making use of the TopBraid XML Schema Importer which, although not 100% automatic, was extremely helpful in providing a concrete example of how XML schema language might be transformed into OWL. Gokhan has indicated that he expects future versions of TopBraid Composer to handle conversion of XML Schema constructs to OWL more completely. The current version (1.3.0), which I used, did most of the job, though it took me a while to realize that it failed to process the top-level tag elements which, unfortunately, define the whole first-level primary structure that binds the 20 top-level MODS elements to the central “modsGroup”.
Use Cases Dictate How You Adopt the Semantic Web
http://www.devx.com/semantic/Article/42350
DevX.com article, starts with the following quote:
“The most widespread—and likely most reported on—Semantic Web technology is a W3C recommendation called RDF (Resource Description Framework). An XML-based language for representing data in knowledge bases, RDF is used in nearly all existing online knowledge bases. But while the spotlight is on RDF, other technologies such as NLP, SPARQL, ontologies, and inference all work in concert to enable the Semantic Web stack.”









Recent Comments