Alternatives to Author-Centric Knowledge Organization

Rama C. Hoetzlein

University of California Santa Barbara

Abstract: This article explores the differences between collaborative and collective authorship, focusing on the most obvious example of the latter, the Internet, and the challenges it poses for knowledge organization. An alternative to current author-centric knowledge systems is presented in the experimental Quanta system.

Keywords / mots-clés: Quanta; Knowledge environment; Representation of text; Authorship; Information studies; Encyclopedias; Language-based systems / Quanta; Environnement de connaissance; Représentation de texte; Profession d'auteur; études de l'information; Encyclopédies; Systèmes basés sur un langage

The INKE Research Group comprises over 35 researchers (and their research assistants and postdoctoral fellows) at more than 20 universities in Canada, England, the United States, and Ireland, and across 20 partners in the public and private sectors. INKE is a large-scale, long-term, interdisciplinary project to study the future of books and reading, supported by the Social Sciences and Humanities Research Council of Canada, as well as contributions from participating universities and partners, and bringing together activities associated with book history and textual scholarship, user experience studies, interface design, and prototyping of digital reading environments.

Rama C. Hoetzlein is an Assistant Professor in the Department of Architecture and Media Technology, Aalborg University at Copenhagen, Lautrupvang 15, Ballerup, DK. Email: rch@umail.ucsb.edu .

Contextual introduction

The Encyclopédie, printed and published in French by André François Le Breton in 1751, was one of the first large-scale reference works. Containing 35 volumes and over 71,818 articles, with over 120 contributors listed, the Encyclopédie was “on a vastly greater scale than any [earlier] English works” (Lough, 1971, p. 3). The primary author, Denis Diderot, originally estimated completion in 1754, but the work was not completed until 1772. The delay may be attributed chiefly to the immense scope and vision of the work:

The work whose first volume we are presenting today has two aims. As an Encyclopedia, it is to set forth as well as possible the order and connection of the parts of human knowledge. As a Reasoned Dictionary of the Sciences, Arts, and Trades, it is to contain the general principles that form the basis of each science and each art, liberal or mechanical, and the most essential facts that make up the body and substance of each. (Schwab, 2009, para. 2)

Diderot recognized that such a work could not be completed by one man, but that it must involve a "society of men of letters and skilled workmen, each working separately on his own part" (Schwab, 2009, para. 1). The division of articles according to fields of study certainly made so large a collaboration manageable, and remains a primary feature of modern encyclopedias (e.g., Encyclopedia Britannica); however, the writing of the Encyclopédie was not without major challenges. Only 20 of the 120 contributors were paid. In a few cases, paid authors fled or produced little work. One author was condemned due to external events. Many authors were paid much less than the amount of work they produced warranted (Lough, 1971). Collaborative authorship always presents a range of problems, more often social and human than practical. Nonetheless, notable persons from many distinct fields of study were brought together to complete the work.

Encyclopedias are an example of collaborative authorship, in which the authors generally know one another and have agreed to a strategy for publication and division of labor. Subramanyam (1983) reviews the types of collaboration, and finds that collaboration affects the visibility and productivity of modern researchers. Harande (2001) finds that in certain fields, such as technology, productivity is not necessarily linked to collaborative authorship. We can distinguish this from collective authorship, in which the authors are anonymous, unfamiliar with one another, or have not agreed on a division of labor. Nearly all historic authorship is of the first type, including most reference works, encyclopedias, dictionaries, and scientific papers. The Internet is the first unique example of collective authorship. Considered as a whole, it represents a semi-anonymous, collectively authored, delimited text. The introduction of HTML permits hyperlinked texts in the decentralized space of the web (Berners-Lee, 1999); however, the fact that individual texts are personally owned and maintained causes a delineation of web pages along boundaries of authorship. Thus, publication on the Internet is similar to the publication of an individually authored text. The Internet acts more as a repository for these texts, which is further supported by the fact that the Internet is not a reference work, or condensation of knowledge, but is usually “searched” for sources.

Collectively authored online encyclopedias, such as Wikipedia, represent a different kind of text. Wikipedia was conceived of as a free online encyclopedia, yet the writing of content was extremely slow as it initially followed traditional models of authorship. Jimmy Wales and Larry Sanger realized that a text based on a wiki, software for online content creation, would allow many anonymous authors to contribute to the same text. Wikipedia now has over 2.9 million articles, with an average of 20 edits per page, while frequently edited articles can have over 2000 edits. Since the average article length does not change significantly,1 each paragraph or sentence is thus contributed by a different anonymous author, making the text highly granular in its authorship.

While the scale of Wikipedia is immense in comparison with other references, it has also been heavily criticized. The number of administrators capable of resolving conflicts or locking pages is only 1,675, far fewer than the number of contributors. In addition, administrators have particular areas of interest but may not be authorities in these fields. A unique phenomenon of Wikipedia is the "revert war," in which two or more authors competitively modify the same content. Viégas, Wattenberg, and Kushal (2004) find that revert wars are not limited to controversial topics and may take many different forms. As they mention, the “neutral point of view” philosophy is Wikipedia's proposed solution, with discussions taking place in a separate space. Contributors may be required to provide citations for each statement, which is often impossible in fields such as the humanities where discussion is more interpretive. One cannot express an individual opinion on a literary work, for example, unless that opinion is supported by a previously cited author. In some cases, it was found (by the author) that citations had nothing to do with the statement, and were provided merely to lend authority. Anonymous, collective authorship thus presents several unique challenges to the creation of reference works.

Knowledge organization

An alternative approach to the collective authorship of tertiary texts may be motivated by the fields of library science and knowledge organization, which extend primarily from the concept of the database. A definition of knowledge organization, provided by the librarian Berwick Sayers (1959), reveals a close relationship to the card catalog:

[knowledge organization is] not only the general grouping of things for location or identification purposes; it is also their arrangement in some sort of logical order so that the relationship of the things may be ascertained. (pp. 1-7)

In knowledge organization, greater importance is placed on the arrangement of primary sources than on their collation, which is the concern of encyclopedias. There may be many books on the topic of physics present in a card catalog, with different views on a particular theory, yet the purpose of the catalog is not to resolve or summarize different physicists’ views, but simply to maintain each interpretation as a distinct text. This approach is in contrast with encyclopedias, which attempt to condense knowledge from many sources.

There are several benefits to data-centric knowledge organization. It is trivial to collectively author a card catalog since there is a one-to-one relationship between a catalog entry and a primary source. A librarian may easily enter a new book without being concerned that the content may overlap or conflict with another similar book. This is also its drawback, however, since a card catalog provides many references but no detailed information regarding the field.

Unlike encyclopedias, a card catalog does not attempt to communicate a summary of human knowledge; rather, it only attempts to organize it. This simplifies authorship, while the drawback is that one must still ultimately browse many primary sources to find meaning. An encyclopedia presents meaningful content, but introduces conflicts of authorship due to the need to simplify and merge original sources that may contain different views. Ideally, we would like a system for human knowledge that balances organization with levels of meaning.

Vannevar Bush (2003), with the hypothetical memex, describes the ideal features of a collectively authored, centralized knowledge system:2

The owner of the memex, let us say, is interested in the origin and properties of the bow and arrow. He has dozens of possibly pertinent books and articles in his memex. First he runs through an encyclopedia, finds an interesting but sketchy article, and leaves it projected. Next, in a history, he finds another pertinent item, and ties the two together. (p. 45)

Interestingly, this description covers libraries (primary sources), encyclopedias (condensations), and timelines (systematic organization) – different scales of authorship and organization present in a single system. The realization is that human thought is at times broad, in need of summary overviews, yet also specific once details are grasped.

Bush’s description of the memex, written in 1945, envisions a system of “books of all sorts, pictures, periodicals and newspapers” placed on microfilm and fitting into a space the size of a desk, so that “the matter of bulk is well taken care of by microfilm” (Bush, 2003, p. 45). This is a wonderful idea until one realizes that the US Library of Congress alone would require a full city block of microfilm (as it does), and the entire Internet would need sixteen city blocks.3 With modern storage, however, this is not the primary issue. More challenging are the many theoretical and practical problems of organizing, navigating, collecting, and summarizing this knowledge, primary among them that each reader, or organizer, will interpret the material differently.
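
The figure of sixteen city blocks follows from the storage estimates given in note 3. The short calculation below is only an illustrative sketch: it assumes that 1 terabyte equals 1,000 gigabytes, that the Library of Congress on microfilm fills exactly one city block, and the function name rolls_needed is a hypothetical label rather than anything taken from the sources.

```python
# Figures from note 3, converted to whole gigabytes (assuming 1 TB = 1000 GB)
ROLL_GB = 2                    # one roll of microfilm = 0.002 TB
LIBRARY_GB = 25_000 * 1000     # Library of Congress = 25,000 TB
INTERNET_GB = 400_000 * 1000   # entire Internet (est.) = 400,000 TB


def rolls_needed(size_gb, roll_gb=ROLL_GB):
    """Number of microfilm rolls needed to hold size_gb, rounded up."""
    return -(-size_gb // roll_gb)


library_rolls = rolls_needed(LIBRARY_GB)
internet_rolls = rolls_needed(INTERNET_GB)

# Assumption: the Library of Congress on microfilm occupies one city block,
# so the ratio of roll counts is the number of blocks the Internet would need.
blocks = internet_rolls / library_rolls
print(library_rolls, internet_rolls, blocks)
```

Under these assumptions the Library of Congress corresponds to 12.5 million rolls and the Internet to 200 million, a ratio of sixteen.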

Quanta: A system for language-based knowledge organization

At present, information systems, card catalogs, and journal indices are almost all stored in relational databases. The relational database model became the industry standard in the 1970s largely due to its simplicity and relative ease of use at the time, while the competing model of network databases continued only as research. However, it is well known that relational databases have critical restrictions on their grammatical expressiveness and flexibility (Levene, 1998). Thus, while modern encyclopedias struggle with being too unstructured, leading to tedious organization and conflicts in authorship, knowledge systems tend to be too structured, allowing for only indexical information rather than deep facts.

Quanta is an experimental system designed to be both structured and grammatically rich, thus residing at a middle point between encyclopedia (language) and databases (objects). Facts are represented in Quanta by taking the sentence as a unit of knowledge,4 and the word as its atomic component, so that the language of Quanta consists of phrases with words and phrases separated by vertical bars, such as:

light | has | constant | speed | in | vacuum | CITE | Einstein | ENTRY | J. Smith
life | exists | on | mars | BELIEF | J. Smith | RATING | 2
Hamlet | was | written by | Shakespeare | ENTRY | J. Smith
J. Smith | is an | active user

This structure is similar to that of the Prolog programming language, which is based on predicate logic. Note that the capitalized words delimit second-order phrases regarding the statement. The unique aspect of Quanta, however, is that these sentences are embedded in a hypergraph database, providing relational context in addition to grammar (Hoetzlein, 2007). By representing language in this way, it is possible to express complex ideas while at the same time automatically linking these ideas to every other context in which they appear. A search for the word “fish,” for example, is automatically connected via the hypergraph to every context in which fish appears – types of fish, clouds that look like fish, robotic fish, cooking fish – all appear when the idea of fish is explored.
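
As a rough illustration only, the notation shown above can be read with a few lines of Python. The sketch assumes that the four capitalized tags appearing in the examples (CITE, ENTRY, BELIEF, RATING) are the only second-order markers and that each takes the single token that follows it as its value. A plain term-to-statement index stands in for the hypergraph; the names parse_statement and build_index are hypothetical and are not taken from the Quanta implementation.

```python
from collections import defaultdict

# Assumption: only the tags visible in the examples mark second-order phrases.
META_TAGS = {"CITE", "ENTRY", "BELIEF", "RATING"}


def parse_statement(line):
    """Split one bar-delimited statement into its phrase and its meta tags."""
    tokens = [t.strip() for t in line.split("|")]
    phrase, meta = [], {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in META_TAGS and i + 1 < len(tokens):
            meta[tok] = tokens[i + 1]   # the tag's value is the next token
            i += 2
        else:
            phrase.append(tok)
            i += 1
    return {"phrase": phrase, "meta": meta}


def build_index(statements):
    """Map every term (phrase token or tag value) to the statements using it."""
    index = defaultdict(list)
    for n, st in enumerate(statements):
        terms = {w.lower() for w in st["phrase"]}
        terms |= {v.lower() for v in st["meta"].values()}
        for term in terms:
            index[term].append(n)
    return index


lines = [
    "light | has | constant | speed | in | vacuum | CITE | Einstein | ENTRY | J. Smith",
    "life | exists | on | mars | BELIEF | J. Smith | RATING | 2",
    "Hamlet | was | written by | Shakespeare | ENTRY | J. Smith",
    "J. Smith | is an | active user",
]
statements = [parse_statement(l) for l in lines]
index = build_index(statements)
print(index["j. smith"])   # every statement in which J. Smith appears
```

In this simplified form, looking up a single term returns every statement in which it occurs, whether as subject, cited source, or author, which is the kind of automatic linking across contexts that the hypergraph is described as providing.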

Authority versus filtering

It is interesting to observe that in many Wikipedia articles, the practice of placing references after each sentence is becoming increasingly common. Conflicts between ideas must be continually resolved through careful editing or administrative intervention to express both perspectives while maintaining readability. The authority of administrators has led to a number of criticisms as individuals with strong biases may be frequent editors of a particular topic. The task of determining which ideas should be present is a challenging one. Does an implausible theory by a lesser physicist deserve to be present on a page with well-known theorists? Should personal descriptions of God be allowed on pages about religions? What if they are by notable saints? Should views that man never landed on the moon be allowed on the page for space flight?

The syntactic level of Quanta provides a format that can distinguish between the writer (author) of an idea and the source (citation) of that idea. This helps to contextualize the fact relative to the source of the information, but also according to the biases of the person who offered the fact in the first place. In an online system, in which participants log in, the author can be automatically associated with each idea that is entered.

At a higher level, a novel approach to the collective authorship of encyclopedias would enable all points of view to co-exist. The primary reason for a “revert war” is that a biased, anonymous author has the power to delete the ideas of another individual; however, this situation does not occur in reality. No one can erase our beliefs, as much as they might try to do so through persuasion or argument. Similarly, the ideal knowledge system would allow all perspectives to co-exist simultaneously, so that anyone may add an authoritative fact or opinion, regardless of its correctness. Such a system may be especially well suited to children, allowing perspectives from all ages to co-exist alongside authoritative sources on a particular topic. Filtering would allow a grade school classroom to show similar ideas from the same age range, or the researcher to filter only for key sources.

A collective, additive-only text presents several challenges. The first is how the reader may distinguish authoritative facts from incorrect ones. As sentences are atomic, Quanta automatically assigns authorship to every fact entered. Thus, it becomes possible to filter a page based on all facts from a particular authoritative individual or group. To see factual knowledge on astronomy, for example, one could filter based on authors with degrees in this field, or by specific authors. Finally, the system could allow readers to rate the accuracy of any statement, so that collective, statistical agreement on the correctness of ideas may still be recorded. The need for content administration no longer exists, as all authorship is individual and additive.
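
The two mechanisms described in this paragraph, filtering by author and collective rating, can be sketched briefly. The example below is a minimal illustration under stated assumptions: the Fact record, its fields, the author names, and the use of a simple arithmetic mean for "statistical agreement" are all hypothetical choices made for the sketch and do not describe Quanta's internal design.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Fact:
    text: str                      # the sentence itself
    author: str                    # assigned automatically at entry time
    ratings: list = field(default_factory=list)   # reader accuracy ratings


def filter_by_authors(facts, trusted):
    """Keep only facts entered by authors the reader has chosen to trust."""
    return [f for f in facts if f.author in trusted]


def mean_rating(fact):
    """Average reader rating, or None if nobody has rated the fact yet."""
    return mean(fact.ratings) if fact.ratings else None


facts = [
    Fact("light has constant speed in vacuum", "A. Astronomer", [5, 4]),
    Fact("life exists on mars", "J. Smith", [2, 1]),
    Fact("the moon is made of cheese", "anonymous"),
]

# A reader interested only in a chosen authority filters on that author.
visible = filter_by_authors(facts, trusted={"A. Astronomer"})
print([f.text for f in visible])
print(mean_rating(facts[1]))
```

Nothing is removed from the underlying collection in either operation; the reader's filter and the ratings only change what is shown and how it is weighted.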

Another challenge presented by collective authorship is vandalism. In a collective, additive-only text, vandalism by deletion is eliminated since deletion is not allowed, and the impact of other vandalism is diminished since each fact is automatically tagged with its author at the time of entry. Since Quanta maintains author associations internally, controlling vandalism simply involves filtering out the offending authors. Anonymous login and authorship are also permitted, with facts tagged as anonymous; in this case, the reader may just as easily view anonymous facts as filter them out.
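
An additive-only text of this kind might be modelled as an append-only collection whose interface simply offers no way to delete. The class below is a hypothetical sketch, not the Quanta database: the name AdditiveStore, its methods, and the "anonymous" label are assumptions introduced for illustration.

```python
class AdditiveStore:
    """Append-only collection of facts; deliberately has no delete method."""

    ANONYMOUS = "anonymous"

    def __init__(self):
        self._facts = []   # (author, text) pairs in order of entry

    def add(self, text, author=None):
        # The author tag is fixed at entry time; a missing login is
        # recorded as anonymous rather than rejected.
        self._facts.append((author or self.ANONYMOUS, text))

    def view(self, hide_authors=(), show_anonymous=True):
        """Return the texts a reader sees after applying their own filters."""
        hidden = set(hide_authors)
        if not show_anonymous:
            hidden.add(self.ANONYMOUS)
        return [text for author, text in self._facts if author not in hidden]


store = AdditiveStore()
store.add("Hamlet was written by Shakespeare", author="J. Smith")
store.add("Hamlet was written by nobody", author="vandal42")
store.add("Hamlet is set in Denmark")   # anonymous entry

print(store.view())
print(store.view(hide_authors={"vandal42"}, show_anonymous=False))
```

The unwanted entry remains stored, but a reader who hides its author never sees it, so the filter does the work that deletion and administrative locking do elsewhere.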

Providing fine-grained filtering for readers resolves many issues, and more closely mimics our human perception of reality. When we encounter views that go against our beliefs we may filter them out completely (Wason, 2004). This is due to the basic necessity of simplifying our experiences. People attempt to live in places that conform to their personal system of beliefs. Rather than restrict authorship, which results in power issues, precise filtering tools place the control and bias entirely at the reader’s discretion.

The perspective of authorship presented here is that there are no incorrect views, nor completely correct ones, only degrees of authority. In this sense, the text is comparable to a library. No one may remove books, but anyone may add one or check one out. Similarly, in this experimental system, no one writing a text may remove facts provided by others, but anyone may add new facts.5 This strategy of (non-)authority is not maintained by a select group of individuals, but through an automated and systematic process of assigning authorship at the time facts are created, and allowing readers precise control over the filtering of texts. While very similar in spirit and motivation to open authorship on the Internet, the design presented is intended for a system that is logically structured to provide deep organization for general types of information, and to carefully track authorship to simplify navigation and filtering.

Results and limitations

Quanta was designed as a knowledge organization and authoring system to allow different levels of knowledge to co-exist, both systematic (relational) and encyclopedic (grammatical). In practice, this was accomplished with the creation of a custom, non-relational database written in C++, and a larger meta-system for querying, visualizing, and navigating that data. Design decisions for the system may be found in the original work on this project (Hoetzlein, 2007). The initial results were primarily focused on data storage and representation, resulting in an offline prototype database.

Figure 1: Screenshots of Quanta showing a) a timeline filtered to show contemporary sculpture, and b) a circle-packing view of the Linnaean ontology for animals

As the data representation of Quanta is both semantic and systematic, consisting of a rich grammar and an interconnected database, it becomes possible to create novel visualizations of content. Several visualizations developed include timelines of arbitrary concepts (“zoomable” on both axes, time and detail), scientific graphs of any two numeric values, circle packing as a two-dimensional mapping metaphor for conceptual ideas, and ontology trees to view individual belief hierarchies (see Figure 1). In each mode, filtering allows different sets of concepts to be viewed. These real-time visualizations demonstrate that facts can be efficiently represented and queried in the structures described here.

Recent work has focused on an online system to introduce the social aspects of additive authorship. The core database has been redesigned to provide the ability to track authorship on individual sentences, along with other types of self-referential information. A working online system is still in development, while the offline prototype demonstrates many of the concepts described here with examples from the fields of computer graphics, painting, philosophy, mineralogy, and chemistry.

A related project is the PReE/ProSE system for social computing, a collaboration between the Social Computing Group at the University of California Santa Barbara and the INKE project at the University of Victoria, which introduces the notion of professional readership as an extended collection of historic persons, contemporary critics, authors, readers, and the associated primary sources surrounding them.

Each of these systems has benefits and limitations. Quanta uses a custom database, and is thus similar to OpenCyc, a database of common sense maintained in Lisp format. Both systems require significant low-level development. While the scope of Quanta is both deep and broad, focusing on all of human knowledge, an online system allowing sentence-level tracking of authorship as described is still in development. Wikipedia is based on a content-authoring system, and is more closely related to document authorship than to databases. PReE/ProSE uses a relational MySQL database, and while currently available, its granularity is not sufficient to permit the range of general knowledge found in Quanta. Despite these differences, each system described makes unique advances in authorship, readership, and organization according to its goals.

Conclusions

In nearly every knowledge organization system available – from print encyclopedias to online sources – the concept of authority restricts knowledge according to the biases of the administrators of that content. Libraries are a partial exception, since no one may remove books once they are added, yet even a library must select which materials to include. The Internet itself is the only modern system that allows unlimited addition of knowledge without authoritative control. Yet, the Internet is non-encyclopedic; that is, uncondensed. As suggested here, it may be possible to design future knowledge organization systems that remove central control by providing unlimited additions, like the Internet, but with fine-grained (word-level) tracking and filtering tools for authors and readers.

The anonymous, collective authorship of large references is a problem unique to the digital humanities, as the challenge of organizing global knowledge is likely to require ongoing human intervention in the areas of information science, computation and literature, philosophy, and others. An alternative to current author-centric knowledge systems is presented here, while issues related to the study of control, authorship, and reading practices of digital reference works remain open areas for further research.

Notes

  1. All facts are from the Wikipedia History page (http://en.wikipedia.org/wiki/History_of_Wikipedia). Wikipedia states that the average article length is a little over half that of an Encyclopedia Britannica article.
  2. Centralized in the sense of collected knowledge (bringing together), while the underlying storage may be decentralized in the physical sense and/or also decentralized in the managerial sense.
  3. One roll of microfilm = 0.002 terabytes. Largest electronic storage (2008) = 1000 terabytes. Library of Congress = 25,000 terabytes. Entire Internet (est.) = 400,000 terabytes (estimated by the International Data Corporation, current at time of writing).
  4. Sentences have variable lengths and structure whereas relational records are fixed length with fixed fields.
  5. It may be argued that unrestricted additions would consume large amounts of space; however, the grammatical structure of Quanta allows it to be stored using a dictionary-compression scheme, with words stored as numbers. Thus, the storage needs are significantly less than plain text articles.
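
The dictionary-compression idea in note 5 can be illustrated with a small sketch. It assumes the simplest possible scheme, in which each distinct word or phrase receives the next unused integer; the functions encode and decode are hypothetical and say nothing about how Quanta actually lays out its storage.

```python
def encode(statements):
    """Replace each word or phrase with an integer id from a shared dictionary."""
    vocab = {}      # token -> integer id, assigned in order of first use
    encoded = []
    for tokens in statements:
        ids = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
            ids.append(vocab[tok])
        encoded.append(ids)
    return vocab, encoded


def decode(vocab, encoded):
    """Recover the original statements from the ids and the dictionary."""
    words = {i: w for w, i in vocab.items()}
    return [[words[i] for i in ids] for ids in encoded]


statements = [
    ["Hamlet", "was", "written by", "Shakespeare", "ENTRY", "J. Smith"],
    ["J. Smith", "is an", "active user"],
]
vocab, encoded = encode(statements)
print(encoded)
```

Because a repeated term such as "J. Smith" is stored once in the dictionary and thereafter only as a number, the cost of each additional statement grows with the count of its tokens rather than with the length of their text.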

References

Berners-Lee, Tim. (1999). Weaving the Web: The original design and ultimate destiny of the World Wide Web. New York: Harper Collins.

Bush, Vannevar. (2003). As we may think. In Noah Wardrip-Fruin & Nick Montfort (Eds.), The new media reader (pp. 37-47). Cambridge: MIT Press.

Harande, Yahya Ibrahim. (2001). Author productivity and collaboration: An investigation of the relationship using the literature of technology. Libri, 51, 124-27.

Hoetzlein, Rama. (2007). The organization of human knowledge: Systems for interdisciplinary research. (Master’s thesis). University of California, Santa Barbara, CA.

Levene, Mark. (1998). On the information content of semi-structured databases. Acta Cybernetica, 13(3), 257-275.

Lough, John. (1971). The encyclopédie. New York: David McKay.

Sayers, W.C. Berwick. (1959). A manual of classification for librarians and bibliographers. London: Andre Deutsch.

Schwab, Richard N. (Trans.) (2009). Preliminary discourse to the encyclopedia. The encyclopedia of Diderot & d'Alembert: Collaborative translation project. URL: http://quod.lib.umich.edu/d/did [August 15, 2009].

Siemens, Ray, Haswell, Eric, Watson, Gerry, McColl, Alastair, & Armstrong, Karin. (2006, September 14-17). Integrating tools into professional academic processes: A first look at the Renaissance English Knowledgebase (REKn). Conference proceedings from Bringing text alive: The future of scholarship, pedagogy, and electronic publication. Ann Arbor, MI: University of Michigan, Rackham Graduate School.

Subramanyam, K. (1983). Bibliometric studies of research collaboration: A review. Journal of Information Science, 6(1), 33-38.

Viégas, Fernanda, Wattenberg, Martin, & Kushal, Dave. (2004). Studying cooperation and conflict between authors with history flow visualizations. Proceedings from ACM SIGCHI special interest group in computer-human interaction. Vienna, Austria: ACM.

Wason, Peter Cathcart. (2004). Confirmation bias. In Margit E. Oswald & Stefan Grosjean (Eds.), Cognitive illusions: A handbook on fallacies and biases in thinking, judgment and memory (pp. 79-96). Hove: Psychology Press.


CCSP Press
Scholarly and Research Communication
Volume 3, Issue 3, Article ID 030125, 9 pages
Journal URL: www.src-online.ca
Received August 17, 2011, Accepted November 15, 2011, Published September 1, 2012

Hoetzlein, Rama C. (2012). Alternatives to Author-Centric Knowledge Organization. Scholarly and Research Communication, 3(3): 030125, 9 pp.

© 2012 Rama C. Hoetzlein. This Open Access article is distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc-nd/2.5/ca), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.