Saturday, October 21, 2017

Taxonomies for Specific Business Needs

Designing controlled vocabularies to meet specific business needs was the topic of my latest conference presentation at Taxonomy Boot Camp London on October 17. There are two aspects to this topic: (1) the type of controlled vocabulary to choose, and (2) whether to have the same controlled vocabulary or distinct controlled vocabularies to serve different business needs.

For choosing the type of controlled vocabulary, the most common choices are a thesaurus, a hierarchical taxonomy, or a faceted taxonomy. It is also possible to have some kind of combination or hybrid type of these. I’ve discussed the difference between taxonomies and thesauri in previous blog posts, “Taxonomies vs. Thesauri”  and “Taxonomies vs. Thesauri: Practical Implementations.”
So, now I will focus on whether to have the same controlled vocabulary or distinct controlled vocabularies to serve different business needs.

What are different business needs? Taxonomies may be needed to make different kinds of information organized and easily searched or discovered and retrieved by different users, including:
  • Internal documentation, including policies and procedures, market and product research, etc.
  • Digital assets for content managers to reuse in publishing content 
  • Product information, such as a product catalog for ecommerce, for customer
  • Curated, premium content for subscriber
  • Informational content for the public
 While different organizations have their own needs, the same organization could have more than one business need for a taxonomy, such as an internal use and an external customer-facing use. An organization in the business of publishing content, may even have quite different published products for different users and purposes and consider each of those as separate business needs.

Taxonomies are versatile, so it is possible and worth considering having a single, master taxonomy serve all business needs, with terms classified for different uses. Terms managed in a taxonomy management system can be tagged with a category assignment as to which use they are for, such as some for an internal use, some for an external, and perhaps many of them for both.  You determine the type of category, let’s say “audience”, determine what audience types there are, such as “internal” and “external,” and set that up in the categories option of your taxonomy management software. Then you assign the categories, as appropriate to each term. This method works if the same terms and the same structure are being used in both cases, with one use having more specific terms in some areas. The other use may have less specific terms, or also more specific terms in other areas.

The method of using the same taxonomy for different uses, designating use by categories on terms requires
  • that when a concept in the taxonomy has more than one use, that the same preferred term label is used for the concept in both/all cases
  • that concepts/terms in the taxonomy have the same relationships to each other
Sometimes, however, different business needs require different preferred labels for a concept, such as “Customers” vs. “Clients.” It is possible to maintain multiple preferred labels for a concept, if you manage them as you would manage multiple language versions of a term in a multilingual taxonomy, but this is more complexity than necessary when only some of the terms have different preferred label.

If you want to maintain links between equivalent terms, whether they have the same preferred labels or not, in different business-use versions, it’s not necessary to maintain them in the same taxonomy akin to multilingual versions. Rather, if you created two separate taxonomies, you could still set up inter-taxonomy links between the equivalent terms. This is not necessary, but might be desirable.

Whether to maintain one or more taxonomies also depends on the size of each. If one of the business use cases requires only a small taxonomy, of a couple hundred terms or less, then it is not too much trouble to maintain distinct taxonomies for each.

Saturday, September 30, 2017

Vocabulary Management Issues

Issues in Vocabulary Management” is the latest Technical Report (TR-06-2017) published by the National InformationStandards Organization (NISO), approved on September 25, 2017. I had the honor of serving on its working group, specifically on its subgroup for Vocabulary Use/Reuse.

The most significant NISO publication for controlled vocabularies is ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, which is referenced several times in TR-06. ANSI/NISO Z39.19 focuses on how to design and create controlled vocabularies (especially thesauri and taxonomies), whereas TR -06 addresses issues in the use of controlled vocabularies. Furthermore, as a Technical Report, rather than a Standard, this 49-page document does not contain requirements, but rather serves an informative purpose. It does have a page of recommendations, though, which are for a vocabulary’s definition and attribute types, its best practices for documentation, and its licensing or provisions for use and reuse.

Over time, the need to create new controlled vocabularies from scratch diminishes, as more vocabularies come into existence, especially those that are made available for sharing or licensing (see my blog post Directories and Databases of Published Controlled Vocabularies) but the need to maintain, revise, and reuse them grows, so this Technical Report serves a valuable role.

What are the “issues” in vocabulary management? They could vary, based on the organization and implementation, but this document considers three areas of

  • Vocabulary use and reuse, dealing with permissions, licenses, maintenance, versioning, extending and mapping vocabularies.
  • Vocabulary documentation, dealing with governance issues and how to document vocabulary properties.
  • Vocabulary preservation, dealing with issues of abandoned or “orphaned” vocabularies, which is especially the case of vocabularies developed by nonprofit organizations which have lost their funding to maintain them.

These issues are relevant to both proprietary controlled vocabularies, which may be reused through licensing agreements, and publicly available vocabularies, which are shared and reused increasingly through linked data on the web, or more specifically the Semantic Web and the Linked Open Data environment.  For publicly available or open vocabularies there are also the issues of simply finding or discovering suitable and sustainable vocabularies and evaluating them and then the communication between the vocabulary owner and user.

TR-06 takes a somewhat broader view of “vocabularies,” not just “controlled vocabularies,” but also including ontologies, unstructured term lists, terminologies, synonym rings, etc. I explored these differences and definitions in detail in my blog post Vocabularies and Controlled Vocabularies, which I wrote shortly after starting work on the NISO working group. The vocabularies of concern of TR-06 also include element sets, which comprise metadata properties/fields and not merely the controlled vocabulary terms/values within those properties.

TR-06 does not seem so much as a “technical report.” It also includes several real-life examples and use cases. To a certain extent, it explains by example.  Appendices include a glossary of terms with extensive definitions; a descriptive list of vocabulary directories, repositories or collections (something that I worked on); a list of free and open vocabulary tools (far more extensive than those I described in a previous blog post Free Taxonomy Management Software); and a list of additional resources with links, besides its bibliography, making this quite a valuable resource.

TR-06 “Issues in Vocabulary Management” will now be added to my list of recommended resources for controlled vocabulary and taxonomy management, and I hope that many of those who manage taxonomies will take a look at it.

Tuesday, August 29, 2017

Taxonomies in SharePoint

Controlled vocabulary metadata, including hierarchical taxonomies, has been supported in SharePoint since its 2010 version, and its use and features have been enhanced is succeeding versions of SharePoint. While it’s not technically difficult for users to create taxonomies and apply their terms to content items in SharePoint, developing a metadata/taxonomy design and application strategy is definitely a challenge.

The distinction and overlap between metadata and taxonomies was the topic of my previous blog post, "Metadata and Taxonomies," and it is very relevant to SharePoint. Controlled vocabularies or taxonomies used to tag content are referred to in SharePoint as “managed metadata.” This designation is indeed accurate and fitting. Some, but not all, metadata is in taxonomy form (hierarchical structures), and in SharePoint it is managed/controlled in a central way, where permissions on who can change or add to the metadata terms may be limited to a smaller set of users than those who may tag content with the metadata. “Managed Metadata” is something you will hear about in documentation, but in the SharePoint application itself, what you want to work with is “Term store management,” grouped with other Site Administration settings under “Site Settings” (under the gear symbol in Office 365 SharePoint). Terms are grouped into “Term Sets” (top-term hierarchies or facets).
Questions to consider in taxonomy design in SharePoint include:

  • To what extent will document libraries (virtual folders) be used to categorize content within a site, and would proposed subfolder names be better suited as metadata terms for tagging?
  • Will the primary use be for filtering lists of documents in place, within an open document library, based on metadata selected for the various “columns,” or will the primary use be for refining search results, based on metadata selected in the left-hand margin refinement panel after executing a search?
  • How many Term Sets should be created and how many and which metadata fields in total should display to the users, either in columns or as search refinements.
  • When should a Term Set be a flat list and when should it be created as a hierarchy, and how deep should the hierarchy be?

Use of document libraries vs. metadata tagging 


SharePoint supports the creation of a hierarchy of nested folders within libraries within sites. So, it may be tempting to start of creating such a “taxonomy” of categories for content, especially if migrating content over from a shared drive where such folders had been used. However, tagged metadata has many advantages over categories of folders for finding and retrieving content.
A content item may be tagged with more than one term from the same Term Set if it deals with more than one topic or if it falls into more than one category type, whereas putting a document in more than one folder can lead to version control issues. (It’s true that you can put a document in one folder and a link to it from within another folder, but this is not easily remembered to do, nor does it look as “clean.”)

You can create Term Sets (as facets) each for multiple ways to categorize, such as by document type, function, audience, topic, etc., serving as facets, and then tag a content item with terms from each, whereas folders don’t deal well with mixed methods of categorization, and you are forced to choose one method of categorization.

  • Tagged metadata allows you to filter a large set of content in place to quickly narrow results to what you want, whereas folders require clicking down through multiple paths, taking more time to find desired content, which is in different places.
  • Tagged metadata can also be implemented as search refinement filters, also known as faceted search.
  • Tagged metadata terms can have synonyms, helping users find what they want by different names, whereas folder names cannot have synonyms. 

Thus, what had been labels for folders on a shared drive should most likely be changed to terms in taxonomy Term Set. Whether you should have any document libraries, or just a few without subfolders, depends on the preferences of your users, but I don’t recommend the creation of subfolders. 

Filtering on columns vs. refining searches


The same Term Sets may be used for both column filters and search refinements. But typically, the implementation of managed metadata in SharePoint is either primarily for one or the other purpose, and the other use may not even be set up. Generally, if the managed metadata is going to be applied to documents within a single library or one site, on the order of tens or hundreds, then column filters are desired; if the managed metadata is going to be applied across multiple sites on thousands of documents, then search refinements would be used.

If unsure whether to promote filtering on columns or refinements on search, consider that filtering columns will always get more accurate results, but metadata has to be consistently applied.  Out-of-the-box search in SharePoint will retrieve documents with the word or phrase anywhere in the document. The idea behind this is to get search set that is larger than needed and not miss anything, and then the user can refine the search result with the various refinements. So, the results of search are not as accurate, but there will be results, even if metadata tagging is incomplete.

Knowing how the Term Sets will be used can have an impact on the wording of terms and the extent of use of hierarchy. Both columns and refiners have limited width for term names to display. The user can easily adjust column widths to accommodate long names, but the refinement panel width cannot be widened by the user. The use of columns also makes it desirable to keep terms to a limited length within a given Term Set.  Refiners indicate hierarchies of terms by default to the user who is searching content, whereas columns do not indicate any hierarchy in the default view.

Number of terms sets and metadata fields


There is no point in creating a Term Set if it’s not going to be displayed to the users for filtering or refining, and too many metadata fields take up horizontal or vertical space, are a burden to tag, and make the user experience of searching or filtering too complicated. So, you need to consider what would be truly useful, and not merely possibly nice to have. Just two Term Sets, such as Document Type and Topic, may be sufficient. In addition to the managed metadata that you create, there will be other metadata fields desired for filtering or refining, such as date, author, and format type, and perhaps uncontrolled keyword tags applied by users. In the case of columns, there will always be the document title taking up a column and considerable horizontal space as well. 

The default columns in SharePoint are “Type” (file format) “Name” (filename), “Modified” (the date the file was uploaded or the last time any of its properties were updated, not the date the file itself was modified) and “Modified By” (the person who uploaded the file or last updated its properties, but not necessarily who actually modified the file). The default search refinements are the same, excluding title: “Result type” (file format), “Author” (an even worse misnomer for Modified by”), and “Modified date” (often displayed in a graph form). If you believe such information is not that valuable, you can remove these columns/refinements, especially when you plan to add other columns/refinements, which will take up horizontal or vertical space. 

I would recommend no more than 4-7 total metadata fields, including those that are not based on managed metadata. You should avoid having more metadata as columns, along with the document titles, than can fully display horizontally, so as not to require horizontal scrolling. Search refinements, on the other hand, by default display sample high-use terms under each refiner, so typically no more than three refiners display in the left margin without vertical scrolling. Vertical scrolling is expected and acceptable to a limited degree.

Term Sets as flat lists or hierarchical taxonomies


SharePoint makes it easy to create hierarchies within Term Sets by simply right-clicking on a selected term and selecting “Create Term” from the context menu. Some people might thing that since the Term Store is for taxonomies, and taxonomies are hierarchical, hierarchies should be created if applicable. However, hierarchies are only helpful for navigating the taxonomy if the taxonomy is sufficiently large. If you set up multiple Term Sets, each used as a facet in combination with others, they each don’t need to be very large. Furthermore, the types of content most people store in SharePoint tends not to need extensively large and detailed taxonomies as might be needed in a content management system or digital asset management system

My rule of thumb is up to 12 terms should be on the same level before considering creating any hierarchy, but it could go up to 20 or so, and even more if the list of named entities/proper nouns.  Also, if you do have hierarchies, consider keeping them relatively shallow, such as to only two levels, instead of three. Even if a hierarchy is technically correct, it does not mean you have to set it up that way.

If you need only a short flat list of terms, you might consider not using the Term Store at all, but rather create the list as "Choice" type of column. This is easier to implement, but the terms would limited in their use to filtering and sorting the column, and could not also be applied to search and navigation. 

Monday, July 17, 2017

Metadata and Taxonomies

Metadata and taxonomies are related. In The Accidental Taxonomist, 2nd edition (pp. 15-18), I explain that most, but not all, taxonomies (not purely navigational taxonomies) serve to populate terms/values in metadata fields/elements; and some, but definitely not all, metadata fields are populated by terms/values from controlled vocabularies or, more specifically, taxonomies (in contrast to free text or key words).

The question remains whether to start with creating the overall metadata strategy and schema and then build taxonomies as part of it as needed, or to start with creating a taxonomy and then, in the process, identify the various descriptive metadata.  Ideally the two are developed for implementation combination, as part of an integrated strategy. However, an expert in taxonomy development (a taxonomist) and an expert in metadata design (a metadata architect) are usually not the same person.

A metadata architect can become an accidental taxonomist, and a taxonomist can become an accidental metadata architect, or the two experts can work together on the same project, although it is not so common for an organization to have both such experts on staff.  Whether an organization has a metadata architect or taxonomist depends on the nature of the organization’s content and content organization needs.

Organizations that start with the metadata expertise and approach to information management tend to be those with significant needs in digital asset management (with image or other media collections), records management (in highly regulated industries), publishing, or cultural preservation (museums or libraries). Organizations that start with the taxonomy expertise and approach include product or service providers, distributors and retailers (especially in ecommerce), and organizations focused on providing information resources.

A hierarchical taxonomy can be integrated with metadata, when one of the metadata fields is for “Topic” or “Subject,” and there is a hierarchical taxonomy of subject terms associated with that field. However, it is the faceted type of taxonomy in particular that unites the tasks of taxonomy creation and metadata design.

Faceted Taxonomies and Metadata

A faceted taxonomy comprises a set of facets, each an individual controlled vocabulary, whose terms are generally not linked/related to terms in the other controlled vocabulary facets, but the combination of terms from a combination of facets are used to tag the same set of content, and users search/filter on terms in combination from various facets. Examples of facets may be Product/Service, Market Segment, Location, Document Type, Supplier, etc. A faceted taxonomy is a common type for both enterprise taxonomies and ecommerce or product review taxonomies, and it’s a type of taxonomy that taxonomists are familiar with creating. It’s called a “taxonomy” even though it differs from the classical hierarchical “tree” type of taxonomy, because it involves controlled vocabulary and classification. The name for each facet and the terms within the facet constitutes a simply two-level hierarchy.

Each facet is also a metadata field/element. The taxonomist designing a faceted taxonomy is thus also designing metadata, at least some of it. There are usually more metadata fields to describe the content beyond those which comprise the taxonomy facets. For a faceted taxonomy to best serve the user who is trying to find/discover content based on what it is and what it is about, the number of facets should be limited. (See my earlier post "How Many Facets.") Metadata, however, can serve additional purposes beyond helping users find content. Metadata may describe content for purposes of full identification, source citation, or information on how the content can be used, including rights data.  The taxonomist or metadata architect needs to decide which metadata fields will constitute a displayed faceted taxonomy for the end-user to utilize in search/discovery, and which metadata fields will not but will rather display on a selected content record.

On the other hand, there may be additional metadata fields beyond the scope and definition of “taxonomy” that are nevertheless made available to the end-user to filter/refine results alongside the other, taxonomy facets. These could be for author/creator, date, title keyword, text keyword, file format, etc. Sometimes the distinction between taxonomy facet and other metadata in this case is not so clear, such as for Document/Content Type, Audience, or Language, when these fields utilize controlled vocabularies. Due to this overlap and blurred distinction between taxonomy facets and displayed metadata for filtering, it is a good idea to design the taxonomy and metadata together as an integrated strategy.

Sunday, June 18, 2017

Standards for Taxonomies

Since “taxonomies” are rather loosely defined, standards specifically for taxonomies do not exist, but there are standards that are relevant to taxonomies. A taxonomy is a kind of controlled vocabulary, and there are standards for controlled vocabularies. There are also standards specifically for thesauri, a kind of controlled vocabulary with which taxonomies typically share many features. 

Standards serve various purposes. Two leading purposes for standards are:
  1. To ensure consistency and ease of use across different products or systems used by different users.
  2. To ensure interoperability, the sharing or exchange of products/services/information.

Standards for Consistency

Standards aimed at ensuring consistency and ease of use would include buttons on devices, menus in user interfaces, pedals in cars. With such standards, users can expect the same experience from manufacturers or service providers and thus they are able to easily use products or systems from different manufacturers/providers/vendors. In the case of information systems, this kind of standard includes those for the design and style of book indexes and thesauri. These “standards” tend to be guidelines, recommendations, or accepted conventions, and not exactly strict standards, even if issued from a standards body. For thesauri, the “standard” is issued by the NationalInformation Standards Institute (NISO), but it is called a "guideline”: ANSI/NISOZ.39.19 Guidelines for the Construction of Monolingual Controlled Vocabularies. The corresponding ISO standard is ISO 25964 Part 1: Thesauri for Information Retrieval.

These guidelines cover style and form of terms, circumstances for creating the various kinds of relationships between terms, use of notes on terms, etc. They are all about how to create well-formed thesauri with consistent design features that are then easy and intuitive to use. For example, when a user sees that two terms are in a hierarchical relationship, the user understands that the narrower term is a kind of, instance of, or integral part of the broader term, and not merely an aspect of or some other related concept of the broader term. In fact, the end-user of a thesaurus does not even need to know and understand thesaurus principles to be able to make use of a thesaurus to find desired concepts and content.

Standards for Interoperability

The other kind of standards, those aimed at ensuring interoperability, would include standards for size and units of measure, data exchange, and communications protocols. Interoperability standards are important for those controlled vocabularies which are intended to be shared or reused. Thus, the content to which controlled vocabularies link can be accessed by third parties or made publicly accessible over the Web. Controlled vocabularies may be “reused”, if the original creator of a controlled vocabulary decides to license the vocabulary (without linked content) to other publishers to use on their own content, so that the second publisher does not have to reinvent a controlled vocabulary that already exists in same subject area.

Interoperability standards for controlled vocabularies include ZThes (a thesaurus schema for XML, which is has since gone out of style), World Wide Web Consortium (W3C) specifications for the Semantic Web including SKOS (Simple KnowledgeOrganization System) and the Web Ontology Language (OWL) for ontologies, and ISO 25964 Part 2: Interoperability with other vocabularies. Indeed, ISO 25964 covers consistency standards in its first part and interoperability standards in its second part. 

Metadata Schema

Since taxonomies or other controlled vocabularies may be used to provide terms that fill a certain metadata element/property/field within a larger set of metadata, the use of a standard metadata schema or model is yet another way in which interoperable standards involve taxonomies.  If structured content is to be shared or exchanged, the metadata fields need to be standardized with the same names, abbreviations, and purposes.

Examples of standard metadata schema include MARC for library materials, Dublin Core (DCMI) for generic online networked resources, IPTC (International Press Telecommunications Council) for photographs and other media, DDI (Data Documentation Initiative) for describing data from the social sciences, and PREMIS (Preservation Metadata: Implementation Strategies) for repositories of digital objects. Adopting such a metadata schema would be another way to enable sharing of content tagged with the metadata.

I was pleased to have the opportunity to learn more about information and publishing standards recently at the Society for Scholarly Publishing conference in attending pre-meeting seminar “All About Standards.”