Open Source Book: Effective Data Modeling Topic #1

This is the first of many topics I will introduce for an open source book entitled: "Effective Data Modeling".  I heartily encourage the entire data, metadata and development community to jump in with your questions and comments by emailing me at mdaconta at oberonassociates.com (no mailto: link to hopefully avoid some spam).

The focus of the book is on resolving specific modeling challenges.  Each topic represents one challenge.  Here is an initial outline of potential topics to write on (I will add more description to each of these to explain the problem):

  1. Class versus Instance.  The boundary problem. (in this article)
  2. Collection versus Class.
  3. Class versus attribute.
  4. Characteristic versus Association.
  5. Abstract Class versus Concrete Class.
  6. Transaction versus Entity.
  7. Role versus Class.
  8. Fixed versus dynamic enumerations.
  9. Identifiers and Identity.
  10. Opaque versus Explicit associations.
  11. Globally Unique versus Locally-unique identifiers.
  12. Value codes versus Labels.
  13. Specialization versus Subclass.
  14. Semantic versus Opaque identifiers.
  15. Long versus Short Names.
  16. Part-of versus Subclass.
  17. Explicit versus implicit qualifiers.
  18. Context-dependent versus independent metadata.
  19. Explicit versus implicit scope.
  20. Binary versus N-ary relationships.
  21. Optional versus mandatory attributes.
  22. Composition versus aggregation.
  23. Permanent versus temporary relationships.
  24. Concept versus representation.
  25. Attribute versus category.
  26. Informal (subjective) versus formal categorization.
  27. Intensional versus extensional membership.
  28. Categories versus views.
  29. Upper ontology versus ontology mapping.
  30. Conceptual versus logical data models.
  31. Metadata versus data.
  32. External versus Internal code tables.
  33. Term versus Class.
  34. Set versus collection.
  35. Alternative code lists versus harmonized code list.
  36. Folksonomy versus taxonomy.
  37. Enumeration versus identifier.
  38. [What modeling issues are you facing? Email me to discuss!]

Topic #1: Class versus Instance.

In software development, a Class is a structure or blueprint for how to construct an object (also called an instance).  In many object-oriented programming languages this act of construction is called "instantiation".  In knowledge representation, a class (also called a Type) represents a group of characteristics that determine whether an individual is a member of a group.  So, for example, Mammal is (roughly speaking) the class of animals that breathe air and bear live young, and an individual animal in that population is a particular representative, or instance, of the class.  Classes and instances often get confused during the process of developing taxonomies.  For example, if we are dividing aircraft into categories we could easily create an initial hierarchy that looks like this:

Aircraft
     Fixed Wing
          Bomber
          Fighter
               F-16
               F-4
          Commercial Passenger
     Rotary Wing

The step from Fighter to F-16 is where people often believe that a line has been crossed from Class to Instance.  There are two modeling problems we face here: the first is whether the distinction between Class and Instance (or Type versus Object) is arbitrary, and the second is the question of the granularity of instances.

Let's tackle the first problem first.  From an implementer's perspective a class is a group of characteristics (synonymous with columns in a database table), so whenever we add characteristics we are creating new classes.  In that vein, the implementation of an instance is the population of those characteristics with specific values (accomplished in a database by creating a row and in programming by reserving a chunk of memory).  So, if you look at this solely from an implementer's perspective, the line can be blurred arbitrarily.  However, from a taxonomist's perspective this is dangerous, because the role of a category in a taxonomy is to group instances, and thus instances should not be part of a taxonomy.

So, going back to our example, how do we determine whether an F-16 is a class or an instance if it should not be an arbitrary decision?  The answer lies in the distinction between a new type of thing and a particular style of a thing.  In other words, is an F-16 a distinct type of fighter aircraft or is it just one of many styles?  The way to answer this is to ask yourself: are there other fighter aircraft with the same characteristics as an F-16 that vary only in non-type-distinguishing ways?  An example of a non-type-distinguishing variation is the country in which an aircraft is manufactured.  If the answer is yes, the F-16 is grouping a population of individuals and is acting as a class, while the variations among those individuals are styles that do not belong in the taxonomy.
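The class-versus-style test can be sketched in code.  What follows is only an illustration in Python, not a prescribed implementation: the attribute names tail_number and country_of_manufacture, the helper taxonomy_path, and all values are assumptions made up for this example.  It treats F-16 as a category in the taxonomy, treats individual airframes as instances, and records the country of manufacture as an attribute value rather than as another level of the hierarchy.

```python
from dataclasses import dataclass


class Aircraft:
    """Root category of the example hierarchy."""


class FixedWing(Aircraft):
    """Category taken from the example hierarchy."""


class Fighter(FixedWing):
    """Category taken from the example hierarchy."""


@dataclass
class F16(Fighter):
    """Modeled as a class: it groups many individual airframes."""
    tail_number: str             # hypothetical identifier for one airframe
    country_of_manufacture: str  # a style: stored as a value, not a subclass


def taxonomy_path(cls):
    """Return the category names from the root of the hierarchy down to cls."""
    return [c.__name__ for c in reversed(cls.__mro__) if c is not object]


# Two individuals (instances). The values are invented for illustration.
first = F16(tail_number="EX-0001", country_of_manufacture="Country A")
second = F16(tail_number="EX-0002", country_of_manufacture="Country B")

# Both individuals belong to the same category even though a style differs.
same_category = type(first) is type(second)
path = taxonomy_path(F16)
```

Note that the taxonomy path contains only categories.  The two airframes never appear in it; they are simply grouped by it.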

Now, let's examine the issue of Instance granularity.  As stated previously, the implementer has the freedom of creating a very general class and creating instances of that class.  Let's look at an example of this:

Class: Automobile                 Instance:
------------------                ------------------
Manufacturer:                     Toyota
Vehicle Identification Number:    2837281Y8281
Year:                             2006

Here we see a technically accurate implementation of an instance that is semantically flawed.  The semantic flaw lies in the instance being so coarse-grained that it does not accurately capture what makes the individual unique within the population.  This coarseness prevents the data from being repurposed for other uses or used for complex reasoning about the individual.
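To make the granularity problem concrete, here is a minimal Python sketch that contrasts the coarse Automobile class above with a finer-grained one.  The extra characteristics (model, body_style, color), the helper group_by, and every value other than those in the example above are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CoarseAutomobile:
    """The three characteristics from the example above."""
    manufacturer: str
    vin: str
    year: int


@dataclass(frozen=True)
class FineAutomobile:
    """Same class with additional, hypothetical characteristics."""
    manufacturer: str
    vin: str
    year: int
    model: str       # assumed characteristic
    body_style: str  # assumed characteristic
    color: str       # assumed characteristic


def group_by(cars, attribute):
    """Divide a population into groups of VINs keyed by one characteristic."""
    groups = {}
    for car in cars:
        groups.setdefault(getattr(car, attribute), []).append(car.vin)
    return groups


# Invented population; only the first VIN comes from the example above.
fleet = [
    FineAutomobile("Toyota", "2837281Y8281", 2006, "Model X1", "sedan", "blue"),
    FineAutomobile("Toyota", "EXAMPLEVIN02", 2006, "Model X2", "pickup", "red"),
    FineAutomobile("Toyota", "EXAMPLEVIN03", 2006, "Model X1", "sedan", "red"),
]

by_body_style = group_by(fleet, "body_style")
coarse_names = {f.name for f in fields(CoarseAutomobile)}
```

With only manufacturer and year to work with, all three vehicles fall into a single group.  The finer-grained class lets the same population be divided by body style, model or color, which is the kind of repurposing that the coarse instance rules out.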

So, to summarize: first, the distinction between an instance and a class should not be an arbitrary decision, and taxonomy categories should not include styles of a type.  Second, classes should be designed to create fine-grained instances with enough characteristics to divide a population into manageable chunks.


Do you agree with this analysis?

 

Did this article clarify or confuse you on the issue of Class versus Instance?  Please, let me know...