SEMANTIC INDEXING FOR COMPLEX PATIENT GROUPING
Kilian STOFFEL†‡ Joel SALTZ†‡, Jim HENDLER†, Jim DICK‡, William MERZ‡ and
we are working to provide medical specialists with
In this paper we describe indexing techniques based
the ability to express more complex queries without
on domain knowledge made available in the form of
becoming experts on the underlying data model. In
ontologies. In high level interfaces like those used in
this paper we show how we can support this efficient
indexing, facilitate complex data access, and support
advantageous to use terminology familiar to the end-
high-level querying for users untrained in the details
user of the system. This terminology is very oftendifferent from the one incorporated in the underlyingdatabase. Much of the terminology (i.e. concepts) in anMOTIVATING EXAMPLES end-user ontology will map to complex collections of
As stated above, our examples are motivated by a
data warehouse attribute value pairs. We create a
group of applications that involve the analysis of
mapping between the end-user terminology and
information in a medical data warehouse we are
attribute-value pairs in the data warehouse. We
constructing at Johns Hopkins Hospital. The data
optimize performance by indexing the data warehouse
warehouse includes in-patient and outpatient data
using an ontology. We optimize performance by
from the microbiology, blood bank, clinical chemistry
indexing the whole database using an ontology that
and hematology laboratories, along with pharmacy
represents the end-user’s terminology.
data and some clinical data. The microbiology data isobtained through a large and varied group of
INTRODUCTION
identification techniques, which are generally carried
A joint effort between computer scientists at the
out in a stepwise manner. Protocols are then followed
University of Maryland and clinicians at Johns
to produce a more precise identification.
Hopkins Hospital focuses on using high performance
The scope of this paper precludes a detailed
computing technology in support of medical
discussion of microbiological taxonomy 1. A bacterial
applications. One focus is on providing tools that
support medical data warehousing. One particular
1. structural attributes of size, shape, Gram stain
project focuses on providing database support to
microbiologists and infectious disease specialists.
The taxonomy used to describe microbiological
data is complex; this terminology also changes with
time. A problem we have encountered is the difficulty
4. biochemical information on cell composition and
clinicians and researchers face in formulating queries
that capture the kinds of questions they wish to pose.
As a means for overcoming this sort of difficulty,
we are looking at how to efficiently integrate
Microbiological identification requires a process
‘‘semantic’ knowledge, stored in the form of thesauri
of iterative refinement; there is not a single battery of
(or more technically, ontologies) in ways that support
tests that can be applied to all specimens to obtain a
efficient indexing of large databases. In particular,
The set of protocols required to precisely identify
80% of the specimens belonging to any sub-category
an organism can be represented as a directed acyclic
graph. Most organisms are not precisely identified as
Note that we need to define what is meant by an
the iterative process of organism identification
effective antibiotic. In the context of a clinical
continues only as long as clinically useful information
laboratory, MIC values are used to evaluate antibiotic
is obtained. A highly imprecise identification is
effectiveness. No single MIC threshold defines
permissible as long as no clinical decision hinges on a
effectiveness for all antibiotics; each antibiotic has its
more complete identification. Identical organisms
may be identified to different degrees of precision as
Our tools support queries that characterize the
clinical requirements may vary by anatomic site or
effectiveness of antibiotics with respect to many
possible categories of organisms. The two categorical
terms (antibiotics and Enterobacteriaceae) allow us to
microorganisms and the non-uniform degree of
discuss two types of categories. The first term
microorganism identification create difficulties for
"Antibiotics" is replaced by a list of all antibiotics
those who wish to pose queries to a clinical data
warehouse. We will present some realistic examples
"Enterobacteriaceae” is replaced by all its sub-
of queries that are of clinical interest at Johns
categories. Sub-categories of Enterobacteriaceae are
Hopkins Hospital. The clinical context associated
not disjoint. For instance, Citrobacter constitutes one
with these examples is an ongoing effort to
subcategory and Citrobacter diversus and Citrobacter
characterize microorganism antibiotic susceptibility. freundii constitute two additional sub-categories.
This issue is of continuing concern because of thecontinuing
Which groups of patients have been infected by
microorganisms. In the examples below, note that the
organisms that have been proven to be increasingly
minimal inhibitory concentration (MIC) of an
antibiotic is the lowest concentration of antibiotic that
For a given group, we want to identify classes of
organisms that exhibit an increase in resistance, overtime, to specific antibiotics. This scenario assumes
Produce a histogram of the minimal inhibitory
that we have an ontology that specifies a set of groups
concentrations to the antibiotic Ciprofloxacin for all
to which each patient belongs. Examples of such
specimens that grew out gram negative non-lactose
groups could include a neighborhood, school and
workplace (in an outpatient setting) and might
More precisely, the task is to: (1) report the
include hospital unit and service in an in-patient
Ciprofloxacin minimal inhibitory concentrations of
setting. Other groupings based on demographic data
all specimens that grew out organisms that belong to
(e.g. age) would also be possible. Furthermore
each category C (sub-concept) of gram negative non-
groups would also include potentially relevant
lactose fermenting bacteria, and (2) produce a
diagnostic categories or current treatment with a
histogram for each category C grouped by MIC value.
therapy that involves immunosuppression.
Note that in this query, the user doesn’t have to
know which bacteria are supposed to ferment lactose. SEMANTIC INDEXING
Furthermore, as long as the ontology defines a species
In this section we describe a sophisticated
as being one that ferments lactose, it does not matter
indexing schema which allows us to support the kinds
whether or not the protocol used to identify the
of queries described here in an efficient way. Our
bacteria actually involved evaluating whether or not a
indexing scheme is based on ontologies: taxonomic
information with additional links that representassociated properties. Which antibiotics are effective against 80% oforganisms recovered from cultures that grew out anyOntologies
employed to represent ontologies are DAGs (directedacyclic graphs). A node in the DAG is called a
concept and represents a specific object or action.
antibiotics are effective against 80% of specimens
The directed links pointing from one concept to
belonging to the concept "Enterobacteriaceae" or
another define the concept/sub-concept relationships. Microorganism Gram negative Bacteria Enterobacteriaceae Escherichia coli Hafnia alvei Microorganism Figure 1: Relation between an Ontology and a data base.
to concepts in the ontology. This mapping requires
knowledge (WordNet) 2 ,other ontologies specifically
the use of a data dictionary to translate database
target medicine (e.g. SnoMed 3 and UMLS). 4 We
attribute value pairs to ontology concepts. An
define a mapping between the attributes used in the
example is given in Figure 1. The solid links are
data warehouse and the terminology represented in an
ontological links and the dashed links are indexing
ontology. Users employ terms defined in an ontology
when generating data warehouse queries. The use of
Thus, for example, we have indexing links from
ontologies to help users select terms needed in their
the ontology concepts Escherichia coli and Hafnia
queries has been described by several researchers, see
alvei to tuples 1 and 3. We also have an indexing link
from the ontology concept Enterobacteriaceae to
Ontologies can also provide a way to group
tuple 2. Tuples 1 and 3 each record a specific
records of a database in a semantically meaningful
microorganism genus and species. Tuple 2 records a
way. This type of semantic grouping can be used to
microorganism that is identified only as a member of
optimize query performance. We anticipate that
the family Enterobacteriaceae. Note that Escherichia
experts will frequently access data using groupings
coli and Hafnia alvei are also members of the family
defined in an ontology. The ontology can be used to
Enterobacteriaceae, but we only index the concept
create indices which allow us to retrieve data grouped
that maps directly to an attribute value pair.
by ontological concepts. Just as b-trees permit theretrieval of a range of data, we are able to retrieve aset of records which are semantically associated with
Data Structures and Algorithms
a concept in an ontology. These optimizations make
Data structures: Each concept consists of three
it practical to develop a tool for formulating complex
components: (1) The concept ID, (2) a list of sub-
queries such as the queries depicted in Section 2.
concepts, and (3) a list of pointers to database tuples. The sub-concept list associated with concept C is a
USING ONTOLOGIES FOR INDEXING
list of pointers into the ontology file that identifies allsub-concepts of C. The list of pointers to database
Indexing
tuples identify instances of concept C. These basicdata structures are used by the following algorithms:
In order to use the ontologies for indexing we
have to establish links between the data in the data
Instances: This operation retrieves all the
warehouse and the concepts in the ontology. All pairs
direct instances of a concept.
of attributes and values in the database† are mapped
universal relation. This is not a necessary condition,
† Throughout this paper we will make the
but simplifies the discussion of the basic ideas.
simplifying assumption that data is stored in a
1: A := TRANSITIVE_SUBCONCEPT(Enterobacteriaceae)2: B := TRANSITIVE_SUBCONCEPT(ANTIBIOTIC)3: C := {susceptible, very susceptible }4: for each c in C D := D union TRANSITIVE_INSTANCE(c)5: for each a in A6: for each b in B
E[a,b] := TRANSITIVE_INSTANCE(b)∩ TRANSITIVE_INSTANCE(a) ∩D report (a,b) if (count(E[a,b])/count(TRANSITIVE_INSTANCE(a)))>0.8
Figure 2: Pseudocode for Example 2. Sub-Concept: This operation retrieves the sub-
Enterobacteriaceae sub-concept a proved to be
Transitive Sub-Concept: This operation returns
susceptible or very susceptible to antibiotic b.
all sub-categories reachable from a concept by
In our database, we maintained results for 27
following all possible directed links.
antibiotics and we store 65 sub-concepts of
Transitive Instance: Transitive instance is a
Enterobacteriaceae. Thus Example 2 allows us to
combination of the transitive sub-concept operation
evaluate antibiotic susceptibilities for 1755 different
and the instance operation. In the first step, all
combinations of organism and antibiotic. This query
reachable sub-concepts are collected. In the next step,
required roughly 40 minutes on a 150 MHz Pentium
all instances of these concepts are gathered.
PC using a database with 20521 records oforganisms. The scope of this paper precludes a
DISCUSSION OF THE EXAMPLES
detailed discussion of results obtained. We found, forinstance, that over 80% of our Enterobacteriaceae
In this section we present a more detailed
were susceptible or very susceptible to amikacin,
description of Example 2. Space limitations prevent
ciprofloxacin, ceftazidime, cefuroxime, gentamicin,
us from presenting detailed discussions of the other
piperacillin, trimethoprim/sulfa, ticarcillin, and
tobramycin. For all species of Citrobacter takentogether, only amikacin, ciprofloxacin, gentamicin,
Example 2
and tobramycin were effective against 80% of
Which antibiotics are effective against 80% of
specimens. However, for Citrobacter diversus three
organisms recovered from cultures that grew out any
additional antibiotics were effective in 80% of
Figure 2 depicts the steps that must be carried out
In order to execute this query without using the
to implement Example 2. In Step 1 we calculate all
indexing scheme, most of the 1755 different
transitive subconcepts of “Enterobacteriaceae” and
combinations of organism and antibiotic would have
assign this to Set A. Set A will then contain all
to be accounted for through large disjunctive queries.
microorganism names that are subconcepts of“Enterobacteriaceae”. In Step 2 we assign to set B the
RELATED WORK
transitive sub-concepts of the concept “Antibiotic”. In Step 3 we create a simple set with the two concepts
The integration of ontologies into data base
“susceptible” and “very susceptible”. The union of all
systems is of growing interest. Due to the space
the transitive instances of “susceptible” and “very
restrictions we are not able to provide here an
susceptible” is then assigned to set D in Step 4. Set D
overview of all the different approaches. Ullman
thus contains all patient records where susceptible
presented a selection of these ideas in 6 .
bacteria were found. We then intersect the set D with:
There are a wide variety of projects that address
(1) all transitive instances of each sub-concept of
issues associated with making it easier for users to
“Antibiotic” and (2) all transitive instances of all
formulate medical database queries (e.g. . 5, 7, 8). Our
sub-concepts of “Enterobacteriaceae”. This yields a
focus is somewhat different as we provide a powerful
collection of sets E(a,b). Each set E(a,b) contains
(but not simple) user interface for automatic
generation of very large sets of related queries.
Enterobacteriaceae sub-concept a and each antibiotic
sub-concept b. Finally, in step 6, we report (a,b) only
presented in this paper can be found in some of the
more powerful search engines that target text data
Journal of the American Medical Informatics
bases; these search engines are mainly used in WEB
Association, Volume 3, Number4, Jul/Aug 1996.
browsers. 9 10 In our approach, we apply ontology
Jeffry D. Ullman. The database approach
indexing schemes to relational databases rather than
to knowledge representation. Proceedings of the
to text databases (although in future work we will
extend our work to a combination of relational and
Association for Artificial Intelligence. AAAI Press,
text databases). Another important difference is that
we support transitive closure operations on concepts
C. Safran, D. Porter, J. Lightfoot et al.
and instances. This functionality increases flexibility
CliQuery: a system for online searching of data in
and makes it possible to carry out sets of closely
a teaching hospital. Ann Intern Med. 1989;111;751-
related queries in an optimized fashion.
Complex indexing schemes are also known in the
B. Leao, E. Reategui. HYCONES: a
deductive data base community. A good overview of
Hybrid Connectionist Expert System. SCAMC
these systems is given in 11. These systems offer
more functionality than the system we propose but do
altavista-web.
not scale well with increasing database and ontology
http://altavista.digital.com/av/lt/help.html, Altavista. [10] medline-web CONCLUSION AND FUTURE WORK [11] Jose Alberto Fernandez, Jarek Gryz, and Jack Minker, Disjunctive deductive databases:
We presented an efficient semantic indexing
Semantics, architecture,
scheme for complex grouping operations. We showed
Proceedings of the 4th Bar -Ilan Symposium on
this scheme could be used for integrating ontological
Foundations of AI, pages 256--274. AAAI Press,
and relational data. A prototype of the system is
currently being tested in the Johns Hopkins Hospital,
[12] J. Saltz, G. Agrawal, C. Chang, R. Das, G.
the prototype is being used to track the spread of
Edjlali, P. Havlak, Y.S. Hwang, B. Moon, R.
antibiotic resistant bacteria, evaluate patterns of
Ponnusamy, S. Sharma, A. Sussman and M. Uysal,
antibiotic use, and to screen for nosocomial
Programming Irregular Applications: Runtime
infections. In the near future, we plan to parallelize
support, Compilation and Tools, In Advances in
the prototype to allow it to function in an interactive
Computers,Academic Press, Vol 45, 1997.
manner. In past work, we have parallelized many ofthe computational components employed in thisprototype12 and we anticipate that parallelizing ourprototype should be relatively straighforward. We arealso currently exploring the use of these indexingtechniques in other domains in which complex dataretrieval is used and where ontologies can begenerated. REFERENCES I.Bergey. “Systematic Bacteriology”. George A. Miller. “Human language
Department, Green Hall, Princeton University, 1996. Roger Code. “Systemized Nomenclature of Human and Veterinary Medicine: SNOMED”. College
Veterinary Medical Association, 1993. [4] UMLS. “Unified Language System”. National Library of Medicine, 1994. George Hripcsak, Barry Allen Janes J. Cimino and Rober Lee. Access to Data: Comparing AccessMed with Query by Review.