Conceived and designed the experiments: CAH ALB NAC. Performed the experiments: CAH. Analyzed the data: CAH. Contributed reagents/materials/analysis tools: CAH NAC. Wrote the paper: CAH NB ALB NAC. Worked on database and website: NB.
The authors have declared that no competing interests exist.
The use of networks to integrate different genetic, proteomic, and metabolic datasets has been proposed as a viable path toward elucidating the origins of specific diseases. Here we introduce a new phenotypic database summarizing correlations obtained from the disease history of more than 30 million patients in a Phenotypic Disease Network (PDN). We present evidence that the structure of the PDN is relevant to the understanding of illness progression by showing that (1) patients develop diseases close in the network to those they already have; (2) the progression of disease along the links of the network is different for patients of different genders and ethnicities; (3) patients diagnosed with diseases which are more highly connected in the PDN tend to die sooner than those affected by less connected diseases; and (4) diseases that tend to be preceded by others in the PDN tend to be more connected than diseases that precede other illnesses, and are associated with higher degrees of mortality. Our findings show that disease progression can be represented and studied using network methods, offering the potential to enhance our understanding of the origin and evolution of human diseases. The dataset introduced here, released concurrently with this publication, represents the largest relational phenotypic resource publicly available to the research community.
To help the understanding of physiological failures, diseases are defined as
specific sets of phenotypes affecting one or several physiological systems. Yet,
the complexity of biological systems implies that our working definitions of
diseases are careful discretizations of a complex phenotypic space. To reconcile
the discrete nature of diseases with the complexity of biological organisms, we
need to understand how diseases are connected, as connections between these
different discrete categories can be informative about the mechanisms causing
physiological failures. Here we introduce the Phenotypic Disease Network (PDN)
as a map summarizing phenotypic connections between diseases and show that
diseases progress preferentially along the links of this map. Furthermore, we
show that this progression is different for patients with different genders and
racial backgrounds and that patients affected by diseases that are connected to
many other diseases in the PDN tend to die sooner than those affected by less
connected diseases. Additionally, we have created a queryable online database
(
There are no clear boundaries between many diseases, as diseases can have multiple
causes and can be related through several dimensions. From a genetic perspective, a
pair of diseases can be related because they have both been associated with the same
gene
During the past half-decade, several resources have been constructed to help
understand the entangled origins of many diseases. Many of these resources have been
presented as networks in which interactions between disease-associated genes,
proteins, and expression patterns have been summarized. For example, Goh et al.
created a network of Mendelian gene-disease associations by connecting diseases that
have been associated with the same genes
While progress on the genetic and proteomic fronts has been impressive
Typically, we say that a comorbidity relationship exists between two diseases
whenever they affect the same individual substantially more than chance alone. One
of our primary goals here is to make available pairwise comorbidity correlations for
more than 10 thousand diseases reconstructed from over 30 million medical records.
For completeness and utility, we organize the results in 18 different datasets. Each
summarizes phenotypic associations extracted from four years' worth of ICD9-CM claims
data at the 5- and 3-digit levels. Results are grouped into subsets of race, gender,
and both race and gender (see SM). To facilitate their use, the datasets are
available as a bulk download (
In the past, comorbidities have been used extensively to construct synthetic scales
for mortality prediction
Hospital claims offer reliable, systematic, and complete data for disease
detection
For the 32 million elderly Americans aged 65 or older enrolled in Medicare and
alive for the entire study period, there were a total of 32,341,347 inpatient
claims, pertaining to 13,039,018 individuals (the remaining individuals were not
hospitalized at any point during this period). Demographically, our data set
consists of patients over 65 years old (see
A. Age distribution for the study population. B. Demographic breakdown of
the study population. C. Prevalence distribution for all diseases
measured using ICD9 codes at the 5 digit level. D. Distribution of the
relative risk (
The medical claims were made available to us in the ICD-9-CM format, a controlled nomenclature constructed mainly for insurance claim purposes. Therefore, in some cases, more than one code corresponds to a particular disease, whereas in other cases codes are not specific enough for research purposes. For example, at the 5-digit level there are 33 diagnoses associated with hypertension, which reduce to five at the 3-digit level. Other times, the code is for a symptom such as “dehydration”, which cannot be assigned to any one diagnosis. The vast majority of diseases, however, do map reliably to ICD9 codes.
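The reduction from the 5-digit to the 3-digit level described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the datasets: the function name `icd9_category` is hypothetical, the sample list holds only a handful of hypertension-related codes chosen for illustration, and the handling of E-codes is an assumption about how such codes are laid out.

```python
def icd9_category(code: str) -> str:
    """Collapse an ICD-9-CM diagnosis code to its category (the "3-digit level")."""
    # Accept both dotted ("401.1") and undotted ("4011") spellings.
    code = code.strip().upper().replace(".", "")
    # Assumption: E-codes carry four characters before the decimal point
    # (e.g. E812.0); numeric codes and V-codes carry three (401.1, V10.05).
    width = 4 if code.startswith("E") else 3
    return code[:width]


# A few hypertension-related codes, listed only for illustration.
hypertension_codes = [
    "401.0", "401.1", "401.9", "402.00", "402.01",
    "403.90", "404.11", "405.99",
]
categories = sorted({icd9_category(c) for c in hypertension_codes})
print(categories)
```

Grouping codes this way trades specificity for robustness: several detailed codes that describe variants of one condition are counted together as a single phenotype.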
While hospital claims have been proposed as a reliable method for disease
detection
To measure relatedness starting from disease co-occurrence, we need to quantify
the strength of comorbidities by introducing a notion of
“distance” between two diseases (see
We will use two comorbidity measures to quantify the distance between two
diseases: The Relative Risk (
The distribution of
These two comorbidity measures are not completely independent of each other
(
One important question is how the predictive power of comorbidity based
relationships compares with that of heredity and known genetic markers. Of the
two measures discussed above, the Relative Risk
We can summarize the set of all comorbidity associations between all diseases expressed in the study population by constructing a Phenotypic Disease Network (PDN). In the PDN, nodes are disease phenotypes identified by unique ICD9 codes, and links connect phenotypes that show significant comorbidity according to the measures introduced above.
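As a rough sketch of how such a network could be assembled from patient records, the Python below computes both comorbidity measures for every co-occurring disease pair and keeps the pairs that exceed a threshold. It assumes the standard definitions of the two measures: relative risk RR = C·N / (P_i·P_j) and the φ-correlation (Pearson's correlation for binary variables) φ = (C·N − P_i·P_j) / sqrt(P_i·P_j·(N − P_i)·(N − P_j)), where N is the number of patients, P_i the prevalence of disease i, and C the number of patients with both diseases. The function name, the toy records, and the plain thresholds (used here in place of a proper significance test) are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def comorbidity_links(patients, min_rr=1.0, min_phi=0.0):
    """Return {(i, j): {"rr": ..., "phi": ...}} for disease pairs above both thresholds."""
    n = len(patients)
    prevalence = {}    # P_i: number of patients diagnosed with disease i
    cooccurrence = {}  # C_ij: number of patients diagnosed with both i and j
    for diagnoses in patients:
        unique = sorted(set(diagnoses))
        for d in unique:
            prevalence[d] = prevalence.get(d, 0) + 1
        for pair in combinations(unique, 2):
            cooccurrence[pair] = cooccurrence.get(pair, 0) + 1

    links = {}
    for (i, j), c_ij in cooccurrence.items():
        p_i, p_j = prevalence[i], prevalence[j]
        rr = c_ij * n / (p_i * p_j)
        denom = sqrt(p_i * p_j * (n - p_i) * (n - p_j))
        phi = (c_ij * n - p_i * p_j) / denom if denom > 0 else 0.0
        # Simplification: fixed thresholds stand in for a significance test.
        if rr > min_rr and phi > min_phi:
            links[(i, j)] = {"rr": rr, "phi": phi}
    return links


# Toy records with placeholder disease labels, not real data.
patients = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}, {"C", "B"}, {"D"}]
links = comorbidity_links(patients)
print(links)
```

In this toy example only the pair (A, B) co-occurs more often than expected by chance, so it is the only link retained; the pair (B, C) co-occurs exactly as often as independence predicts and is dropped.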
In principle, the number of disease-disease associations in the PDN is
proportional to the square of the number of phenotypes, yet many of these
associations are either not strong or are not statistically significant (see
SM). Hence, we explore the structure of the PDN by focusing on the strongest and
most significant of these associations. To achieve this, we offer two
visualizations of the PDN (see SM), the first constructed using
Nodes are diseases; links are correlations. Node color identifies the
ICD9 category; node size is proportional to disease prevalence. Link
color indicates correlation strength. A. PDN constructed using
While there are many similarities between the two networks, such as the proximity
between nephritis and hypertension or psychiatric disorders and poisoning, the
overall structure of the PDN and the specific disease groups present in each one
of them reflect the individual biases of the metric used to construct the links.
The network constructed using
While a network representation of diseases has many potential applications, here
we concentrate on three examples illustrating the use of the PDN to study the
illness progression from a network dynamics perspective
These limitations require us to adopt a more conservative approach in our
analysis. Here we explore disease network dynamics by asking three questions
(
A. Schematic representation of the three dynamical questions explored
here. B. Average
To answer the first question (Q1) we use a recently introduced method
to decide whether a node property spreads along the links of a network
While our data does not allow us to be conclusive about the directionality of
disease progression, differences in the strength of comorbidity relationships
can still indicate differences in the dynamics of illness progression. The
reason is that patients affected by a pair of diseases must have traversed the link
between them at some point in time and in one of the two possible directions.
Here, we explore Q2 by looking at differences in the strength of the
observed comorbidities for patients from different ethnic backgrounds and
genders. For this we calculate the odds ratio for the difference in comorbidity
between diseases
We discuss as an example a network showing differences in the strength of
comorbidities between white and black males. We illustrate this on the subset of
Finally, we explore our third question (Q3) by showing that the
lethality of a disease is associated with its connectivity in the PDN. We can
quantify the connectivity of a particular disease by adding the correlations
between a disease and all other diseases to which it is connected
A. Scatter plot between the connectivity of a disease measured in the
A possible explanation for the observed correlation between connectivity and
lethality is that sicker patients accrue more diagnoses and hence the observed
correlation is just a restatement of this trivial fact. We can rule this out by
looking at the correlation between the average connectivity of diseases
diagnosed to patients with a given number of hospital visits, diagnoses, and
number of years they remained alive after the last diagnosis was observed. We
performed this analysis by looking at data on the 7,878,255 patients for whom
we know the exact year of death; the remaining patients were reported as either
alive or unknown in our data set.
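The two quantities used in this analysis, the connectivity of a disease and the average connectivity of the diseases diagnosed to a patient, can be sketched as follows. This is a minimal illustration under stated assumptions: link weights are taken to be the correlation values of the PDN, diseases without any link are given a connectivity of zero, and the function names and toy weights are hypothetical. The stratification by number of visits or diagnoses is not shown.

```python
from collections import defaultdict


def disease_connectivity(weighted_links):
    """Sum, for each disease, the correlations on all links that touch it."""
    k = defaultdict(float)
    for (i, j), weight in weighted_links.items():
        k[i] += weight
        k[j] += weight
    return dict(k)


def average_patient_connectivity(diagnoses, connectivity):
    """Mean connectivity over the distinct diseases diagnosed to one patient."""
    # Assumption: a disease with no links in the network contributes zero.
    values = [connectivity.get(d, 0.0) for d in set(diagnoses)]
    return sum(values) / len(values) if values else 0.0


# Toy link weights with placeholder disease labels, not real data.
toy_links = {("A", "B"): 0.3, ("A", "C"): 0.1, ("B", "C"): 0.2}
connectivity = disease_connectivity(toy_links)
patient_average = average_patient_connectivity({"A", "C"}, connectivity)
print(connectivity, patient_average)
```

Comparing this per-patient average with the years survived, separately within groups of patients who share the same number of visits or diagnoses, is what separates the effect of connectivity from the effect of simply accumulating more diagnoses.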
A. Histogram of the number of visits for each patient for whom the year of death is known. B. Histogram of the number of diagnoses assigned to each patient for whom the year of death is known. C. Correlation between the average connectivity of the diagnoses assigned to a patient and the number of years survived after the last diagnosis was recorded, for groups of patients with the same number of hospital visits. D. Correlation between the average connectivity of the diagnoses assigned to a patient and the number of years survived after the last diagnosis was recorded, for groups of patients with the same total number of diagnoses. Error margins in C and D represent 95% confidence intervals.
Finally, we briefly analyze the directionality of disease progression as observed in our data, keeping in mind that the limited observation period of our study prevents us from being conclusive about directionality, for the reasons given above. Hence, we interpret the following results as suggestive evidence of directionality rather than as proof. To reduce the noise in our analysis we concentrate on links between diseases affecting at least 1 in 500 patients (0.2%), which, given the size of our data set, are expected to co-occur in at least 50 patients. At the 5-digit level our comorbidity data contains 133,858 links connecting the 518 diseases affecting at least 1 out of 500 patients.
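The figure of at least 50 co-occurring patients can be checked with a short calculation. The worked bound below is a sketch under two assumptions that the text does not spell out: that the relevant population is the 13,039,018 hospitalized individuals, and that the expectation is computed as if the two diseases were independent.

```latex
\[
E[C_{ij}] \;=\; N\,p_i\,p_j \;\ge\; 13{,}039{,}018 \times \left(\frac{1}{500}\right)^{2}
\;=\; \frac{13{,}039{,}018}{250{,}000} \;\approx\; 52
\]
```

Here $N$ is the number of patients, $p_i$ and $p_j$ are the prevalences of the two diseases (each at least $1/500$), and $C_{ij}$ is the number of patients affected by both.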
Consider the link connecting diseases
A value of
A. Distribution of λ1→2. B. Disease precedence Λi as a function of disease prevalence
The directionality analysis allows us to extend our study of disease connectivity
and lethality to include the directionality of the links connecting a disease to
other diseases in the PDN. By assigning a direction to the links connecting a
disease with other diseases in the PDN we can classify diseases into
Λi is positive for diseases that tend to come before other
diseases and is negative for diseases that tend to come after other diseases. We
find that Λi is not independent of disease prevalence, as it
exhibits a slow, logarithmic, dependence on it (
While there is a great deal of expectation that disease associations are of enormous potential value to the research community, the lack of phenotypic data available to complement genotypic and proteomic datasets has limited scientific progress towards elucidating the origins of human disease. Here we take a step toward rectifying this situation by introducing an extensive, publicly available data set quantifying comorbidity associations expressed in a large population.
An important issue raised by calls for phenotypic network information is the
potential integration of phenotypic data with genetic and proteomic data to better
elucidate disease etiology. There are, however, other potential applications of a
network-based approach to diseases. Phenotypic “maps” like the
ones presented here could be used to study the disease evolution of patients and
represent an ideal way to visualize and represent medical health records in a future
in which digital medical records will need to be accessed by health care workers in
a delocalized manner
Here we have shown suggestive evidence that patients develop diseases close in the PDN to those already affecting them. We also showed that the PDN has a heterogeneous structure in which some diseases are highly connected while others are barely connected at all. While not conclusive, these observations can explain the finding that more connected diseases are more lethal: patients developing highly connected diseases are more likely to be at an advanced stage of disease, which can be reached through multiple paths in the PDN.
Exploring comorbidities from a network perspective could help determine whether
differences in the comorbidity patterns expressed in different populations indicate
differences in biological processes, environmental factors, or health care quality
provided for each population. Here we show as a first step that there are
differences in the strength of comorbidities measured for patients of different
races and genders. The PDN could be the starting point of studies exploring these and
related questions. This is why we make our data available to the research community
at (
Supplementary Material
We thank Laurie Meneades for the expert data programming required to build the analytic data set and Z. Oltvai and C. Teutsch for useful medical discussions.