NCBI’s Gene Expression Omnibus

This module provides an interface to NCBI’s Gene Expression Omnibus repository. It supports GEO DataSets query and retrieval.

In the following example GDS.get_data construct a data set with genes in rows and samples in columns. Notice that the annotation about each sample is retained in .attributes.

>>> from orangecontrib.bioinformatics.geo.dataset import GDS
>>> gds = GDS("GDS1676")
>>> data = gds.get_data()
>>> len(data)
719
>>> data[200]
[0.503, 0.690, 0.607, -2.250, 0.000, ...] {CD40}
>>> data.domain.attributes[0]
ContinuousVariable(name='GSM63816', number_of_decimals=3)
>>> data.domain.attributes[0].attributes
{'infection': 'acute', 'time': '1 d', 'dose': '20 U/ml IL-2'}

Class References

Usage

The following script prints out information about a specific data set. It does not download the data set, just uses the (local) GEO data sets information file (dataset_info.py).

""" Documentation script """
import textwrap


from orangecontrib.bioinformatics.geo.dataset import GDSInfo

gds_info = GDSInfo()
gds = gds_info["GDS10"]

print("ID:")
print(gds["dataset_id"])
print("Features: ")
print(gds["feature_count"])
print("Genes:")
print(gds["gene_count"])
print("Organism:")
print(gds["platform_organism"])
print("PubMed ID:")
print(gds["pubmed_id"])
print("Sample types:")

for sample_type in set([sinfo["type"] for sinfo in gds["subsets"]]):
    ss = [sinfo["description"] for sinfo in gds["subsets"] if sinfo["type"] == sample_type]
    print("  %s (%s)" % (sample_type, ", ".join(ss)))

print("")
print("Description:")
print("\n".join(textwrap.wrap(gds["description"], 70)))

The output of this script is:

ID:
GDS10
Features:
39114
Genes:
29942
Organism:
Mus musculus
PubMed ID:
11827943
Sample types:
  tissue (spleen, thymus)
  disease state (diabetic, diabetic-resistant, nondiabetic)
  strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)

Description:
Examination of spleen and thymus of type 1 diabetes nonobese diabetic
(NOD) mouse, four NOD-derived diabetes-resistant congenic strains and
two nondiabetic control strains.

Samples in GEO data sets belong to sample subsets, which in turn belong to specific types. The above GDS10 has three sample types, of which the subsets for the tissue type are spleen and thymus. For supervised data mining it would be useful to find out which data sets provide enough samples for each label. It is (semantically) convenient to perform classification within sample subsets of the same type. The following script goes through all data sets and finds those with enough samples within each of the subsets for a specific type. The function valid determines which subset types (if any) satisfy our criteria (dataset_samples.py).

""" Documentation script

Check all data files from GEO, find those which include at least N
samples in all sample subsets of at least one sample type. Useful
when, for instance, filtering out the data sets that could be used for
supervised machine learning.
"""

from orangecontrib.bioinformatics.geo.dataset import GDSInfo, GDS


def valid(info, n=40):
    """Return a set of subset types containing more than n samples in every subset"""
    invalid = set()
    subsets = set([sinfo["type"] for sinfo in info["subsets"]])
    for sampleinfo in info["subsets"]:
        if len(sampleinfo["sample_id"]) < n:
            invalid.add(sampleinfo["type"])
    return subsets.difference(invalid)


def report(stypes, info):
    """Pretty-print GDS and valid susbset types"""
    for id, sts in stypes:
        print(id)
        for st in sts:
            gds = info[id]
            print("  %s:" % st +
                  ", ".join(["%s/%d" % (sinfo["description"], len(sinfo["sample_id"]))
                             for sinfo in gds["subsets"] if sinfo["type"] == st]))



gdsinfo = GDSInfo()
valid_subset_types = [(id, valid(info)) for id, info in sorted(gdsinfo.items()) if valid(info)]
report(valid_subset_types, gdsinfo)

print('datasets = ' + str(len(valid_subset_types)))
print('type subsets = ' + str(sum(len(b) for _, b in valid_subset_types)))

The requested number of samples, n=40, seems to be a quite a stringent criteria met - at the time of writing this - by 40 data sets with 48 sample subsets. The output starts with:

GDS1292
  tissue:raphe magnus/40, somatomotor cortex/43
GDS1293
  tissue:raphe magnus/40, somatomotor cortex/41
GDS1412
  protocol:no treatment/47, hormone replacement therapy/42
GDS1490
  other:non-neural/50, neural/100
GDS1611
  genotype/variation:wild type/48, upf1 null mutant/48
GDS2373
  gender:male/82, female/48
GDS2808
  protocol:training set/44, validation set/50

Let us now pick data set GDS2960 and see if we can predict the disease state. We will use logistic regression, and within 10-fold cross validation measure AUC, the area under ROC. AUC is the probability of correctly distinguishing the two classes, (e.g., the disease and control). From (predict_disease_state.py):

""" Documentation script """
from Orange.classification import LogisticRegressionLearner
from Orange.evaluation.testing import CrossValidation
from Orange.evaluation.scoring import AUC

from orangecontrib.bioinformatics.geo.dataset import GDS

gds = GDS("GDS2960")
data = gds.get_data(sample_type="disease state", transpose=True, report_genes=True)
print("Samples: %d, Genes: %d" % (len(data), len(data.domain.attributes)))

learners = [LogisticRegressionLearner()]
results = CrossValidation(data, learners, k=10)

print("AUC = %.3f" % AUC(results)[0])

The output of this script is:

Samples: 101, Genes: 4069
AUC = 0.996

The AUC for this data set is very high, indicating that using these gene expression data it is almost trivial to separate the two classes.