.. py:currentmodule:: orangecontrib.bioinformatics.geo.dataset

.. index:: NCBI
.. index:: GEO
.. index:: Gene Expression Omnibus
.. index:: microarray data sets

===============================
NCBI's Gene Expression Omnibus
===============================

This module provides an interface to `NCBI `_'s `Gene Expression Omnibus `_
repository. It supports querying and retrieving `GEO DataSets `_.

In the following example :obj:`GDS.get_data` constructs a data set with genes in
rows and samples in columns. Note that the annotation of each sample is
retained in the ``.attributes`` of its column variable.

    >>> from orangecontrib.bioinformatics.geo.dataset import GDS
    >>> gds = GDS("GDS1676")
    >>> data = gds.get_data()
    >>> len(data)
    719
    >>> data[200]
    [0.503, 0.690, 0.607, -2.250, 0.000, ...] {CD40}
    >>> data.domain.attributes[0]
    ContinuousVariable(name='GSM63816', number_of_decimals=3)
    >>> data.domain.attributes[0].attributes
    {'infection': 'acute', 'time': '1 d', 'dose': '20 U/ml IL-2'}

Class References
=================

.. autoclass:: GDSInfo()
   :members:
   :special-members: __init__

.. autoclass:: GDS()
   :members:

Usage
=====

The following script prints out information about a specific data set. It does
not download the data set; it only uses the (local) GEO data sets information
file (:download:`dataset_info.py `).

.. literalinclude:: code/geo/dataset_info.py

The output of this script is::

    ID: GDS10
    Features: 39114
    Genes: 29942
    Organism: Mus musculus
    PubMed ID: 11827943
    Sample types:
      tissue (spleen, thymus)
      disease state (diabetic, diabetic-resistant, nondiabetic)
      strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)
    Description: Examination of spleen and thymus of type 1 diabetes nonobese diabetic (NOD) mouse, four NOD-derived diabetes-resistant congenic strains and two nondiabetic control strains.

Samples in GEO data sets belong to sample subsets, which in turn belong to
specific types. GDS10 above has three sample types; the subsets of the tissue
type, for example, are spleen and thymus.

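
The relation between sample types and subsets described above can be sketched
in plain Python. This is only an illustration: the list-of-dicts layout and the
keys ``type`` and ``description`` are assumptions made for this sketch and may
not match how :obj:`GDSInfo` actually stores its records. The values are copied
from the GDS10 listing above.

.. code-block:: python

    from collections import defaultdict

    # Hypothetical records, one per sample subset (layout assumed for illustration).
    subsets = [
        {"type": "tissue", "description": "spleen"},
        {"type": "tissue", "description": "thymus"},
        {"type": "disease state", "description": "diabetic"},
        {"type": "disease state", "description": "diabetic-resistant"},
        {"type": "disease state", "description": "nondiabetic"},
        {"type": "strain", "description": "NOD"},
        {"type": "strain", "description": "Idd3"},
        {"type": "strain", "description": "Idd5"},
        {"type": "strain", "description": "Idd3+Idd5"},
        {"type": "strain", "description": "Idd9"},
        {"type": "strain", "description": "B10.H2g7"},
        {"type": "strain", "description": "B10.H2g7 Idd3"},
    ]


    def group_by_type(subsets):
        """Map each sample type to the list of its subset descriptions."""
        grouped = defaultdict(list)
        for subset in subsets:
            grouped[subset["type"]].append(subset["description"])
        return dict(grouped)


    types = group_by_type(subsets)
    for sample_type, descriptions in types.items():
        # Prints one line per type, e.g. "tissue (spleen, thymus)".
        print(f"{sample_type} ({', '.join(descriptions)})")

Grouping by type first makes it easy to treat the subsets of one type as the
class labels of a classification problem.
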
For supervised data mining it is useful to find out which data sets provide
enough samples for each label. It is (semantically) convenient to perform
classification within sample subsets of the same type. The following script goes
through all data sets and finds those with enough samples within each of the
subsets for a specific type. The function ``valid`` determines which subset
types (if any) satisfy our criteria (:download:`dataset_samples.py `).

.. literalinclude:: code/geo/dataset_samples.py

The requested number of samples, ``n=40``, is quite a stringent criterion, met
at the time of writing by 40 data sets with 48 sample subsets. The output
starts with::

    GDS1292 tissue:raphe magnus/40, somatomotor cortex/43
    GDS1293 tissue:raphe magnus/40, somatomotor cortex/41
    GDS1412 protocol:no treatment/47, hormone replacement therapy/42
    GDS1490 other:non-neural/50, neural/100
    GDS1611 genotype/variation:wild type/48, upf1 null mutant/48
    GDS2373 gender:male/82, female/48
    GDS2808 protocol:training set/44, validation set/50

Let us now pick data set GDS2960 and see if we can predict the disease state.
We will use logistic regression and, within 10-fold cross-validation, measure
AUC, the area under the ROC curve. AUC is the probability that the classifier
ranks a randomly chosen sample from one class (e.g., disease) above a randomly
chosen sample from the other class (e.g., control). The script
(:download:`predict_disease_state.py `):

.. literalinclude:: code/geo/predict_disease_state.py

The output of this script is::

    Samples: 101, Genes: 4069
    AUC = 0.996

The AUC for this data set is very high, indicating that with these gene
expression data it is almost trivial to separate the two classes.
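
To make the meaning of AUC concrete, the following minimal sketch computes it
directly as the fraction of (positive, negative) sample pairs that a scorer
ranks in the right order, counting ties as one half. The function name
``pairwise_auc`` and the toy scores and labels are assumptions made up for
illustration; they are not taken from ``predict_disease_state.py`` or from
GDS2960.

.. code-block:: python

    def pairwise_auc(scores, labels):
        """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
        positives = [s for s, y in zip(scores, labels) if y == 1]
        negatives = [s for s, y in zip(scores, labels) if y == 0]
        if not positives or not negatives:
            raise ValueError("need at least one sample from each class")
        wins = 0.0
        for p in positives:
            for n in negatives:
                if p > n:
                    wins += 1.0
                elif p == n:
                    wins += 0.5
        return wins / (len(positives) * len(negatives))


    # Toy predicted scores and true labels (1 = disease, 0 = control), invented here.
    scores = [0.9, 0.8, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 0]

    # Five of the six (positive, negative) pairs are ordered correctly.
    print(round(pairwise_auc(scores, labels), 3))

With perfectly separated scores the function returns 1.0, so an AUC close to 1
means that nearly every pair of samples from the two classes is ordered
correctly.
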