Genes

This module is a wrapper around the Gene database provided by NCBI. It exposes a simple interface for working with genes in Python. It also provides a way to map almost any kind of gene identifier to its corresponding Entrez ID.

Usage

from orangecontrib.bioinformatics.ncbi.gene import GeneMatcher

# Notice that we have symbols, synonyms and an Ensembl ID here
genes_of_interest = ['CD4', 'ENSG00000205426', "2'-PDE", 'HB-1Y']

# Initialize GeneMatcher. Human is our organism of interest.
gm = GeneMatcher('9606')
# this will automatically start the process of name matching
gm.genes = genes_of_interest

# print results
for gene, gene_obj in zip(genes_of_interest, gm.genes):
    print(f"{gene:<20} {gene_obj}")

We are lucky: every gene name has a unique match in the Gene database.

CD4                  <Gene symbol=CD4, tax_id=9606, gene_id=920>
ENSG00000205426      <Gene symbol=KRT81, tax_id=9606, gene_id=3887>
2'-PDE               <Gene symbol=PDE12, tax_id=9606, gene_id=201626>
HB-1Y                <Gene symbol=HMHB1, tax_id=9606, gene_id=57824>

Now that we have identified our genes, we can explore further. Each Gene is automatically populated with additional information from the NCBI database.

g = gm.genes[0]

print(g.synonyms)
['CD4mut']

print(g.db_refs)
{'MIM': '186940', 'HGNC': 'HGNC:1678', 'Ensembl': 'ENSG00000010610'}

print(g.type_of_gene)
protein-coding

print(g.description)
CD4 molecule


# look at all the available Gene attributes
print(g.__slots__)
('species', 'tax_id', 'gene_id', 'symbol', 'synonyms', 'db_refs', 'description', 'locus_tag', 'chromosome',
'map_location', 'type_of_gene', 'symbol_from_nomenclature_authority', 'full_name_from_nomenclature_authority',
'nomenclature_status', 'other_designations', 'modification_date', 'homology_group_id',
'homologs', 'input_identifier')
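
Because the attribute names are exposed through __slots__, a gene can be flattened into a plain dictionary, which is handy for logging or for building a table by hand. The following is a minimal sketch, not part of the library: gene_to_dict is a hypothetical helper, and FakeGene is a stand-in class (with only three of the slots listed above) used so that the snippet runs without downloading NCBI data. With the real library you would pass an element of gm.genes instead.

```python
class FakeGene:
    """Stand-in for a Gene object (assumption: only three of the real slots)."""
    __slots__ = ('symbol', 'tax_id', 'gene_id')

    def __init__(self, symbol, tax_id, gene_id):
        self.symbol = symbol
        self.tax_id = tax_id
        self.gene_id = gene_id


def gene_to_dict(gene):
    """Collect every slot of `gene` into a dict; unset slots become None."""
    return {name: getattr(gene, name, None) for name in gene.__slots__}


# Values taken from the CD4 match shown earlier; string IDs are an assumption.
record = gene_to_dict(FakeGene('CD4', '9606', '920'))
print(record)
```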

We can also access homologs directly from the Gene interface:

print(g.homologs)
{'9913': '407098', '10090': '12504', '10116': '24932'}

print(g.homology_group_id)
'513'

# Find homolog in mouse.
print(g.homolog_gene(taxonomy_id='10090'))
'12504'
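
Judging from the output above, homologs maps taxonomy ids to Entrez IDs, so a homolog lookup amounts to a dictionary access. Here is a minimal sketch of that idea using the values printed above. Note that find_homolog is a hypothetical helper rather than the library's implementation, and returning None for an organism that is not in the mapping is an assumption.

```python
# Homologs of human CD4, copied from the output above
# (taxonomy id -> Entrez ID).
homologs = {'9913': '407098', '10090': '12504', '10116': '24932'}


def find_homolog(homologs, taxonomy_id):
    """Return the Entrez ID of the homolog in the given organism, or None."""
    return homologs.get(taxonomy_id)


mouse = find_homolog(homologs, '10090')      # mouse
zebrafish = find_homolog(homologs, '7955')   # zebrafish, absent from the mapping
print(mouse, zebrafish)
```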

Class References

class Gene

Representation of gene summary.

__init__(input_identifier=None)

To match a gene to its corresponding Entrez ID, we must provide an input identifier when the class is initialized. This tells GeneMatcher what to match against in the Gene database.

Parameters

input_identifier (str) – This can be any of the following: symbol, synonym, locus tag, other database id, …

homolog_gene(taxonomy_id)

Returns the gene homolog for the given organism.

Parameters

taxonomy_id (str) – Taxonomy id of target organism.

Returns

Entrez ID (if available).

Return type

str

class GeneMatcher

Gene name matching interface.

__init__(tax_id, progress_callback=None, auto_start=True)

Parameters

tax_id (str) – Taxonomy id of target organism.

get_known_genes()

Returns the genes that have a known Entrez ID.

Returns

Genes with unique match

Return type

list of Gene instances
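
As a rough illustration of what "known" means here, the sketch below keeps only the records that carry an Entrez ID. It is not the library's code: Record is a hypothetical stand-in for Gene, and representing an unmatched gene by gene_id=None is an assumption.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical stand-in for Gene with just the fields used below."""
    input_identifier: str
    gene_id: Optional[str]


def known_genes(genes):
    """Keep only the records that were matched to an Entrez ID."""
    return [g for g in genes if g.gene_id is not None]


genes = [
    Record('CD4', '920'),
    Record('HB-1Y', '57824'),
    Record('not-a-gene', None),  # assumption: unmatched genes have no gene_id
]
matched = known_genes(genes)
print([g.input_identifier for g in matched])
```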

match_table_attributes(data_table)

Helper function for gene name matching with Orange.data.Table.

Match table attributes and, if a unique match is found, create a new column attribute for the Entrez ID. The attribute name is defined in orangecontrib.bioinformatics.ncbi.gene.config.NCBI_ID.

Parameters

data_table (Orange.data.Table) – Data table

Returns

Data table column attributes are populated with Entrez Ids

Return type

Orange.data.Table

match_table_column(data_table, column_name, target_column=None)

Helper function for gene name matching with Orange.data.Table.

Given a column of genes, GeneMatcher will try to map the genes to their corresponding Entrez IDs.

Parameters
  • data_table (Orange.data.Table) – Data table

  • column_name (str) – Name of the column where gene symbols are located

  • target_column (StringVariable) – Column where we store Entrez Ids. Defaults to StringVariable(ncbi.gene.config.NCBI_ID)

Returns

Data table with a column of Gene Ids

Return type

Orange.data.Table

to_data_table(selected_genes=None)

Transform GeneMatcher results to Orange data table.

Optionally, we can provide a list of genes (Entrez IDs). The output table will then contain only the provided genes.

Parameters

selected_genes (list) – List of Entrez Ids

Returns

Summary of Gene info in tabular format

Return type

Orange.data.Table

class GeneInfo

__init__(tax_id)

Loads the genes for the given organism into a dict.

Each Gene instance is mapped to its corresponding Entrez ID.

Parameters

tax_id (str) – Taxonomy id of target organism.
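
Conceptually, this gives a lookup table keyed by Entrez ID. The sketch below mimics that shape with a plain dict and the four human genes from the Usage section, plus the mouse homolog of CD4. GeneRecord and build_gene_index are hypothetical names, not library API. Whether the real keys are strings or integers is not stated here, so strings are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical stand-in for Gene with three of its attributes."""
    symbol: str
    tax_id: str
    gene_id: str


def build_gene_index(records, tax_id):
    """Map Entrez ID -> record for all records of one organism."""
    return {r.gene_id: r for r in records if r.tax_id == tax_id}


records = [
    GeneRecord('CD4', '9606', '920'),
    GeneRecord('KRT81', '9606', '3887'),
    GeneRecord('PDE12', '9606', '201626'),
    GeneRecord('HMHB1', '9606', '57824'),
    GeneRecord('Cd4', '10090', '12504'),  # mouse homolog of CD4
]

human_index = build_gene_index(records, '9606')
print(human_index['920'].symbol)
```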