Sample Annotator

class AnnotateSamples[source]

AnnotateSamples is a class for annotating data items with annotations (e.g. cell types). The Mann-Whitney U test is used to select important values, and a hypergeometric or binomial test (see p_value_fun) is used to assign the annotations.

Example on biological data where we assign cell types to cells:

>>> gene_expressions_df = pd.read_csv("data/DC_expMatrix_DCnMono.csv.gz",
...                                   compression='gzip')
>>> marker_genes_df = pd.read_csv("data/panglao_gene_markers.csv.gz",
...                               compression="gzip")
>>> # keep only human markers
>>> marker_genes_df = marker_genes_df[
...     marker_genes_df["Organism"] == "Human"]
>>>
>>> annotations = AnnotateSamples.annotate_samples(
...     gene_expressions_df, marker_genes_df, num_all_attributes=60000,
...     attributes_col="Cell Type", annotations_col="Name",
...     p_threshold=0.05)

Example of fully manual annotation, where the annotation is split into three phases. We assume that the data are already loaded.

>>> z = AnnotateSamples.mann_whitney_test(gene_expressions_df)
>>> scores, p_val = AnnotateSamples.assign_annotations(
...     z, marker_genes_df, gene_expressions_df, num_all_attributes=60000,
...     attributes_col="Cell Type", annotations_col="Name")
>>> scores = AnnotateSamples.filter_annotations(
...     scores, p_val, p_threshold=0.05)

static annotate_samples(data, available_annotations, num_all_attributes=None, attributes_col='Attributes', annotations_col='Annotations', return_nonzero_annotations=True, p_threshold=0.05, p_value_fun='binom', z_threshold=1, scoring='scoring_exp_ratio', normalize=False)[source]

This function annotates the data with the provided annotations and implements the complete pipeline: it first selects attributes for each item using z-values from the Mann-Whitney U test, then assigns annotations, and finally filters them.

Parameters

data (pd.DataFrame) – Tabular data
available_annotations (pd.DataFrame) – Available annotations (e.g. cell types). This data frame has two columns: the name of the attributes column is set by attributes_col (default: "Attributes") and the name of the annotations column is set by annotations_col (default: "Annotations").
num_all_attributes (int) – The number of all attributes for a case, including those that do not appear in the data. For genes, it is the number of all genes that the organism has. Setting this value explicitly is recommended; when it is not set, the number of attributes in the z_values table is used.
return_nonzero_annotations (bool, optional (default=True)) – If True, return scores only for annotations present in at least one sample.
attributes_col (str) – The name of the attributes column in available_annotations (default: "Attributes").
annotations_col (str) – The name of the annotations column in available_annotations (default: "Annotations").
p_threshold (float) – A threshold for accepting annotations. Annotations with an FDR value below this threshold are used.
p_value_fun (str, optional (default: TEST_BINOMIAL)) – The function that calculates the p-value. It can be either PFUN_BINOMIAL, which uses statistics.Binomial().p_value, or PFUN_HYPERGEOMETRIC, which uses hypergeom.sf.
z_threshold (float) – The threshold for selecting the attribute. For each item the attributes with z-value above this value are selected.
scoring (str, optional (default = SCORING_EXP_RATIO)) – Type of scoring
normalize (bool, optional (default = False)) – This variable tells whether to normalize data or not.

Returns

Scores table. Each row of the table contains scores that tell how probable it is that the item has a specific annotation.

Return type

pd.DataFrame
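
The attribute selection controlled by z_threshold can be pictured with a minimal standard-library sketch. The helper select_attributes, the dictionary layout, and the gene names and z-values are hypothetical and only illustrate the rule "keep attributes whose z-value is above the threshold"; the library itself works on DataFrames.

```python
def select_attributes(z_row, z_threshold=1):
    """Return the attributes of one item whose z-value exceeds the threshold."""
    return {name for name, z in z_row.items() if z > z_threshold}


# Hypothetical z-values for a single item (attribute name -> z-value).
z_row = {"CD3E": 2.4, "CD19": 0.3, "LYZ": 1.0, "NKG7": 1.7}
selected = select_attributes(z_row, z_threshold=1)
```

The comparison is strict, so an attribute whose z-value equals the threshold is not selected in this sketch.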

static assign_annotations(z_values, available_annotations, data, num_all_attributes=None, attributes_col='Attributes', annotations_col='Annotations', z_threshold=1, p_value_fun='binom', scoring='scoring_exp_ratio')[source]

The function gets a set of attributes (e.g. genes) for each item and attributes for each annotation. It returns the annotations significant for each item.

Parameters

z_values (pd.DataFrame) – DataFrame that shows z values for each item. Rows are data items and columns are attributes.
available_annotations (pd.DataFrame) – Available annotations (e.g. cell types). This data frame has two columns: the name of the attributes column is set by attributes_col (default: "Attributes") and the name of the annotations column is set by annotations_col (default: "Annotations").
data (pd.DataFrame) – Tabular input (raw) data, needed to compute the scores.
num_all_attributes (int) – The number of all attributes for a case, including those that do not appear in the data. For genes, it is the number of all genes that the organism has. Setting this value explicitly is recommended; when it is not set, the number of attributes in the z_values table is used.
attributes_col (str) – The name of the attributes column in available_annotations (default: "Attributes").
annotations_col (str) – The name of the annotations column in available_annotations (default: "Annotations").
z_threshold (float) – The threshold for selecting the attribute. For each item, the attributes with z-value above this value are selected.
p_value_fun (str, optional (default: TEST_BINOMIAL)) – The function that calculates the p-value. It can be either PFUN_BINOMIAL, which uses binom.sf, or PFUN_HYPERGEOMETRIC, which uses hypergeom.sf.
scoring (str, optional (default=SCORING_EXP_RATIO)) – Type of scoring

Returns

pd.DataFrame – Annotation probabilities
pd.DataFrame – Annotation FDRs.
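
To illustrate the kind of enrichment p-value that PFUN_HYPERGEOMETRIC refers to, here is a minimal standard-library sketch of the hypergeometric survival function, P(X > k). The function name hypergeom_sf, its parameter names, and the example numbers are hypothetical; the library calls hypergeom.sf, and this sketch is not its implementation.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X > k) for X ~ Hypergeometric(population, successes, draws).

    population: number of all attributes (e.g. all genes)
    successes:  attributes that mark the annotation (e.g. marker genes)
    draws:      attributes selected for the item (z-value above threshold)
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k + 1, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 attributes, 5 markers, 6 selected, 3 overlapping.
# The probability of at least 3 overlapping markers is sf(2).
p_value = hypergeom_sf(2, population=20, successes=5, draws=6)
```

A small p_value means that the overlap between the selected attributes and the markers of an annotation is unlikely to arise by chance, which is why num_all_attributes (the population size) affects the result.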

static filter_annotations(scores, p_values, return_nonzero_annotations=True, p_threshold=0.05)[source]

This function filters out the scores at positions where the p-value does not reach the threshold and, if return_nonzero_annotations is True, removes columns that contain only zeros.

Parameters

scores (pd.DataFrame) – Scores for each annotation for data items
p_values (pd.DataFrame) – p-value scores for annotations for data items
return_nonzero_annotations (bool) – If True, columns that contain only zeros are removed.
p_threshold (float) – A threshold for accepting annotations. Annotations with an FDR value below this threshold are used.

Returns

Filtered scores for each annotation for data items

Return type

pd.DataFrame
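
The filtering step can be sketched with plain lists. This is an illustration under assumptions, not the library's code: the helper filter_scores is hypothetical, tables are lists of rows rather than DataFrames, rejected scores are assumed to be set to zero, and the example values and annotation names are made up.

```python
def filter_scores(scores, p_values, columns, p_threshold=0.05,
                  return_nonzero_annotations=True):
    """Zero out scores whose p-value misses the threshold.

    scores and p_values are lists of rows (one row per data item);
    columns holds the annotation name of each column.
    """
    filtered = [
        [s if p < p_threshold else 0.0 for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(scores, p_values)
    ]
    if return_nonzero_annotations:
        # Keep only annotations with a non-zero score in at least one item.
        keep = [j for j in range(len(columns))
                if any(row[j] != 0 for row in filtered)]
        filtered = [[row[j] for j in keep] for row in filtered]
        columns = [columns[j] for j in keep]
    return filtered, columns


# Hypothetical scores and p-values for two items and three annotations.
scores = [[0.8, 0.3, 0.5], [0.6, 0.2, 0.0]]
p_values = [[0.01, 0.20, 0.04], [0.03, 0.50, 0.90]]
columns = ["T cells", "B cells", "Monocytes"]
filtered, kept = filter_scores(scores, p_values, columns)
```

Here the "B cells" column is dropped because neither item reaches the threshold for it, while "Monocytes" is kept because the first item does.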

static log_cpm(data)[source]

Normalizes data with the log CPM method.

Parameters: data (pd.DataFrame) – Non-normalized data table.
Returns: Normalized data table.
Return type: pd.DataFrame
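
As a rough illustration of log CPM (counts per million) normalization, the sketch below scales each row to one million counts and then takes a logarithm. The helper log_cpm_rows is hypothetical, and the base-2 logarithm and the pseudocount of 1 are assumptions; the source names only the method, not these details.

```python
from math import log2


def log_cpm_rows(counts):
    """Return log2(1 + CPM) for each row of raw counts (one row per item)."""
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # An item with no counts stays at zero after normalization.
            normalized.append([0.0 for _ in row])
            continue
        normalized.append([log2(1 + value / total * 1e6) for value in row])
    return normalized


# Hypothetical raw counts: two items, two attributes.
normalized = log_cpm_rows([[1, 3], [0, 0]])
```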

static mann_whitney_test(data)[source]

Compute z values with the Mann-Whitney U test.

Parameters: data (pd.DataFrame) – Tabular data.
Returns: Z-value for each item.
Return type: pd.DataFrame
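
For readers unfamiliar with the test, the following sketch computes a Mann-Whitney U z-value for two samples using the normal approximation. The helper mann_whitney_z is hypothetical, it applies no tie or continuity correction, and how the library forms the two groups for each item is not stated in this documentation, so the sketch only shows the statistic itself.

```python
from math import sqrt


def mann_whitney_z(x, y):
    """z-value of the Mann-Whitney U statistic of sample x against sample y."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Average 1-based rank for every distinct value (handles ties).
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_x = sum(rank_of[value] for value in x)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    std_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / std_u


# Hypothetical samples: every value in x is larger than every value in y.
z = mann_whitney_z([5, 6, 7], [1, 2, 3, 4])
```

A positive z-value indicates that values in x tend to be larger than values in y, which matches the later rule of selecting attributes with a z-value above z_threshold.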

Projection Annotator

This module clusters a projection of the data (usually a 2D projection) with one of the standard clustering algorithms and attaches a certain number of labels to each cluster.

Example:

>>> from Orange.projection import TSNE
>>> from Orange.data import Table
>>> from orangecontrib.bioinformatics.utils import serverfiles
>>> from Orange.clustering.dbscan import DBSCAN
>>> from orangecontrib.bioinformatics.annotation.annotate_projection import \
...     annotate_projection
>>> from orangecontrib.bioinformatics.annotation.annotate_samples import \
...     AnnotateSamples
>>>
>>> # load data
>>> data = Table("https://datasets.orange.biolab.si/sc/aml-1k.tab.gz")
>>> marker_p = serverfiles.localpath_download(
...     'marker_genes','panglao_gene_markers.tab')
>>> markers = Table(marker_p)
>>>
>>> # annotate data with labels
>>> annotator = AnnotateSamples()
>>> annotations = annotator.annotate_samples(data, markers)
>>>
>>> # project data in 2D
>>> tsne = TSNE(n_components=2)
>>> tsne_model = tsne(data)
>>> embedding = tsne_model(data)
>>>
>>> # get clusters and annotations for clusters
>>> clusters, clusters_meta, eps = annotate_projection(annotations, embedding,
...     clustering_algorithm=DBSCAN, eps=1.2)

When DBSCAN is used and eps is not passed to annotate_projection, it is computed automatically with the k-nearest-neighbours method (see get_epsilon).

>>> clusters, clusters_meta, eps = annotate_projection(annotations, embedding,
...     clustering_algorithm=DBSCAN)

annotate_projection(annotations, coordinates, clustering_algorithm=<class 'Orange.clustering.dbscan.DBSCAN'>, labels_per_cluster=3, **kwargs)[source]

The function clusters the data based on coordinates and assigns a certain number of labels to each cluster. Each cluster is assigned its labels_per_cluster most common labels.

Parameters

annotations (Orange.data.Table) – Table with annotations and their probabilities.
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clustering_algorithm (callable, optional (default = DBSCAN)) – Algorithm used in clustering.
labels_per_cluster (int, optional (default = 3)) – Number of labels that need to be assigned to each cluster.

Returns

Orange.data.Table – List of cluster indices.
dict – Dictionary with cluster index as a key and list of annotations as a value. Each list includes tuples with the annotation name and its proportion in the cluster.
dict – The coordinates for locating the label. Dictionary with cluster index as a key and tuple (x, y) as a value.

assign_labels(clusters, annotations, labels_per_cluster)[source]

This function assigns a certain number of labels to each cluster. Each cluster is assigned its labels_per_cluster most common labels.

Parameters

clusters (Orange.data.Table) – Cluster indices for each item.
annotations (Orange.data.Table) – Table with annotations and their probabilities.
labels_per_cluster (int) – Number of labels that need to be assigned to each cluster.

Returns

dict – Dictionary with cluster index as a key and list of annotations as a value. Each list includes tuples with the annotation name and its proportion in the cluster.
Orange.data.Table – The array with the annotation assigned to the item.
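
The "most common labels with their proportions" idea can be sketched with collections.Counter. The helper top_labels_per_cluster is hypothetical, and the sketch assumes that every item already carries a single label; the library works on Orange tables with annotation probabilities, so this is only an illustration of the returned dictionary's shape.

```python
from collections import Counter, defaultdict


def top_labels_per_cluster(clusters, item_labels, labels_per_cluster=3):
    """Map each cluster index to its most common (label, proportion) tuples."""
    grouped = defaultdict(list)
    for cluster, label in zip(clusters, item_labels):
        grouped[cluster].append(label)
    result = {}
    for cluster, labels in grouped.items():
        counts = Counter(labels).most_common(labels_per_cluster)
        result[cluster] = [(label, n / len(labels)) for label, n in counts]
    return result


# Hypothetical cluster indices and per-item labels.
clusters = [0, 0, 0, 1, 1]
item_labels = ["T cells", "T cells", "B cells", "NK cells", "NK cells"]
labels = top_labels_per_cluster(clusters, item_labels, labels_per_cluster=2)
```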

cluster_additional_points(coordinates, hulls, cluster_attribute=None)[source]

This function receives additional points and assigns them to the existing clusters based on the current concave hulls.

Parameters

coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
hulls (dict) – Concave hull for each cluster
cluster_attribute (Orange.data.DiscreteVariable (optional)) – A variable for clusters. If cluster_attribute is provided it will be used in the creation of the resulting Table.

Returns

Cluster label for each point

Return type

Orange.data.Table
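
One plausible way to assign a new point to a cluster from its hull is a point-in-polygon test. The sketch below uses ray casting; the helpers point_in_polygon and assign_to_hulls are hypothetical, and whether the library uses this particular test is an assumption not confirmed by this documentation.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the closed polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_to_hulls(points, hulls):
    """Return the index of the first hull containing each point, or None."""
    labels = []
    for point in points:
        label = None
        for cluster, polygon in hulls.items():
            if point_in_polygon(point, polygon):
                label = cluster
                break
        labels.append(label)
    return labels


# Hypothetical hulls (two squares) and three new points.
hulls = {0: [(0, 0), (2, 0), (2, 2), (0, 2)],
         1: [(5, 5), (7, 5), (7, 7), (5, 7)]}
assigned = assign_to_hulls([(1, 1), (6, 6), (10, 10)], hulls)
```

Points that fall outside every hull receive None in this sketch.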

cluster_data(coordinates, clustering_algorithm=<class 'Orange.clustering.dbscan.DBSCAN'>, **kwargs)[source]

This function receives data and clusters them.

Parameters

coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clustering_algorithm (callable) – Algorithm used for clustering.

Returns

List of cluster indices.

Return type

Orange.data.Table

compute_concave_hulls(coordinates, clusters, epsilon)[source]

Function computes the points of the concave hull around points.

Parameters

coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clusters (Orange.data.Table) – Cluster indices for each item.
epsilon (float) – Epsilon used by DBSCAN to cluster the data

Returns

The points of the concave hull. Dictionary with cluster index as a key and np.ndarray of points as a value - [[x1, y1], [x2, y2], [x3, y3], …]

Return type

dict

edges_to_polygon(edges_list, points_list)[source]

This function connects edges into polygons. It computes all possible hulls (a cluster can have more than one when it has a hole in the middle) and then selects the outer one.

Parameters

edges_list (list) – List of edges. Each edge is a tuple of two indices that give the starting and ending node. Each index corresponds to the location of the point in points_list.
points_list (list) – List of point locations. Each point has an x and y location.

Returns

The array with the hull/polygon points.

Return type

np.ndarray
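
The idea of chaining edges into closed loops and keeping the outer one can be sketched as follows. The helpers edges_to_loops, polygon_area and outer_polygon are hypothetical. The sketch assumes that every node belongs to exactly two edges, so each loop is a simple polygon, and it picks the outer hull as the loop with the largest area, which is an assumption about how "outer" is decided.

```python
def edges_to_loops(edges):
    """Chain undirected edges (i, j) into closed loops of point indices."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    unvisited = set(neighbours)
    loops = []
    while unvisited:
        start = min(unvisited)
        loop, previous, current = [start], None, start
        unvisited.discard(start)
        while True:
            # Walk to the neighbour we did not just come from.
            nxt = [n for n in neighbours[current] if n != previous][0]
            if nxt == start:
                break
            loop.append(nxt)
            unvisited.discard(nxt)
            previous, current = current, nxt
        loops.append(loop)
    return loops


def polygon_area(points):
    """Absolute polygon area by the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def outer_polygon(edges, points):
    """Return the points of the loop that encloses the largest area."""
    polygons = [[points[i] for i in loop] for loop in edges_to_loops(edges)]
    return max(polygons, key=polygon_area)


# Hypothetical cluster with a hole: an outer and an inner square.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (7, 4)]
hull = outer_polygon(edges, points)
```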

get_epsilon(coordinates, k=10, skip=0.1)[source]

The function computes the epsilon parameter for DBSCAN using the method proposed in the paper.

Parameters

coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
k (int) – The index k of the k-th nearest neighbour that is observed.
skip (float) – Fraction of skipped neighbours.

Returns

Epsilon parameter for DBSCAN

Return type

float
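
A common way to derive a DBSCAN epsilon from k-nearest-neighbour distances is sketched below. The helper knn_epsilon is hypothetical, and the exact rule is an assumption: the points with the largest k-th neighbour distances, a fraction skip of all points, are treated as noise, and the largest remaining distance is returned. The method actually used by get_epsilon is defined by the paper it refers to and may differ.

```python
from math import dist


def knn_epsilon(points, k=10, skip=0.1):
    """Estimate a DBSCAN eps from k-th nearest-neighbour distances."""
    k = min(k, len(points) - 1)  # a point has at most len(points) - 1 neighbours
    kth_distances = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        kth_distances.append(distances[k - 1])
    kth_distances.sort()
    # Treat the `skip` fraction with the largest k-th distances as noise.
    n_skipped = int(round(len(kth_distances) * skip))
    return kth_distances[len(kth_distances) - 1 - n_skipped]


# Hypothetical 2D points: four close together and one outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]
eps = knn_epsilon(points, k=1, skip=0.2)
```

With skip=0.2 the outlier's distance is ignored, so eps reflects the spacing of the dense group rather than the distance to the outlier.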

labels_locations(coordinates, clusters)[source]

Computes the location of the label for each cluster. The location is computed as the center point.

Parameters

coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clusters (Orange.data.Table) – Cluster indices for each item.

Returns

The coordinates for locating the label. Dictionary with cluster index as a key and tuple (x, y) as a value.

Return type

dict
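
The shape of the returned dictionary can be illustrated with a short sketch. The helper cluster_centers is hypothetical, and taking the arithmetic mean of the coordinates as the "center point" is an assumption; the library may define the center differently (for example as a median).

```python
from collections import defaultdict
from statistics import fmean


def cluster_centers(coordinates, clusters):
    """Map each cluster index to the mean (x, y) of its points."""
    grouped = defaultdict(list)
    for point, cluster in zip(coordinates, clusters):
        grouped[cluster].append(point)
    return {
        cluster: (fmean([p[0] for p in pts]), fmean([p[1] for p in pts]))
        for cluster, pts in grouped.items()
    }


# Hypothetical embedding coordinates and cluster indices.
centers = cluster_centers([(0, 0), (2, 2), (10, 0)], [0, 0, 1])
```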