Sample Annotator
- class AnnotateSamples[source]
AnnotateSamples is a class used to annotate data items with annotations (e.g. cell types). It uses the Mann-Whitney U test to select important values and the hypergeometric test to assign the annotations.
Example on biological data where we assign cell types to cells:
>>> import pandas as pd
>>> gene_expressions_df = pd.read_csv("data/DC_expMatrix_DCnMono.csv.gz",
...                                   compression='gzip')
>>> marker_genes_df = pd.read_csv("data/panglao_gene_markers.csv.gz",
...                               compression="gzip")
>>> # keep only the human markers
>>> marker_genes_df = marker_genes_df[
...     marker_genes_df["Organism"] == "Human"]
>>>
>>> annotations = AnnotateSamples.annotate_samples(
...     gene_expressions_df, marker_genes_df, num_all_attributes=60000,
...     attributes_col="Cell Type", annotations_col="Name",
...     p_threshold=0.05)
Example of fully manual annotation. Here the annotation is split into three phases. We assume that the data are already loaded.
>>> z = AnnotateSamples.mann_whitney_test(gene_expressions_df)
>>> scores, p_val = AnnotateSamples.assign_annotations(
...     z, marker_genes_df, gene_expressions_df, num_all_attributes=60000,
...     attributes_col="Cell Type", annotations_col="Name")
>>> scores = AnnotateSamples.filter_annotations(
...     scores, p_val, p_threshold=0.05)
- static annotate_samples(data, available_annotations, num_all_attributes=None, attributes_col='Attributes', annotations_col='Annotations', return_nonzero_annotations=True, p_threshold=0.05, p_value_fun='binom', z_threshold=1, scoring='scoring_exp_ratio', normalize=False)[source]
The function marks the data with the provided annotations. It implements the complete workflow: it first selects attributes for each item with the z-test, then annotates the data, and finally filters the annotations.
- Parameters
data (pd.DataFrame) – Tabular data
available_annotations (pd.DataFrame) – Available annotations (e.g. cell types). This data frame has two columns: the name of the attributes column is set by attributes_col (default: Attributes) and the name of the annotations column is set by annotations_col (default: Annotations).
num_all_attributes (int) – The total number of attributes for a case, including those that do not appear in the data. For genes, this is the number of all genes that the organism has. Setting this value explicitly is recommended; if it is not set, the number of attributes in the z_values table is used.
return_nonzero_annotations (bool, optional (default=True)) – If True, return scores only for annotations present in at least one sample.
attributes_col (str) – The name of the attributes column in available_annotations (default: "Attributes").
annotations_col (str) – The name of the annotations column in available_annotations (default: "Annotations").
p_threshold (float) – The threshold for accepting annotations. Annotations with an FDR value below this threshold are used.
p_value_fun (str, optional (default: TEST_BINOMIAL)) – The function that calculates the p-value. It can be either PFUN_BINOMIAL, which uses statistics.Binomial().p_value, or PFUN_HYPERGEOMETRIC, which uses hypergeom.sf.
z_threshold (float) – The threshold for selecting the attribute. For each item the attributes with z-value above this value are selected.
scoring (str, optional (default = SCORING_EXP_RATIO)) – Type of scoring
normalize (bool, optional (default = False)) – Whether to normalize the data.
- Returns
Scores table. Each row of the table holds scores that tell how probable it is that the item has a specific annotation.
- Return type
pd.DataFrame
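The z_threshold selection step described above can be illustrated with a minimal sketch. It uses plain dictionaries instead of a DataFrame, and the function name select_attributes and the z-values are made up for illustration; it is not the library's implementation.

```python
def select_attributes(z_values, z_threshold=1.0):
    """Return, for each item, the attributes whose z-value is above the threshold."""
    return {
        item: {attr for attr, z in row.items() if z > z_threshold}
        for item, row in z_values.items()
    }


# Made-up z-values: outer keys are items (cells), inner keys are attributes (genes).
z_values = {
    "cell_1": {"CD14": 2.3, "CD3E": 0.2, "MS4A1": -0.5},
    "cell_2": {"CD14": 0.1, "CD3E": 1.8, "MS4A1": 1.0},
}

selected = select_attributes(z_values, z_threshold=1)
print(selected)
```

A value exactly equal to the threshold is not selected here, because the sketch reads "above" as a strict comparison.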
- static assign_annotations(z_values, available_annotations, data, num_all_attributes=None, attributes_col='Attributes', annotations_col='Annotations', z_threshold=1, p_value_fun='binom', scoring='scoring_exp_ratio')[source]
The function gets a set of attributes (e.g. genes) for each item and attributes for each annotation. It returns the annotations significant for each item.
- Parameters
z_values (pd.DataFrame) – DataFrame that shows z values for each item. Rows are data items and columns are attributes.
available_annotations (pd.DataFrame) – Available annotations (e.g. cell types). This data frame has two columns: the name of the attributes column is set by attributes_col (default: Attributes) and the name of the annotations column is set by annotations_col (default: Annotations).
data (pd.DataFrame) – Tabular input (raw) data, needed to compute the scores.
num_all_attributes (int) – The total number of attributes for a case, including those that do not appear in the data. For genes, this is the number of all genes that the organism has. Setting this value explicitly is recommended; if it is not set, the number of attributes in the z_values table is used.
attributes_col (str) – The name of the attributes column in available_annotations (default: "Attributes").
annotations_col (str) – The name of the annotations column in available_annotations (default: "Annotations").
z_threshold (float) – The threshold for selecting the attribute. For each item, the attributes with z-value above this value are selected.
p_value_fun (str, optional (default: TEST_BINOMIAL)) – The function that calculates the p-value. It can be either PFUN_BINOMIAL, which uses binom.sf, or PFUN_HYPERGEOMETRIC, which uses hypergeom.sf.
scoring (str, optional (default=SCORING_EXP_RATIO)) – Type of scoring
- Returns
pd.DataFrame – Annotation probabilities
pd.DataFrame – Annotation FDRs.
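To make the hypergeometric option more concrete, the sketch below computes an upper-tail hypergeometric probability with the standard library only. It is an illustration of the statistical idea, not the code path the class uses; the function name hypergeom_upper_tail and the small numbers are assumptions chosen so the result can be checked by hand.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n attributes are drawn from N, of which K are markers.

    N: number of all attributes (num_all_attributes)
    K: number of marker attributes of one annotation
    n: number of attributes selected for one item
    k: number of selected attributes that are also markers
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# 10 attributes in total, 4 markers, 3 selected for the item, 2 of them markers.
p = hypergeom_upper_tail(k=2, N=10, K=4, n=3)
print(p)
```

A small probability means that the overlap between the selected attributes and the markers is unlikely to be a coincidence. Note that scipy's hypergeom.sf(k, ...) returns P(X > k), so the same quantity would be obtained there with k - 1.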
- static filter_annotations(scores, p_values, return_nonzero_annotations=True, p_threshold=0.05)[source]
This function filters out the scores at positions that do not reach the p-value threshold and, if return_nonzero_annotations is True, removes the columns that contain only zeros.
- Parameters
scores (pd.DataFrame) – Scores for each annotation for data items
p_values (pd.DataFrame) – p-value scores for annotations for data items
return_nonzero_annotations (bool) – If True, columns that contain only zeros are filtered out.
p_threshold (float) – The threshold for accepting annotations. Annotations with an FDR value below this threshold are used.
- Returns
Filtered scores for each annotation for data items
- Return type
pd.DataFrame
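The filtering step can be sketched as follows. The sketch works on nested dictionaries rather than DataFrames, assumes that rejected scores are replaced by zero, and uses made-up scores and p-values; the name filter_scores is hypothetical.

```python
def filter_scores(scores, p_values, return_nonzero_annotations=True, p_threshold=0.05):
    """Zero out scores whose p-value is not below the threshold.

    scores, p_values: dict mapping item -> dict mapping annotation -> float.
    """
    filtered = {
        item: {
            ann: (score if p_values[item][ann] < p_threshold else 0.0)
            for ann, score in row.items()
        }
        for item, row in scores.items()
    }
    if return_nonzero_annotations:
        # Keep only annotations that are non-zero for at least one item.
        keep = {ann for row in filtered.values() for ann, s in row.items() if s != 0}
        filtered = {
            item: {ann: s for ann, s in row.items() if ann in keep}
            for item, row in filtered.items()
        }
    return filtered


# Made-up scores and p-values for two items and two annotations.
scores = {
    "cell_1": {"Monocytes": 0.8, "T cells": 0.1},
    "cell_2": {"Monocytes": 0.6, "T cells": 0.2},
}
p_values = {
    "cell_1": {"Monocytes": 0.01, "T cells": 0.4},
    "cell_2": {"Monocytes": 0.2, "T cells": 0.3},
}

result = filter_scores(scores, p_values)
print(result)
```

In this example the "T cells" column disappears because no item passes the threshold for it, while "Monocytes" stays for both items even though only cell_1 has a non-zero score.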
Projection Annotator
This module clusters the projection of the data (usually a 2D projection) with one of the standard algorithms and attaches a certain number of labels to each cluster.
Example:
>>> from Orange.projection import TSNE
>>> from Orange.data import Table
>>> from orangecontrib.bioinformatics.utils import serverfiles
>>> from Orange.clustering import DBSCAN
>>> from orangecontrib.bioinformatics.annotation.annotate_projection import \
...     annotate_projection
>>> from orangecontrib.bioinformatics.annotation.annotate_samples import \
...     AnnotateSamples
>>>
>>> # load data
>>> data = Table("https://datasets.orange.biolab.si/sc/aml-1k.tab.gz")
>>> marker_p = serverfiles.localpath_download(
... 'marker_genes','panglao_gene_markers.tab')
>>> markers = Table(marker_p)
>>>
>>> # annotate data with labels
>>> annotator = AnnotateSamples()
>>> annotations = annotator.annotate_samples(data, markers)
>>>
>>> # project data in 2D
>>> tsne = TSNE(n_components=2)
>>> tsne_model = tsne(data)
>>> embedding = tsne_model(data)
>>>
>>> # get clusters and annotations for clusters
>>> clusters, clusters_meta, eps = annotate_projection(annotations, embedding,
... clustering_algorithm=DBSCAN, eps=1.2)
When the DBSCAN algorithm is used and eps is not provided to the annotate_projection function, eps is computed automatically with the k-nearest neighbours method.
>>> clusters, clusters_meta, eps = annotate_projection(annotations, embedding,
... clustering_algorithm=DBSCAN)
- annotate_projection(annotations, coordinates, clustering_algorithm=<class 'Orange.clustering.dbscan.DBSCAN'>, labels_per_cluster=3, **kwargs)[source]
The function clusters the data based on the coordinates and assigns a certain number of labels to each cluster. Each cluster is assigned its labels_per_cluster most common labels.
- Parameters
annotations (Orange.data.Table) – Table with annotations and their probabilities.
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clustering_algorithm (callable, optional (default = DBSCAN)) – Algorithm used in clustering.
labels_per_cluster (int, optional (default = 3)) – Number of labels that need to be assigned to each cluster.
- Returns
Orange.data.Table – List of cluster indices.
dict – Dictionary with the cluster index as a key and a list of annotations as a value. Each list includes tuples with the annotation name and its proportion in the cluster.
dict – The coordinates for locating the label. Dictionary with cluster index as a key and tuple (x, y) as a value.
- assign_labels(clusters, annotations, labels_per_cluster)[source]
This function assigns a certain number of labels to each cluster. Each cluster is assigned its labels_per_cluster most common labels.
- Parameters
clusters (Orange.data.Table) – Cluster indices for each item.
annotations (Orange.data.Table) – Table with annotations and their probabilities.
labels_per_cluster (int) – Number of labels that need to be assigned to each cluster.
- Returns
dict – Dictionary with the cluster index as a key and a list of annotations as a value. Each list includes tuples with the annotation name and its proportion in the cluster.
Orange.data.Table – The array with the annotation assigned to the item.
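The labelling step can be sketched with collections.Counter. The sketch assumes that every item has already been reduced to a single annotation name (for example its highest-scoring one) and works on plain lists instead of Orange tables; the name most_common_labels and the sample values are made up.

```python
from collections import Counter, defaultdict


def most_common_labels(clusters, item_labels, labels_per_cluster):
    """Return {cluster index: [(annotation name, proportion in cluster), ...]}."""
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, item_labels):
        per_cluster[cluster][label] += 1

    result = {}
    for cluster, counts in per_cluster.items():
        size = sum(counts.values())
        result[cluster] = [
            (label, count / size)
            for label, count in counts.most_common(labels_per_cluster)
        ]
    return result


# Made-up cluster indices and one annotation per item.
clusters = [0, 0, 0, 0, 1, 1]
item_labels = ["Monocytes", "Monocytes", "Monocytes", "T cells", "B cells", "B cells"]

labels = most_common_labels(clusters, item_labels, labels_per_cluster=2)
print(labels)
```

A cluster with fewer distinct annotations than labels_per_cluster simply gets a shorter list, as cluster 1 does here.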
- cluster_additional_points(coordinates, hulls, cluster_attribute=None)[source]
This function receives additional points and assigns them to the existing clusters based on the current concave hulls.
- Parameters
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
hulls (dict) – Concave hull for each cluster
cluster_attribute (Orange.data.DiscreteVariable (optional)) – A variable for clusters. If cluster_attribute is provided it will be used in the creation of the resulting Table.
- Returns
Cluster label for each point
- Return type
Orange.data.Table
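One common way to decide whether a new point falls inside an existing hull is the ray-casting (even-odd) test. The sketch below is an illustration under that assumption and not necessarily how the function is implemented; the names point_in_polygon and assign_to_hulls and the square hulls are made up.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge crosses the horizontal line through the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_to_hulls(points, hulls):
    """Return the index of the first hull containing each point, or None."""
    assigned = []
    for point in points:
        cluster = next(
            (index for index, hull in hulls.items() if point_in_polygon(point, hull)),
            None,
        )
        assigned.append(cluster)
    return assigned


# Two made-up square hulls and three new points; the last point lies outside both.
hulls = {
    0: [(0, 0), (2, 0), (2, 2), (0, 2)],
    1: [(5, 5), (7, 5), (7, 7), (5, 7)],
}
new_points = [(1, 1), (6, 6), (3, 3)]

assigned = assign_to_hulls(new_points, hulls)
print(assigned)
```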
- cluster_data(coordinates, clustering_algorithm=<class 'Orange.clustering.dbscan.DBSCAN'>, **kwargs)[source]
This function receives the data and clusters them.
- Parameters
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clustering_algorithm (callable) – Algorithm used for clustering.
- Returns
List of cluster indices.
- Return type
Orange.data.Table
- compute_concave_hulls(coordinates, clusters, epsilon)[source]
The function computes the points of the concave hull around the points of each cluster.
- Parameters
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clusters (Orange.data.Table) – Cluster indices for each item.
epsilon (float) – Epsilon used by DBSCAN to cluster the data
- Returns
The points of the concave hull. Dictionary with the cluster index as a key and an np.ndarray of points as a value - [[x1, y1], [x2, y2], [x3, y3], …]
- Return type
dict
- edges_to_polygon(edges_list, points_list)[source]
This function connects edges into polygons. It computes all possible hulls, since a cluster can have more than one when it has a hole in the middle, and then selects the outer hull.
- Parameters
edges_list (list) – List of edges. Each edge is represented as a tuple of two indices that give the starting and ending node. Each index corresponds to the location of the point in points_list.
points_list (list) – List of point locations. Each point has an x and a y location.
- Returns
The array with the hull/polygon points.
- Return type
np.ndarray
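The idea of connecting edges into closed loops and keeping the outer one can be sketched as below. The sketch assumes that every node belongs to exactly two edges (simple closed loops) and picks the outer hull as the loop with the largest area, which is an assumption about the selection rule rather than a statement of how the function does it. All names and the sample data are made up.

```python
from collections import defaultdict


def edges_to_loops(edges):
    """Walk the edges and return every closed loop as an ordered list of point indices."""
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    unvisited = set(neighbours)
    loops = []
    while unvisited:
        start = min(unvisited)
        unvisited.discard(start)
        loop, previous, current = [start], None, start
        while True:
            # Continue to the neighbour we did not just come from.
            following = [n for n in neighbours[current] if n != previous][0]
            if following == start:
                break
            loop.append(following)
            unvisited.discard(following)
            previous, current = current, following
        loops.append(loop)
    return loops


def polygon_area(polygon):
    """Unsigned polygon area from the shoelace formula."""
    n = len(polygon)
    twice_area = sum(
        polygon[i][0] * polygon[(i + 1) % n][1] - polygon[(i + 1) % n][0] * polygon[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def outer_polygon(edges, points):
    """Return the loop with the largest area as a list of (x, y) points."""
    polygons = [[points[i] for i in loop] for loop in edges_to_loops(edges)]
    return max(polygons, key=polygon_area)


# A square (points 0-3) with a square hole in the middle (points 4-7), edges unordered.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3)]
edges = [(4, 5), (0, 1), (2, 3), (5, 6), (1, 2), (6, 7), (3, 0), (7, 4)]

hull = outer_polygon(edges, points)
print(hull)
```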
- get_epsilon(coordinates, k=10, skip=0.1)[source]
The function computes the epsilon parameter for DBSCAN using the method proposed in the paper.
- Parameters
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
k (int) – Which nearest neighbour to observe (the k-th one).
skip (float) – Percentage of skipped neighbours.
- Returns
Epsilon parameter for DBSCAN
- Return type
float
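A k-nearest-neighbour heuristic for epsilon can be sketched as follows. The exact rule is not spelled out here, so the sketch shows only one plausible reading: take each point's distance to its k-th nearest neighbour, ignore the largest skip fraction of those distances as outliers, and use the largest remaining one. The function name knn_epsilon and the sample points are made up, and the function requires more than k points.

```python
import math


def knn_epsilon(points, k=10, skip=0.1):
    """Largest k-th neighbour distance after dropping the top `skip` fraction."""
    kth_distances = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth_distances.append(distances[k - 1])

    kth_distances.sort()
    # Number of distances kept after discarding the largest ones.
    kept = len(kth_distances) - int(len(kth_distances) * skip)
    return kth_distances[kept - 1]


# Four evenly spaced points and one far-away outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]

eps = knn_epsilon(points, k=1, skip=0.2)
print(eps)
```

With skip=0 the outlier determines the result, which is why discarding a fraction of the largest distances makes the estimate more robust.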
- labels_locations(coordinates, clusters)[source]
The function computes the location of the label for each cluster. The location is computed as the center point of the cluster.
- Parameters
coordinates (Orange.data.Table) – Visualisation coordinates - embeddings
clusters (Orange.data.Table) – Cluster indices for each item.
- Returns
The coordinates for locating the label. Dictionary with cluster index as a key and tuple (x, y) as a value.
- Return type
dict
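The label placement can be sketched as below, reading "center point" as the arithmetic mean of the cluster's coordinates. That reading is an assumption, since a median or a hull-based center would also fit the description. The sketch uses plain lists instead of Orange tables, and the name label_locations_sketch is made up.

```python
from collections import defaultdict


def label_locations_sketch(coordinates, clusters):
    """Return {cluster index: (x, y)} with the mean coordinate of each cluster."""
    grouped = defaultdict(list)
    for (x, y), cluster in zip(coordinates, clusters):
        grouped[cluster].append((x, y))

    return {
        cluster: (
            sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts),
        )
        for cluster, pts in grouped.items()
    }


# Made-up 2D coordinates and cluster indices.
coordinates = [(0, 0), (2, 0), (2, 2), (0, 2), (10, 10)]
clusters = [0, 0, 0, 0, 1]

locations = label_locations_sketch(coordinates, clusters)
print(locations)
```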