Importing Feature Vectors

Typically, an image analysis will create a set of feature vectors, one for each well of the plate.

Metadata Configuration

Image analysis results aggregated on the well level should be stored in datasets of a certain type. Specifically, the dataset type code should always begin with "HCS_ANALYSIS_WELL".

The user can use the predefined type HCS_ANALYSIS_WELL_FEATURES, or create more specific types like HCS_ANALYSIS_WELL_QUALITY, HCS_ANALYSIS_WELL_SUMMARY, or HCS_ANALYSIS_WELL_CLASSIFICATION to distinguish different types of image analysis results.

Flexible Importing

To enable importing of the analysis data from a file or a set of files in any format, use the flexible dropbox tool, which is configurable using Python.

Configuring the Datastore Server

Note that in order to use the dropbox, the storage processor ch.systemsx.cisd.openbis.dss.etl.featurevector.FeatureVectorStorageProcessor, or a subclass of it, needs to be configured. Otherwise, the feature data will not be written to the database.

A screening drop box for importing image analysis results should be created as a core plugin of type drop-boxes. The plugin.properties reads:

plugin.properties
incoming-dir = ${incoming-root-dir}/incoming-analysis
incoming-data-completeness-condition = auto-detection
top-level-data-set-handler = ch.systemsx.cisd.openbis.dss.etl.jython.v2.JythonPlateDataSetHandlerV2
script-path = dropbox.py
storage-processor = ch.systemsx.cisd.openbis.dss.etl.featurevector.FeatureVectorStorageProcessor
storage-processor.processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
storage-processor.data-source = imaging-db

Ensure that incoming-root-dir is defined in service.properties.

Define the Jython dropbox script in dropbox.py.

If the folder incoming-analysis doesn't exist, it will be created on DSS start-up.
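For reference, incoming-root-dir is typically defined in the DSS service.properties along these lines (the path shown is only an example; adjust it to your installation):

```properties
# service.properties (example value -- adjust to your installation)
incoming-root-dir = ${root-dir}/incoming
```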

Jython Dropbox Configuration

To demonstrate how the API defines feature vectors, here is a very simple example which does not read the data from any files, but constructs it in memory instead. You will have to redefine the defineFeatures(), extractPlateCode(), and extractSpaceCode() methods as follows:

# This is an example Jython dropbox for importing feature vectors coming from analysis of image datasets

import os
from ch.systemsx.cisd.openbis.dss.etl.dto.api.v2 import SimpleFeatureVectorDataConfig 

# Specific code which defines the feature vector values for the dataset.
# Usually you will parse the content of the incoming file or directory to get the values.
# Here all the values are hard-coded for simplicity,
# but the example shows which calls you need to perform in your parser.
# Parameters:
#     incoming: java.io.File which points to the incoming dataset
def defineFeatures(featuresBuilder, incoming):
        # define INFECTION_INDEX feature
        infectionFeature = featuresBuilder.defineFeature("INFECTION_INDEX")
        # optionally you can set the label and description of the feature
        infectionFeature.setFeatureLabel("Infection Index")
        infectionFeature.setFeatureDescription("What percentage of the cells in the well has been infected?")
        # set values for each well
        infectionFeature.addValue("A1", "3.432")
        # Instead of the well code you can use row and column numbers. For B1 it would be (2,1)
        infectionFeature.addValue(2, 1, "5.343")
        infectionFeature.addValue("C1", "0.987")

        # define QUALITY feature
        qualityFeature = featuresBuilder.defineFeature("QUALITY")
        qualityFeature.addValue("A1", "GOOD")
        qualityFeature.addValue("B1", "BAD")
        qualityFeature.addValue("C1", "GOOD")

# Returns the code of the plate to which the dataset should be connected.
# Parameters 
#     incoming: java.io.File which points to the incoming dataset
def extractPlateCode(incoming):
    return os.path.splitext(incoming.getName())[0]

def extractSpaceCode(incoming):
    return "TEST"

# ----------------------------                
# --- boilerplate code which registers one dataset with image analysis results on the well level
# --- Nothing has to be modified if your case is not complicated.
# ----------------------------                

def process(transaction): 
    incoming = transaction.getIncoming()
    config = SimpleFeatureVectorDataConfig()
    featuresBuilder = config.featuresBuilder
    defineFeatures(featuresBuilder, incoming) 

    analysisDataset = transaction.createNewFeatureVectorDataSet(config, incoming)
    # set analysis procedure (optional)
    analysisDataset.setAnalysisProcedure("MY-PROCEDURE-08-15")

    # set plate to which the dataset should be connected
    sampleIdentifier = "/"+extractSpaceCode(incoming)+"/"+extractPlateCode(incoming)
    plate = transaction.getSample(sampleIdentifier)
    analysisDataset.setSample(plate)

    # store the original file in the dataset.
    transaction.moveFile(incoming.getPath(), analysisDataset)

    # ----------------------------                   
    # --- optional: other standard operations on analysisDataset can be performed (see IDataSet interface)
    # ----------------------------                

    #analysisDataset.setFileFormatType("CSV")
    #analysisDataset.setDataSetType("HCS_ANALYSIS_WELL_FEATURES")
    #analysisDataset.setPropertyValue("DESCRIPTION", incoming.getName())
    #analysisDataset.setParentDatasets(["20110302085840150-90"])

To see how it works:

  • Create a file named MY-PLATE.csv
  • Create a plate named MY-PLATE in the TEST space.
  • Copy the file to the incoming-analysis folder.
  • Go to the web browser and display the MY-PLATE detail view (you may need to refresh the page in order to see it). You should be able to display a heatmap for the registered features.

Documentation:

  • The featuresBuilder object implements the IFeaturesBuilder interface
  • featuresBuilder.defineFeature("<FEATURE_NAME>") returns an object implementing the IFeatureDefinition interface

Here is an example of a defineFeatures() implementation. It parses .csv files in which each feature is stored as a block: a header line starting with the feature code, followed by one line per plate row with the values for each column; blocks are separated by an empty line.

SEPARATOR = ","

def defineFeatures(featuresBuilder, incoming):
    file = open(incoming.getPath())
    try:
        # Each feature is a block: a header line whose first token is the
        # feature code, followed by one line per plate row.
        for header in file:
            headerTokens = header.split(SEPARATOR)
            featureCode = headerTokens[0].strip()
            featureValues = featuresBuilder.defineFeature(featureCode)
            for rowValues in file:
                rowTokens = rowValues.split(SEPARATOR)
                rowLabel = rowTokens[0].strip()
                if len(rowLabel) == 0:
                    # An empty line ends the current feature block.
                    break
                for column in range(1, len(headerTokens)):
                    value = rowTokens[column].strip()
                    well = rowLabel + str(column)
                    featureValues.addValue(well, value)
    finally:
        file.close()
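The file layout this parser expects can be illustrated with a small, hypothetical example. The sketch below is plain CPython for illustration only (the dropbox itself runs under Jython and receives a java.io.File); the feature codes and values are made up:

```python
import io

# Hypothetical input: one block per feature. The header line carries the
# feature code; each following line is a plate row (label + one value per column).
EXAMPLE = """INFECTION_INDEX,1,2
A,3.432,1.652
B,5.343,2.321

QUALITY,1,2
A,GOOD,GOOD
B,BAD,GOOD
"""

def parse_features(f, separator=","):
    """Same loop structure as the Jython defineFeatures() above."""
    features = {}
    for header in f:
        tokens = header.split(separator)
        code = tokens[0].strip()
        if not code:
            continue
        values = features.setdefault(code, {})
        for row in f:
            row_tokens = row.split(separator)
            label = row_tokens[0].strip()
            if not label:
                break  # empty line ends the current feature block
            for column in range(1, len(tokens)):
                # Well code = row label + column number, e.g. "A1"
                values[label + str(column)] = row_tokens[column].strip()
    return features

features = parse_features(io.StringIO(EXAMPLE))
print(features["INFECTION_INDEX"]["A1"])  # -> 3.432
```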

Importing features for timepoint or depth-scan series

Here is a simple example demonstrating how to import feature vectors for each timepoint in a series.
The values will be imported into the database, but since the current user interface does not yet support viewing multiple timepoints, only the values of the first timepoint will be shown.

def defineFeatures(featuresBuilder, incoming):
        # define INFECTION_INDEX feature
        infectionFeature = featuresBuilder.defineFeature("INFECTION_INDEX")
        # Define the feature values for the timepoint 100. 
        # The second argument is the depth and can be used if depth-scans are performed.
        infectionFeature.changeSeries(100, None)
        infectionFeature.addValue("A1", "3.432")
        infectionFeature.addValue("B1", "5.343")
        infectionFeature.addValue("C1", "0.987")
        # Define the feature values for the timepoint 200. 
        infectionFeature.changeSeries(200, None)
        infectionFeature.addValue("A1", "1.652")
        infectionFeature.addValue("B1", "2.321")
        infectionFeature.addValue("C1", "0.121")

Analysis Procedures

In HCS scenarios, image datasets can be analysed several times using different algorithms. This is useful in many cases, such as when bioinformaticians make serial improvements to the analysis procedure, or they wish to try different analysis approaches.

Example: Imagine that all the plates of an assay have been analysed using 3 different algorithms: A, B and C. In openBIS, these algorithms are called 'analysis procedures'.

openBIS offers a way to aggregate all of the analysis results for a particular gene or compound: for example, it can find all the wells where a particular gene has been screened and, for each feature, calculate the median value over all replicas.
Clearly, if we have 3 sets of analysis results for a plate, each produced with a different algorithm, they should not be mixed with each other when calculating aggregates.
Additionally, when the analysis result dataset for a plate is requested (e.g. through the API), specifying the analysis procedure name helps return a unique result.

For situations such as these, openBIS makes it possible to record which analysis procedure has been used to produce each analysis dataset. Internally, this information is stored as a dataset property $ANALYSIS_PROCEDURE.

To set the analysis procedure in the dropbox, use the following recommended method (internally it uses the default mechanism for setting properties):

analysisDataset.setAnalysisProcedure("MY-PROCEDURE-08-15")

The analysis procedure should be set at least for:

  • Well-level image analysis datasets (HCS_ANALYSIS_WELL* type)
  • Image segmentation datasets (HCS_IMAGE_SEGMENTATION* type)

openBIS offers views in the web browser to utilize this information.

To summarize:

  • Two datasets should have the same analysis procedure only if the algorithm used to compute them was the same and the results are comparable
  • It does not make sense to have two datasets of the same plate with the same analysis procedure

Features Lists

It is very common for many different features to be registered, in which case the amount of data becomes difficult to handle. To make it easier, it is possible to register features lists, which group together several features. By selecting one of the lists, the user filters out all the features that are not included in that list.

Registering new features list

Features lists are registered per feature vector data set. This means that every feature vector data set can have a different set of connected features lists. A new features list can be registered via a dropbox. Below you can find a simple example of dropbox code.

from ch.systemsx.cisd.openbis.dss.etl.dto.api.v2 import FeatureListDataConfig 


def process(transaction):
    container = transaction.getDataSetForUpdate(transaction.getIncoming().getName())
    config = FeatureListDataConfig()
    config.setName("Example Features List")
    config.setFeatureList(["FEATURE_1", "FEATURE_2"])
    config.setContainerDataSet(container)
    transaction.createNewFeatureListDataSet(config)

The dropbox above registers a new features list called Example Features List. The list consists of 2 features: FEATURE_1 and FEATURE_2.

Every dropbox that registers a features list needs to perform the following steps:

  1. Create a new instance of the ch.systemsx.cisd.openbis.dss.etl.dto.api.v2.FeatureListDataConfig config object.
  2. Define the name by calling the setName method on the config object.
  3. Specify the list of feature codes to be included by calling the setFeatureList method on the config object.
  4. Specify the feature vector container data set by calling the setContainerDataSet method on the config object.
  5. Finally, once all the configuration is set, call the createNewFeatureListDataSet method on the transaction object.

Dealing with Features Lists

Features lists are visible on the Plate View. To the right of the Choose heatmap kind dropdown list, there is a Choose features list dropdown list with all the features lists registered for the given feature vector data set. When All is chosen, all heatmap kinds are available, but when the user selects one of the features lists, the list of heatmap kinds is narrowed to the values available on that list. To remove a features list, go to the feature vector data set view, open the Contained tab, and simply delete the unnecessary features list data sets.

Value Types

Features can be either float numbers or strings. A feature is either one or the other; it cannot be a mix. The value type is determined automatically by openBIS: if all provided values can be parsed as floats, the feature will be a float feature; otherwise it will be a string feature. Note that setting a value like NaN will make the feature a string feature! See below for how to handle missing values correctly.
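A minimal sketch of this type-inference rule (plain Python, not openBIS code): a feature becomes a float feature only if every provided value parses as a float.

```python
def inferred_type(values):
    """Sketch of the rule described above; not the actual openBIS implementation."""
    for v in values:
        try:
            float(v)
        except ValueError:
            return "STRING"
        # "NaN" would parse as a float here, but openBIS treats it as a string
        # value, so reject it explicitly (see Missing Values below).
        if v.strip().lower() == "nan":
            return "STRING"
    return "FLOAT"

print(inferred_type(["3.432", "5.343", "0.987"]))  # -> FLOAT
print(inferred_type(["GOOD", "BAD", "GOOD"]))      # -> STRING
print(inferred_type(["1.0", "NaN"]))               # -> STRING
```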

 

Missing Values

Missing values are provided by simply not setting a value. In essence, all feature values start out as missing, and you need to set a value for a given location to change that. Do not set a value of NaN or N/A to represent a missing value, as this will lead the feature vector to be misinterpreted as a string-type feature.

Importing from a CSV file

File Format of Feature Vector Data

The file is a CSV or TSV file. The actual separator can be specified in the DSS configuration. The first line contains the column headers. Each following line contains the feature values for one well. The well is denoted by one or two columns, whose names have to be specified. If both names are identical, there is only one column denoting wells, written as a combination of a letter and digits (e.g. A1). Otherwise, the row column contains a letter or a number, and the column column contains a number. A feature is a column with numerical values. An unknown value of a feature for a particular well can be denoted by NaN. Columns with non-numerical values (like bar codes) are ignored.

A column header defines the label as well as the code of a feature. The label is used for output (column headers in tables, axis labels in plots). The code is used for data retrieval, e.g. when defining a custom column or filter in an openBIS table. By default the label is just the header and the code is the normalized label. Normalization is done as follows:

  1. The label is converted to upper case.
  2. All special characters (i.e. characters which are not letters A-Z or digits) are replaced by an underscore character '_'.

If the code should be different from the normalized label, the following syntax has to be used for the column header in the feature vector file: <code> label. Note that the actual code is the normalized code. If the label is missing, the actual label will be the same as the actual code.
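These normalization and header-parsing rules can be sketched in plain Python (for illustration only; the actual parsing is done by openBIS):

```python
import re

def normalize(label):
    # Upper-case, then replace anything that is not A-Z or 0-9 with '_'.
    return re.sub(r"[^A-Z0-9]", "_", label.upper())

def parse_header(header):
    # A header of the form "<code> label" carries an explicit code;
    # otherwise the code is derived from the label itself.
    # If the label part is missing, the label falls back to the code.
    m = re.match(r"^<([^>]+)>\s*(.*)$", header.strip())
    if m:
        code = normalize(m.group(1))
        label = m.group(2) or code
    else:
        label = header.strip()
        code = normalize(label)
    return code, label

print(parse_header("<hitRate> Hit Rate"))  # -> ('HITRATE', 'Hit Rate')
print(parse_header("cellNumber"))          # -> ('CELLNUMBER', 'cellNumber')
print(parse_header("<feature2>"))          # -> ('FEATURE2', 'FEATURE2')
```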

Here are some valid but incomplete examples:

  1. row,col,cellNumber,<hitRate> Hit Rate,feature1,<feature2>
    A,1,25,0.0030874,NaN
    B,1,181,0.9562,3.45E-06
    A,2,85,0.57143,0.004177
    B,2,53,0.15483,2.38E-07
    
  2. WellName,cellNumber,hitRate
    A01,25,1.0653
    B01,181,0.9562
    A02,85,0.57143
    B02,53,0.15483
    


Configuration Parameters

In the plugin.properties a FeatureVectorStorageProcessor has to be configured for uploading feature vector files of the above-mentioned format. The following properties control variations of this format:

Property Key      Default Value   Description
separator         ;               Separator character between headers and row cells.
ignore-comments   true            If true, lines starting with '#' will be ignored.
well-name-row     WellName        Header of the column denoting the row of a well.
well-name-col     WellName        Header of the column denoting the column of a well.
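Assuming these keys are set on the storage processor, with the same key prefix as data-source in the earlier plugin.properties example (verify the prefix against your DSS setup), overriding the defaults for the row,col file format from example 1 might look like:

```properties
# Hypothetical fragment -- key prefix assumed; adjust to your installation.
storage-processor = ch.systemsx.cisd.openbis.dss.etl.featurevector.FeatureVectorStorageProcessor
storage-processor.separator = ,
storage-processor.ignore-comments = true
storage-processor.well-name-row = row
storage-processor.well-name-col = col
```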

Importing other types of analysis results

openBIS can also store the results of other types of image analysis. Such datasets can be attached to each plate or each well.
The user can download such a dataset later on and browse it on their computer. More sophisticated functionality may be added later.
The following types of datasets should be used:

  • HCS_ANALYSIS_CELL_SEGMENTATION for HCS image analysis cell segmentation results
  • HCS_ANALYSIS_CELL_FEATURES for HCS image analysis cell feature vectors results
  • HCS_ANALYSIS_CELL_CLASS for HCS image analysis cell classification results