API Documentation#

Classifier#

Regressor#

Formatter#

From the OpenAI Docs:

To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt. Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace. Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion. For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
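The formatting rules quoted above can be sketched in a few lines of plain Python. This is only an illustration: the function names `make_training_example` and `make_inference_prompt` are hypothetical and not part of this package, and the stop sequence `\n` is just one of the options the quote mentions.

```python
SEPARATOR = "\n\n###\n\n"  # fixed separator that ends every prompt
STOP_SEQUENCE = "\n"       # fixed stop sequence that ends every completion


def make_training_example(prompt_text: str, completion_text: str) -> dict:
    """Build one prompt/completion pair following the quoted rules."""
    if SEPARATOR in prompt_text:
        raise ValueError("The separator must not appear inside a prompt.")
    if STOP_SEQUENCE in completion_text:
        raise ValueError("The stop sequence must not appear inside a completion.")
    return {
        "prompt": prompt_text + SEPARATOR,
        # Completions start with a whitespace because of the tokenization.
        "completion": " " + completion_text + STOP_SEQUENCE,
    }


def make_inference_prompt(prompt_text: str) -> str:
    """Format a prompt for inference exactly as during training."""
    return prompt_text + SEPARATOR


example = make_training_example("What is the solubility of CCO?", "high")
```

At inference time the same stop sequence would be passed to the API so that the completion is truncated at the right place.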

sanitize_smiles(smi)[source]#

Return a canonical SMILES representation of smi.

Parameters: smi (string) : SMILES string to be canonicalized

Returns:

mol (rdkit.Chem.rdchem.Mol) : RDKit mol object (None if smi is an invalid SMILES string)
smi_canon (string) : Canonicalized SMILES representation of smi (None if smi is an invalid SMILES string)
conversion_successful (bool) : True if the conversion was successful, False otherwise

mutate_selfie(selfie, max_molecules_len, write_fail_cases=False)[source]#

Return a mutated SELFIES string (only one mutation is performed on the input string).

Mutations are attempted until a valid molecule is obtained. Rules of mutation: with 50% probability, either:

Add a random SELFIES character to the string

Replace a random SELFIES character with another

Parameters:

selfie (string) : SELFIES string to be mutated
max_molecules_len (int) : Mutations of the SELFIES string are allowed up to this length
write_fail_cases (bool) : If true, failed mutations are recorded in “selfie_failure_cases.txt”

Returns:

selfie_mutated (string) : Mutated SELFIES string
smiles_canon (string) : Canonical SMILES of the mutated SELFIES string
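The mutation rule described for mutate_selfie can be illustrated with a minimal sketch that works on a list of SELFIES characters. Everything here is an assumption made for illustration: the helper `mutate_tokens`, the toy `ALPHABET`, and the way the length limit is applied (insertion is skipped once the limit is reached). The sketch also omits the validity check and retry loop that the documented function performs.

```python
import random

# Hypothetical toy alphabet; the real function presumably uses a larger one.
ALPHABET = ["[C]", "[=C]", "[N]", "[O]", "[F]"]


def mutate_tokens(tokens, max_len, rng):
    """Apply exactly one insertion or one replacement to a token list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    mutated = list(tokens)  # leave the input untouched
    can_insert = len(mutated) < max_len
    if can_insert and rng.random() < 0.5:
        # Insertion: add a random character at a random position (ends included).
        position = rng.randint(0, len(mutated))
        mutated.insert(position, rng.choice(ALPHABET))
    else:
        # Replacement: swap one character for a different one.
        position = rng.randrange(len(mutated))
        candidates = [t for t in ALPHABET if t != mutated[position]]
        mutated[position] = rng.choice(candidates)
    return mutated


rng = random.Random(0)
print(mutate_tokens(["[C]", "[=C]", "[C]"], max_len=5, rng=rng))
```

Passing the random number generator explicitly keeps the sketch reproducible.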

get_selfie_chars(selfie)[source]#

Obtain a list of all SELFIES characters in the string selfie.

Parameters: selfie (string) : A SELFIES string representing a molecule

Example:

>>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']

Returns: chars_selfie : list of SELFIES characters present in the molecule selfie
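One way to obtain such a list is to collect every bracketed group with a regular expression. The sketch below is not necessarily how get_selfie_chars is implemented; the name `split_selfies` is hypothetical, and the sketch assumes the input consists only of bracketed characters.

```python
import re


def split_selfies(selfie: str) -> list:
    """Return every bracketed SELFIES character in order of appearance."""
    # Match an opening bracket, any run of non-"]" characters, and the closing bracket.
    return re.findall(r"\[[^\]]*\]", selfie)


chars = split_selfies("[C][=C][C][=C][C][=C][Ring1][Branch1_1]")
print(chars)
```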

class ForwardFormatter[source]#

Convert a dataframe to a dataframe of prompts and completions for classification or regression.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”
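To make the templates concrete, the following sketch fills them with the default replacements listed above using plain `str.format`. The constant and function names are hypothetical, and the formatter classes may assemble the strings differently.

```python
PROMPT_TEMPLATE = (
    "{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}"
)
COMPLETION_TEMPLATE = "{start_completion}{label}{stop_sequence}"

# Default replacements as listed in the documentation above.
DEFAULTS = {
    "prefix": "",
    "suffix": "?",
    "end_prompt": "###",
    "start_completion": " ",
    "stop_sequence": "@@@",
}


def format_example(representation, label, propertyname):
    """Return a (prompt, completion) pair built from the default templates."""
    prompt = PROMPT_TEMPLATE.format(
        prefix=DEFAULTS["prefix"],
        propertyname=propertyname,
        representation=representation,
        suffix=DEFAULTS["suffix"],
        end_prompt=DEFAULTS["end_prompt"],
    )
    completion = COMPLETION_TEMPLATE.format(
        start_completion=DEFAULTS["start_completion"],
        label=label,
        stop_sequence=DEFAULTS["stop_sequence"],
    )
    return prompt, completion


prompt, completion = format_example("CCO", 1, "solubility")
# prompt     -> "What is the solubility of CCO?###"
# completion -> " 1@@@"
```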

class ClassificationFormatter(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#

Convert a dataframe to a dataframe of prompts and completions for classification.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”

We map classes to integers, following the advice from OpenAI’s documentation:

From the OpenAI Docs:

Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.

Initialize a ClassificationFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – The name of the representation (e.g. "SMILES").
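The difference between the two binning modes can be shown without any dependencies. The sketch assumes that qcut and cut refer to quantile-based (equal-frequency) and equal-width binning, as in the pandas functions of the same names. The helper names are hypothetical and the edge handling is simplified compared with pandas.

```python
from bisect import bisect_left
from statistics import quantiles


def equal_width_bins(values, num_classes):
    """Assign integer classes using bins of equal width (the "cut" mode)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes
    inner_edges = [lo + i * width for i in range(1, num_classes)]
    # A value equal to an edge falls into the lower bin (right-closed bins).
    return [bisect_left(inner_edges, v) for v in values]


def equal_frequency_bins(values, num_classes):
    """Assign integer classes using quantile edges (the "qcut" mode)."""
    inner_edges = quantiles(values, n=num_classes, method="inclusive")
    return [bisect_left(inner_edges, v) for v in values]


labels = [1, 2, 3, 4, 100]
print(equal_width_bins(labels, 2))      # [0, 0, 0, 0, 1]
print(equal_frequency_bins(labels, 2))  # [0, 0, 0, 1, 1]
```

With a skewed label distribution, equal-width bins leave one class nearly empty, whereas quantile bins keep the classes roughly balanced.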

property class_names: List[int]#: Names of the classes.

bin(y)[source]#: Bin the inputs based on the bins used for the dataset.

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns: A dataframe with a prompt column and a completion column.
Return type: pd.DataFrame

class ClassifictionFormatterWithExamples(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#

Initialize a ClassificationFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – The name of the representation (e.g. "SMILES").

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns: A dataframe with a prompt column and a completion column.
Return type: pd.DataFrame

class RegressionFormatter(representation_column, label_column, property_name, num_digits=2)[source]#

Convert a dataframe to a dataframe of prompts and completions for regression.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”

Initialize a RegressionFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_digits (int) – The number of digits to round the label to.

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns: A dataframe with a prompt column and a completion column.
Return type: pd.DataFrame
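As a rough illustration of what formatting for regression involves, the sketch below drops rows with missing values and writes labels with a fixed number of digits. It operates on plain tuples instead of a dataframe, the function name is hypothetical, and writing labels with fixed-point formatting is an assumption about how num_digits is applied.

```python
import math


def format_regression_rows(rows, property_name, num_digits=2):
    """Turn (representation, label) pairs into prompt/completion dicts."""
    formatted = []
    for representation, label in rows:
        # Skip rows with a missing representation or label.
        if representation is None or label is None:
            continue
        if isinstance(label, float) and math.isnan(label):
            continue
        label_text = f"{label:.{num_digits}f}"  # keep num_digits decimals
        formatted.append(
            {
                "prompt": f"What is the {property_name} of {representation}?###",
                "completion": f" {label_text}@@@",
            }
        )
    return formatted


rows = [("CCO", 1.23456), (None, 2.0), ("CCN", float("nan")), ("CC", 0.5)]
examples = format_regression_rows(rows, "solubility")
```

Only the first and last rows survive, with completions " 1.23@@@" and " 0.50@@@".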

class InverseFormatter[source]#: From the OpenAI Docs:

Using Lower learning rate and only 1-2 epochs tends to work better for these use cases

Querier#

class Querier(modelname, max_tokens=10)[source]#

Wrapper around the OpenAI API for querying a model for completions.

This class tries to be as efficient as possible by querying the API in batches. It also handles the rate limiting of the API.

Example

>>> import pandas as pd
>>> querier = Querier("ada")
>>> df = pd.DataFrame({"prompt": ["This is a test", "This is another test"]})
>>> completions = querier.query(df)
>>> len(completions) == 2
True
>>> all(isinstance(c, str) for c in completions)
True
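Batching and rate-limit handling of the kind described for Querier could look roughly like the sketch below. It does not call the OpenAI API: `send_batch` is a hypothetical callable standing in for the request, `RuntimeError` stands in for a rate-limit error, and the exponential backoff schedule is an assumption and not necessarily what the class does.

```python
import time


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_backoff(func, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(batch), waiting exponentially longer after each failure."""
    for attempt in range(max_retries):
        try:
            return func(batch)
        except RuntimeError:  # stand-in for a rate-limit error
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def query_in_batches(prompts, send_batch, batch_size=20, sleep=time.sleep):
    """Send prompts batch by batch and collect all completions in order."""
    completions = []
    for batch in chunked(prompts, batch_size):
        completions.extend(call_with_backoff(send_batch, batch, sleep=sleep))
    return completions
```

The `sleep` argument is injectable so that the retry logic can be exercised without actually waiting.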

classmethod from_preset(modelname, preset='classification')[source]#

Factory method to create a Querier from a preset.

These presets set the max_tokens parameter to a value that is appropriate for the task.

query(df, temperature=0, logprobs=None)[source]#

Query the model for completions.

Parameters:

df (pd.DataFrame) – DataFrame containing a column named “prompt”
temperature (float) – Temperature of the softmax. Defaults to 0.
logprobs (Optional[int]) – The number of logprobs to return. For classification, set it to the number of classes. Defaults to None.

Raises:

ValueError – If df is not a pandas DataFrame
ValueError – If df does not have a column named “prompt”
AssertionError – If temperature is < 0

Returns:

Dictionary containing the completions and logprobs

Return type:

dict
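The role of the temperature parameter can be illustrated with a small softmax sketch. This is a generic illustration and not code from this package; the function name is hypothetical. A temperature of 0 is treated as the greedy limit in which all probability mass goes to the largest logit, which is why the default of 0 yields deterministic completions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy limit: the largest logit receives all the probability mass.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


greedy = softmax_with_temperature([1.0, 2.0, 3.0], 0)
warm = softmax_with_temperature([1.0, 2.0, 3.0], 1.0)
hot = softmax_with_temperature([1.0, 2.0, 3.0], 10.0)
```

Higher temperatures flatten the distribution, so `hot` is closer to uniform than `warm`.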

Tuner#

Extractor#

class ClassificationExtractor[source]#: Extract integers from completions of classification tasks.

class FewShotClassificationExtractor[source]#: Extract integers from completions of few-shot classification tasks.

class FewShotRegressionExtractor[source]#: Extract floats from completions of few-shot regression tasks.

class RegressionExtractor[source]#: Extract floats from completions of regression tasks.

class InverseExtractor[source]#: Extract strings from completions of inverse tasks.

class SolventExtractor[source]#: Extract solvent name and composition from completions of solvent tasks.
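Extracting a number from a completion typically amounts to cutting the text at the stop sequence and parsing the first numeric match. The sketch below shows this for integers and floats. The function names and regular expressions are hypothetical, and the extractor classes above may apply different rules.

```python
import re

INT_PATTERN = re.compile(r"-?\d+")
FLOAT_PATTERN = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_int(completion, stop_sequence="@@@"):
    """Return the first integer before the stop sequence, or None."""
    text = completion.split(stop_sequence)[0]
    match = INT_PATTERN.search(text)
    return int(match.group()) if match else None


def extract_float(completion, stop_sequence="@@@"):
    """Return the first float before the stop sequence, or None."""
    text = completion.split(stop_sequence)[0]
    match = FLOAT_PATTERN.search(text)
    return float(match.group()) if match else None


print(extract_int(" 3@@@ 4@@@"))    # 3
print(extract_float(" -1.25@@@"))   # -1.25
```

Returning None for unparsable completions lets the caller decide whether to drop or flag such predictions.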

Evaluator#

Data#

get_photoswitch_data()[source]#

Return the photoswitch data as a pandas DataFrame.

Return type: DataFrame

References

[GriffithsPhotoSwitches] Griffiths, K.; Halcovitch, N. R.; Griffin, J. M. Efficient Solid-State Photoswitching of Methoxyazobenzene in a Metal–Organic Framework for Thermal Energy Storage. Chemical Science 2022, 13 (10), 3014–3019.

get_polymer_data()[source]#

Return the dataset reported in [JablonkaAL].

Return type: DataFrame

get_moosavi_mof_data()[source]#

Return the data and features used in [MoosaviDiversity].

You can find the original datasets on the Materials Cloud Archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF.

Return type: DataFrame

get_moosavi_cv_data()[source]#

Return the gravimetric heat capacity used in [MoosaviCp].

You can find the original datasets on the Materials Cloud Archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.

Return type: DataFrame

get_moosavi_pcv_data()[source]#

Return the site-projected heat capacity and features used in [MoosaviCp].

You can find the original datasets on the Materials Cloud Archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.

Return type: DataFrame

get_qmug_data()[source]#

Return the data and features used in [QMUG].

We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.

Return type: DataFrame

get_qmug_small_data()[source]#

Return the data and features used in [QMUG], restricted to the subset of short SMILES.

We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.

Return type: DataFrame

get_hea_phase_data()[source]#

Return the dataset reported in [Pei].

Return type: DataFrame

get_opv_data()[source]#

Return the dataset reported in [NagasawaOPV].

Return type: DataFrame

get_esol_data()[source]#

Return the dataset reported in [ESOL].

Return type: DataFrame

get_solubility_test_data()[source]#

Return the dataset reported in [soltest].

Return type: DataFrame

get_doyle_rxn_data()[source]#

Return the reaction dataset reported in [Doyle].

Return type: DataFrame

get_suzuki_rxn_data()[source]#

Return the reaction dataset reported in [Suzuki].

Return type: DataFrame

get_freesolv_data()[source]#

Return the FreeSolv data [freesolv].

Return type: DataFrame

get_lipophilicity_data()[source]#

Return the lipophilicity data parsed from ChEMBL [chembl].

Return type: DataFrame

get_mof_solvent_data()[source]#

Return the MOF reaction data []

Return type: DataFrame

get_matbench_glass()[source]#: Return the glass formation ability dataset from matbench

get_matbench_is_metal()[source]#: Return the is metal dataset from matbench [matbench]

get_matbench_expt_gap()[source]#: Return the experimental band gap dataset from matbench [matbench]

get_matbench_steels()[source]#: Return the steel yield strength dataset from matbench [matbench]

get_water_stability()[source]#: Return the water stability dataset used in [waterStability]