API Documentation
Classifier
Regressor
Formatter
From the OpenAI Docs:
To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.
Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
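The conventions quoted above can be illustrated with a small helper. This is only a sketch: the names make_example and truncate are hypothetical, and the separator and stop sequence are simply the ones suggested in the quoted guidance.

```python
SEPARATOR = "\n\n###\n\n"  # separator suggested in the quoted guidance
STOP_SEQUENCE = "\n"       # one of the stop sequences mentioned above


def make_example(text: str, label: str) -> dict:
    """Build one training record that follows the quoted conventions."""
    if SEPARATOR in text:
        raise ValueError("separator must not appear inside the prompt text")
    if STOP_SEQUENCE in label:
        raise ValueError("stop sequence must not appear inside the completion")
    return {
        "prompt": text + SEPARATOR,
        # Leading whitespace, then the label, then the stop sequence.
        "completion": " " + label + STOP_SEQUENCE,
    }


def truncate(raw_completion: str) -> str:
    """Cut a raw model output at the first stop sequence and strip whitespace."""
    return raw_completion.split(STOP_SEQUENCE, 1)[0].strip()


example = make_example("Is water polar?", "yes")
print(repr(example["prompt"]))
print(repr(example["completion"]))
print(truncate(" yes\nextra tokens"))
```

At inference time the same separator would be appended to the prompt, and the same stop sequence would be used to truncate the output, as truncate does here.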
- sanitize_smiles(smi)
Return a canonical SMILES representation of smi.
- Parameters:
smi (string) – SMILES string to be canonicalized.
- Returns:
mol (rdkit.Chem.rdchem.Mol) – RDKit mol object (None if smi is not a valid SMILES string).
smi_canon (string) – Canonicalized SMILES representation of smi (None if smi is not a valid SMILES string).
conversion_successful (bool) – True if the conversion succeeded, False otherwise.
- mutate_selfie(selfie, max_molecules_len, write_fail_cases=False)
Return a mutated SELFIES string (only one mutation is performed on the input).
Mutations are attempted until a valid molecule is obtained. Mutation rule: with 50% probability each, either:
Add a random SELFIES character to the string
Replace a random SELFIES character with another
- Parameters:
selfie (string) – SELFIES string to be mutated.
max_molecules_len (int) – Mutations of the SELFIES string are allowed up to this length.
write_fail_cases (bool) – If True, failed mutations are recorded in "selfie_failure_cases.txt".
- Returns:
selfie_mutated (string) – Mutated SELFIES string.
smiles_canon (string) – Canonical SMILES of the mutated SELFIES string.
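The insert-or-replace rule described for mutate_selfie can be sketched on a list of SELFIES symbols. This is a simplified illustration, not the library's implementation: the alphabet is a hypothetical toy set, the name mutate_once is made up, and the validity check (retrying until a valid molecule is obtained) is deliberately left out because it would require a chemistry toolkit.

```python
import random
from typing import List

# Hypothetical toy alphabet; a real run would draw from the full SELFIES alphabet.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[F]"]


def mutate_once(symbols: List[str], max_len: int, rng: random.Random) -> List[str]:
    """Apply one random insertion or replacement to a list of SELFIES symbols."""
    if not symbols:
        raise ValueError("expected at least one symbol")
    mutated = list(symbols)  # work on a copy so the input stays unchanged
    new_symbol = rng.choice(ALPHABET)
    # Insert with 50% probability, but only while below the length cap.
    if rng.random() < 0.5 and len(mutated) < max_len:
        position = rng.randint(0, len(mutated))  # inclusive, so appending is possible
        mutated.insert(position, new_symbol)
    else:
        position = rng.randrange(len(mutated))
        mutated[position] = new_symbol
    return mutated


rng = random.Random(0)
print(mutate_once(["[C]", "[C]", "[O]"], max_len=4, rng=rng))
```

Passing the random generator explicitly keeps the sketch reproducible; a replacement may pick the symbol that is already present, in which case the string is unchanged.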
- get_selfie_chars(selfie)
Obtain a list of all SELFIES characters in the string selfie.
- Parameters:
selfie (string) – A SELFIES string representing a molecule.
Example
>>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']
- Returns:
chars_selfie (list) – List of SELFIES characters present in the molecule selfie.
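One way such a split could work is to scan for closing brackets. The function below is a hypothetical stand-in written for illustration (the name split_selfies_symbols is not part of the documented API), assuming every symbol is enclosed in square brackets.

```python
from typing import List


def split_selfies_symbols(selfie: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    symbols = []
    start = 0
    while start < len(selfie):
        end = selfie.find("]", start)
        if end == -1 or selfie[start] != "[":
            raise ValueError(f"malformed SELFIES string near position {start}")
        symbols.append(selfie[start : end + 1])
        start = end + 1
    return symbols


print(split_selfies_symbols("[C][=C][C][=C][C][=C][Ring1][Branch1_1]"))
```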
- class ForwardFormatter
Convert a dataframe to a dataframe of prompts and completions for classification or regression.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> ""
suffix -> "?"
end_prompt -> "###"
start_completion -> " "
stop_sequence -> "@@@"
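To see what the default templates produce, they can be filled in with plain string formatting. The sketch below copies the templates and defaults listed above; the helper name render and the example inputs are assumptions made for illustration.

```python
PROMPT_TEMPLATE = "{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}"
COMPLETION_TEMPLATE = "{start_completion}{label}{stop_sequence}"

# Default string replacements as listed above.
DEFAULTS = {
    "prefix": "",
    "suffix": "?",
    "end_prompt": "###",
    "start_completion": " ",
    "stop_sequence": "@@@",
}


def render(propertyname: str, representation: str, label) -> dict:
    """Fill both templates with the default replacements."""
    fields = dict(DEFAULTS, propertyname=propertyname, representation=representation, label=label)
    return {
        "prompt": PROMPT_TEMPLATE.format(**fields),
        "completion": COMPLETION_TEMPLATE.format(**fields),
    }


record = render("solubility", "CCO", 1)
print(record["prompt"])      # What is the solubility of CCO?###
print(record["completion"])  # " 1@@@" (note the leading space)
```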
- class ClassificationFormatter(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')
Convert a dataframe to a dataframe of prompts and completions for classification.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> ""
suffix -> "?"
end_prompt -> "###"
start_completion -> " "
stop_sequence -> "@@@"
We map classes to integers, following the advice from OpenAI’s documentation:
From the OpenAI Docs:
Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.
Initialize a ClassificationFormatter.
- Parameters:
representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – The name of the representation (e.g. "SMILES").
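The qcut flag presumably selects between quantile-based and equal-width binning, in the sense of pandas.qcut and pandas.cut. That mapping is an assumption, since this page does not show the implementation. The difference between the two strategies can be seen on a small, skewed series:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 10.0])

# Quantile-based bins: each class receives roughly the same number of rows.
quantile_classes = pd.qcut(values, q=2, labels=False)

# Equal-width bins: the value range is split evenly, so classes can be imbalanced.
width_classes = pd.cut(values, bins=2, labels=False)

print(quantile_classes.tolist())  # [0, 0, 1, 1]
print(width_classes.tolist())     # [0, 0, 0, 1]
```

With labels=False both functions return integer class codes, which fits the advice to use classes that map to a single token.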
- format_many(df)
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
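The behavior described for format_many can be approximated as follows. This is a sketch under stated assumptions, not the library's code: the function name format_many_sketch and the column names are hypothetical, and the default prompt and completion templates are hard-coded.

```python
import pandas as pd


def format_many_sketch(df, representation_column, label_column, property_name):
    """Drop incomplete rows, then build prompt and completion columns."""
    clean = df.dropna(subset=[representation_column, label_column])
    prompts = [
        f"What is the {property_name} of {rep}?###" for rep in clean[representation_column]
    ]
    completions = [f" {label}@@@" for label in clean[label_column]]
    return pd.DataFrame({"prompt": prompts, "completion": completions})


df = pd.DataFrame({"smiles": ["CCO", None, "CCN"], "label": [0, 1, 1]})
formatted = format_many_sketch(df, "smiles", "label", "solubility class")
print(formatted)
```

The row with the missing representation is dropped, so the result has two rows.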
- class ClassifictionFormatterWithExamples(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')
Initialize a ClassificationFormatter.
- Parameters:
representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – The name of the representation (e.g. "SMILES").
- format_many(df)
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
- class RegressionFormatter(representation_column, label_column, property_name, num_digits=2)
Convert a dataframe to a dataframe of prompts and completions for regression.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> ""
suffix -> "?"
end_prompt -> "###"
start_completion -> " "
stop_sequence -> "@@@"
Initialize a RegressionFormatter.
- Parameters:
- format_many(df)
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
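The num_digits argument of RegressionFormatter suggests that numeric labels are written with a fixed precision. How the library rounds is not stated on this page, so the following is only a sketch that assumes fixed-point formatting; both helper names are hypothetical.

```python
def format_regression_label(value: float, num_digits: int = 2) -> str:
    """Render a numeric label with a fixed number of decimal places."""
    return f" {value:.{num_digits}f}@@@"


def parse_regression_completion(completion: str) -> float:
    """Recover the number from a completion by cutting at the stop sequence."""
    return float(completion.split("@@@", 1)[0])


text = format_regression_label(3.14159)
print(repr(text))                         # ' 3.14@@@'
print(parse_regression_completion(text))  # 3.14
```

A fixed number of digits keeps the completions short and uniform, at the cost of limiting the resolution of the predicted values.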
Querier
- class Querier(modelname, max_tokens=10)
Wrapper around the OpenAI API for querying a model for completions.
This class tries to be as efficient as possible by querying the API in batches. It also handles the rate limiting of the API.
Example
>>> querier = Querier("ada")
>>> df = pd.DataFrame({"prompt": ["This is a test", "This is another test"]})
>>> completions = querier.query(df)
>>> assert len(completions) == 2
>>> assert all([isinstance(c, str) for c in completions])
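Batched querying with rate-limit handling, as described for Querier, is often implemented with chunking plus exponential backoff. The sketch below shows that general pattern and is not the class's actual code: RateLimitError, query_in_batches, send_batch, and all default values are hypothetical.

```python
import time
from typing import Callable, List


class RateLimitError(Exception):
    """Hypothetical stand-in for the API client's rate-limit exception."""


def query_in_batches(
    prompts: List[str],
    send_batch: Callable[[List[str]], List[str]],
    batch_size: int = 20,
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> List[str]:
    """Send prompts in fixed-size batches, backing off when rate limited."""
    completions: List[str] = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start : start + batch_size]
        for attempt in range(max_retries):
            try:
                completions.extend(send_batch(batch))
                break
            except RateLimitError:
                # Wait 1x, 2x, 4x, ... the base delay before retrying.
                time.sleep(base_delay * 2**attempt)
        else:
            raise RuntimeError(f"batch starting at {start} failed after {max_retries} retries")
    return completions
```

Here send_batch stands for whatever function performs the API call for a list of prompts. Injecting it as an argument makes the retry logic testable without network access.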
- classmethod from_preset(modelname, preset='classification')
Factory method to create a Querier from a preset.
These presets set the max_tokens parameter to a value that is appropriate for the task.
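A preset-based factory of this kind can be sketched with a lookup table. The class QuerierSketch, the preset names other than "classification", and the max_tokens values are all hypothetical, since this page does not list the actual presets.

```python
# Hypothetical preset values; the library's real numbers are not stated on this page.
PRESET_MAX_TOKENS = {"classification": 1, "regression": 10}


class QuerierSketch:
    """Minimal stand-in that only stores the model name and token limit."""

    def __init__(self, modelname: str, max_tokens: int = 10):
        self.modelname = modelname
        self.max_tokens = max_tokens

    @classmethod
    def from_preset(cls, modelname: str, preset: str = "classification") -> "QuerierSketch":
        """Look up max_tokens for the task and build an instance with it."""
        if preset not in PRESET_MAX_TOKENS:
            raise ValueError(f"unknown preset: {preset!r}")
        return cls(modelname, max_tokens=PRESET_MAX_TOKENS[preset])


querier = QuerierSketch.from_preset("ada")
print(querier.max_tokens)
```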
- query(df, temperature=0, logprobs=None)
Query the model for completions.
- Parameters:
- Raises:
ValueError – If df is not a pandas DataFrame
ValueError – If df does not have a column named “prompt”
AssertionError – If temperature is < 0
- Returns:
Dictionary containing the completions and logprobs
- Return type:
Tuner
Extractor
- class FewShotClassificationExtractor
Extract integers from completions of few-shot classification tasks.
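Extracting an integer class label from a raw completion can be done with a regular expression. The function below is a hypothetical illustration of that idea, not the extractor's actual logic; it returns the first integer it finds.

```python
import re
from typing import Optional

_INTEGER = re.compile(r"-?\d+")


def extract_first_integer(completion: str) -> Optional[int]:
    """Return the first integer found in a completion, or None if there is none."""
    match = _INTEGER.search(completion)
    return int(match.group()) if match else None


print(extract_first_integer(" 3@@@"))      # 3
print(extract_first_integer("no digits"))  # None
```

Returning None instead of raising lets the caller decide how to treat completions that contain no usable label.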
Evaluator
Data
- get_photoswitch_data()
Return the photoswitch data as a pandas DataFrame.
- Return type:
References
[GriffithsPhotoSwitches] Griffiths, K.; Halcovitch, N. R.; Griffin, J. M. Efficient Solid-State Photoswitching of Methoxyazobenzene in a Metal–Organic Framework for Thermal Energy Storage. Chemical Science 2022, 13 (10), 3014–3019.
- get_polymer_data()
Return the dataset reported in [JablonkaAL].
- Return type:
- get_moosavi_mof_data()
Return the data and features used in [MoosaviDiversity].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF.
- Return type:
- get_moosavi_cv_data()
Return the gravimetric heat capacity used in [MoosaviCp].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.
- Return type:
- get_moosavi_pcv_data()
Return the site-projected heat capacity and features used in [MoosaviCp].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.
- Return type:
- get_qmug_data()
Return the data and features used in [QMUG].
We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.
- Return type:
- get_qmug_small_data()
Return the data and features used in [QMUG] for the subset of short SMILES.
We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.
- Return type:
- get_opv_data()
Return the dataset reported in [NagasawaOPV].
- Return type:
- get_freesolv_data()
Return the FreeSolv data [freesolv].
- Return type:
- get_lipophilicity_data()
Return the Lipophilicity data parsed from ChEMBL [chembl].
- Return type:
- get_matbench_is_metal()
Return the is metal dataset from matbench [matbench].
- get_matbench_expt_gap()
Return the experimental band gap dataset from matbench [matbench].
- get_matbench_steels()
Return the steel yield strength dataset from matbench [matbench].
- get_water_stability()
Return the water stability dataset used in [waterStability].