API Documentation#

Classifier#

class GPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, save_valid_file=False)[source]#

Wrapper around GPT-3 fine tuning in the style of a scikit-learn classifier.

Initialize a GPTClassifier.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
save_valid_file (bool, optional) – Whether to save the validation file. Defaults to False.

fit(X, y)[source]#

Fine tune a GPT-3 model on a dataset.

Parameters:

X (ArrayLike) – Input data (typically array of molecular representations)
y (ArrayLike) – Target data (typically array of property values)

Return type:

None

predict(X)[source]#

Predict property values for a set of molecular representations.

Parameters:: X (ArrayLike) – Input data (typically array of molecular representations)
Returns:: Predicted property values
Return type:: ArrayLike
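The fit/predict workflow described above might look roughly like the sketch below. The module paths `gptchem.gpt_classifier` and `gptchem.tuner` are assumptions (they are not stated on this page), and the SMILES strings and labels are made-up toy data. The imports sit inside the function because running it triggers a real fine-tuning job through the OpenAI API.

```python
def run_classifier_example():
    """Fine-tune and query a GPTClassifier (calls the OpenAI API when executed)."""
    # Assumed module paths; adjust them if your installation differs.
    from gptchem.gpt_classifier import GPTClassifier
    from gptchem.tuner import Tuner

    classifier = GPTClassifier(
        property_name="transition wavelength",  # becomes part of the prompt
        tuner=Tuner(n_epochs=8, learning_rate_multiplier=0.02, wandb_sync=False),
    )
    # X: molecular representations (toy SMILES); y: integer class labels.
    classifier.fit(["CC", "CCO", "CCCC", "c1ccccc1"], [0, 0, 1, 1])
    # Returns one predicted class per input representation.
    return classifier.predict(["CCC", "c1ccccc1O"])
```

Nothing is sent to the API until `run_classifier_example()` is actually called.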

class NGramGPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, count_vectorizer=None, ngram_model=None)[source]#

Add the predictions of an N-gram model to the prompt. Empirically, this tends to degrade performance.

Initialize a GPTClassifier.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().

fit(X, y)[source]#

Fine tune a GPT-3 model on a dataset.

Parameters:

X (ArrayLike) – Input data (typically array of molecular representations)
y (ArrayLike) – Target data (typically array of property values)

Return type:

None

predict(X)[source]#

Predict property values for a set of molecular representations.

Parameters:: X (ArrayLike) – Input data (typically array of molecular representations)
Returns:: Predicted property values
Return type:: ArrayLike

class DifficultNGramClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, count_vectorizer=None, ngram_model=None)[source]#

Highlight cases an N-Gram model struggles with.

Initialize a GPTClassifier.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().

fit(X, y)[source]#

Fine tune a GPT-3 model on a dataset.

Parameters:

X (ArrayLike) – Input data (typically array of molecular representations)
y (ArrayLike) – Target data (typically array of property values)

Return type:

None

predict(X)[source]#

Predict property values for a set of molecular representations.

Parameters:: X (ArrayLike) – Input data (typically array of molecular representations)
Returns:: Predicted property values
Return type:: ArrayLike

class MultiRepGPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, rep_names=None)[source]#

GPT Classifier trained on multiple representations.

Initialize a GPTClassifier.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
save_valid_file (bool, optional) – Whether to save the validation file. Defaults to False.

predict(X, return_std=False)[source]#

Predict property values for a set of molecular representations.

Parameters:: X (ArrayLike) – Input data (typically array of molecular representations)
Returns:: Predicted property values
Return type:: ArrayLike

Regressor#

class GPTRegressor(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.RegressionExtractor>)[source]#

Wrapper around GPT-3 fine tuning in the style of a scikit-learn regressor.

Initialize a GPTRegressor.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (RegressionExtractor, optional) – Callable object that can extract floats from the completions produced by the querier. Defaults to RegressionExtractor().

fit(X, y)[source]#

Fine tune a GPT-3 model on a dataset.

Parameters:

X (ArrayLike) – Array of molecular representations.
y (ArrayLike) – Array of property values.

Return type:

None

predict(X)[source]#

Predict property values for a set of molecular representations.

Parameters:: X (ArrayLike) – Array of molecular representations.
Returns:: Predicted property values
Return type:: ArrayLike

class BinnedGPTRegressor(property_name, tuner, querier_settings=None, desired_accuracy=0.1, equal_bin_sizes=False, extractor=<gptchem.extractor.ClassificationExtractor>)[source]#

Wrapper around GPT-3 for “regression” by binning the property values into sufficiently many bins.

The predicted property values are the bin centers.

Initialize a BinnedGPTRegressor.

Parameters:

property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
desired_accuracy (float, optional) – Desired accuracy of the binning. Defaults to 0.1.
equal_bin_sizes (bool, optional) – Whether to use equal bin sizes. If False, the bin sizes are chosen such that the number of samples in each bin is approximately equal. Defaults to False.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers (the predicted bin indices) from the completions produced by the querier. Defaults to ClassificationExtractor().

fit(X, y)[source]#

Fine tune a GPT-3 model on a dataset.

Parameters:

X (ArrayLike) – Array of molecular representations.
y (ArrayLike) – Array of property values.

Return type:

None

bin_indices_to_ranges(predicted_bin_indices)[source]#

Convert a list of predicted bin indices to a list of bin ranges.

The bin edges are taken from self.formatter.bins.

Parameters:: predicted_bin_indices (ArrayLike) – List of predicted bin indices
Returns:: List of bin range tuples
Return type:: ArrayLike
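As a rough illustration of the index-to-range mapping, the sketch below assumes that bin i spans the interval between edge i and edge i + 1 and that a bin center is the midpoint of that interval. The helper names and the edge values are hypothetical and do not come from the library.

```python
from typing import List, Sequence, Tuple


def indices_to_ranges(
    indices: Sequence[int], edges: Sequence[float]
) -> List[Tuple[float, float]]:
    """Map each bin index i to the interval (edges[i], edges[i + 1])."""
    return [(edges[i], edges[i + 1]) for i in indices]


def range_centers(ranges: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the midpoint of every (low, high) interval."""
    return [(low + high) / 2 for low, high in ranges]


edges = [0.0, 1.0, 2.5, 4.0]  # three bins, made-up edges
ranges = indices_to_ranges([0, 2, 1], edges)
centers = range_centers(ranges)
```

With these edges, a predicted bin index of 2 corresponds to the range (2.5, 4.0) and the center 3.25.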

predict(X, remap=True)[source]#

Predict property values for a set of molecular representations.

Parameters:

X (ArrayLike) – Array of molecular representations.
remap (bool, optional) – Whether to remap the predicted bin indices to property values (per the class description, the bin centers). Defaults to True.

Returns:

Predicted property values

Return type:

ArrayLike

Formatter#

From the OpenAI Docs:

To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt. Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace. Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion. For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
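The quoted conventions can be sketched in plain Python. The separator is the one suggested in the quote, the newline stop sequence is one of the options it lists, and the helper names are hypothetical.

```python
SEPARATOR = "\n\n###\n\n"  # ends every prompt
STOP_SEQUENCE = "\n"       # ends every completion


def make_example(text: str, label: str) -> dict:
    """Build one prompt/completion training pair."""
    if SEPARATOR in text:
        raise ValueError("the separator must not appear inside the prompt text")
    if STOP_SEQUENCE in label:
        raise ValueError("the stop sequence must not appear inside the label")
    # The completion starts with a whitespace and ends with the stop sequence.
    return {"prompt": text + SEPARATOR, "completion": " " + label + STOP_SEQUENCE}


def truncate_completion(raw: str) -> str:
    """Keep the text before the first stop sequence and drop the leading space."""
    return raw.split(STOP_SEQUENCE, 1)[0].strip()


example = make_example("What is the solubility of CCO?", "1")
```

At inference time the same separator is appended to the prompt, and the raw model output is cut at the stop sequence, as `truncate_completion` does.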

sanitize_smiles(smi)[source]#

Return a canonical SMILES representation of smi.

Parameters:

smi (str) – SMILES string to be canonicalized.

Returns:

mol (rdkit.Chem.rdchem.Mol) – RDKit mol object (None if smi is an invalid SMILES string).
smi_canon (str) – Canonicalized SMILES representation of smi (None if smi is an invalid SMILES string).
conversion_successful (bool) – True if the conversion was successful, False otherwise.

mutate_selfie(selfie, max_molecules_len, write_fail_cases=False)[source]#

Return a mutated SELFIES string (only one mutation is performed on the input).

Mutations are attempted until a valid molecule is obtained. Rules of mutation: with a 50% probability, either:

Add a random SELFIES character to the string

Replace a random SELFIES character with another

Parameters:

selfie (str) – SELFIES string to be mutated.
max_molecules_len (int) – Mutations of the SELFIES string are allowed up to this length.
write_fail_cases (bool) – If True, failed mutations are recorded in “selfie_failure_cases.txt”.

Returns:

selfie_mutated (str) – Mutated SELFIES string.
smiles_canon (str) – Canonical SMILES of the mutated SELFIES string.
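The insert-or-replace rule can be sketched without any chemistry toolkit. This is not the library's implementation: the alphabet is a made-up toy set, the validity check and the retry loop are omitted, and reading max_molecules_len as "never insert beyond this many symbols" is an assumption.

```python
import random
import re

# Hypothetical toy alphabet; a real one would come from a SELFIES library.
ALPHABET = ["[C]", "[=C]", "[N]", "[O]", "[F]"]


def mutate_once(selfie: str, max_len: int, rng: random.Random) -> str:
    """Apply one random insertion or replacement to a SELFIES string."""
    tokens = re.findall(r"\[[^\[\]]*\]", selfie)
    if not tokens:
        raise ValueError("expected at least one bracketed SELFIES symbol")
    if rng.random() < 0.5 and len(tokens) < max_len:
        # Insert a random symbol at a random position (including both ends).
        position = rng.randint(0, len(tokens))
        tokens.insert(position, rng.choice(ALPHABET))
    else:
        # Replace a random symbol with a different one; this branch is also
        # taken whenever the string has already reached max_len.
        position = rng.randrange(len(tokens))
        choices = [symbol for symbol in ALPHABET if symbol != tokens[position]]
        tokens[position] = rng.choice(choices)
    return "".join(tokens)


rng = random.Random(0)
mutated = mutate_once("[C][=C][C]", max_len=5, rng=rng)
```

A complete version would decode each candidate to a molecule and repeat the mutation until the result is valid, as described above.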

get_selfie_chars(selfie)[source]#

Obtain a list of all SELFIES characters in the string selfie.

Parameters:

selfie (str) – A SELFIES string representing a molecule.

Returns:

chars_selfie (list) – List of SELFIES characters present in the molecule selfie.

Example

>>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']
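A minimal way to obtain such a list, assuming every SELFIES character is a bracketed symbol, is a regular expression over the brackets. The function name below is hypothetical and the library may split the string differently.

```python
import re
from typing import List


def split_selfies(selfie: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols, in order."""
    return re.findall(r"\[[^\[\]]*\]", selfie)


chars = split_selfies("[C][=C][C][=C][C][=C][Ring1][Branch1_1]")
```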

class ForwardFormatter[source]#

Convert a dataframe to a dataframe of prompts and completions for classification or regression.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”
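To make the templates concrete, the sketch below fills them with the default replacements listed above using plain `str.format`. It only illustrates the resulting strings; the function name is hypothetical and this is not the formatter's own code.

```python
PROMPT_TEMPLATE = "{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}"
COMPLETION_TEMPLATE = "{start_completion}{label}{stop_sequence}"

DEFAULTS = {
    "prefix": "",
    "suffix": "?",
    "end_prompt": "###",
    "start_completion": " ",
    "stop_sequence": "@@@",
}


def format_example(representation: str, label, propertyname: str) -> dict:
    """Fill both templates for one representation/label pair."""
    # str.format ignores keyword arguments that a template does not use.
    prompt = PROMPT_TEMPLATE.format(
        propertyname=propertyname, representation=representation, **DEFAULTS
    )
    completion = COMPLETION_TEMPLATE.format(label=label, **DEFAULTS)
    return {"prompt": prompt, "completion": completion}


row = format_example("CCO", 1, "solubility")
```

Here `row["prompt"]` is `What is the solubility of CCO?###` and `row["completion"]` is a space, the label `1`, and the stop sequence `@@@`.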

class ClassificationFormatter(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#

Convert a dataframe to a dataframe of prompts and completions for classification.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”

We map classes to integers, following the advice from the OpenAI Docs:

“Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.”

Initialize a ClassificationFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – Name of the representation (e.g. "SMILES").

property class_names: List[int]#: Names of the classes.

bin(y)[source]#: Bin the inputs based on the bins used for the dataset.
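The difference between the two binning modes selected by qcut can be sketched with the standard library: quantile binning aims for equally populated classes, equal-width binning for equally wide value ranges. This mirrors the idea behind pandas `qcut` and `cut` but is a simplified stand-in with hypothetical function names, not the formatter's code.

```python
import statistics
from bisect import bisect_left
from typing import List, Sequence


def equal_width_classes(values: Sequence[float], num_classes: int) -> List[int]:
    """Assign classes using bins of equal width between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes
    inner_edges = [low + k * width for k in range(1, num_classes)]
    # min() keeps the class index in range if float rounding nudges an edge.
    return [min(bisect_left(inner_edges, v), num_classes - 1) for v in values]


def quantile_classes(values: Sequence[float], num_classes: int) -> List[int]:
    """Assign classes using quantile edges, so bins hold similar counts."""
    inner_edges = statistics.quantiles(values, n=num_classes, method="inclusive")
    return [bisect_left(inner_edges, v) for v in values]


values = [0.1, 0.2, 0.3, 0.4, 0.5, 9.0]  # made-up, skewed labels
by_width = equal_width_classes(values, 3)
by_quantile = quantile_classes(values, 3)
```

On this skewed toy data the equal-width bins put five of six values into class 0, while the quantile bins yield two values per class, which is why quantile binning tends to give more balanced classes.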

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters:: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns:: A dataframe with a prompt column and a completion column.
Return type:: pd.DataFrame

class ClassifictionFormatterWithExamples(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#

Initialize a ClassificationFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – Name of the representation (e.g. "SMILES").

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters:: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns:: A dataframe with a prompt column and a completion column.
Return type:: pd.DataFrame

class RegressionFormatter(representation_column, label_column, property_name, num_digits=2)[source]#

Convert a dataframe to a dataframe of prompts and completions for regression.

The default prompt template is:

{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}

The default completion template is:

{start_completion}{label}{stop_sequence}

By default, the following string replacements are made:

prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ ”
stop_sequence -> “@@@”

Initialize a RegressionFormatter.

Parameters:

representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_digits (int) – The number of digits to round the label to.

format_many(df)[source]#

Format a dataframe of representations and labels into a dataframe of prompts and completions.

This function will drop rows with missing values in the representation or label columns.

Parameters:: df (pd.DataFrame) – A dataframe with a representation column and a label column.
Returns:: A dataframe with a prompt column and a completion column.
Return type:: pd.DataFrame
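The effect of num_digits can be illustrated with fixed-point string formatting. Whether the formatter rounds in exactly this way is an assumption, and the helper name is hypothetical. The point is that the label written into the completion carries a bounded rounding error.

```python
def format_label(value: float, num_digits: int = 2) -> str:
    """Render a numeric label with a fixed number of decimal digits."""
    return f"{value:.{num_digits}f}"


label = format_label(3.14159)          # two digits by default
completion = " " + label + "@@@"       # default start_completion and stop_sequence
error = abs(float(label) - 3.14159)    # information lost by rounding
```

With two digits the round-trip error stays at or below 0.005, so num_digits trades completion length against label resolution.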

class InverseFormatter[source]#: From the OpenAI Docs:

Using Lower learning rate and only 1-2 epochs tends to work better for these use cases

Querier#

class Querier(modelname, max_tokens=10)[source]#

Wrapper around the OpenAI API for querying a model for completions.

This class tries to be as efficient as possible by querying the API in batches. It also handles the rate limiting of the API.

Example

>>> import pandas as pd
>>> querier = Querier("ada")
>>> df = pd.DataFrame({"prompt": ["This is a test", "This is another test"]})
>>> completions = querier.query(df)
>>> assert len(completions) == 2
>>> assert all([isinstance(c, str) for c in completions])
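Batching and rate-limit handling, as mentioned in the class description, could be sketched as follows. This is a generic pattern and not the Querier's actual code: `RateLimitError` is a stand-in exception, and the batch size, retry count and delays are arbitrary illustrative values.

```python
import time
from typing import Callable, Iterator, List, Sequence


class RateLimitError(Exception):
    """Stand-in for the error an API client raises when it is rate limited."""


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def call_with_backoff(
    fn: Callable[[], object],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fn, waiting 1x, 2x, 4x, ... base_delay after rate-limit errors."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


prompt_batches = list(batched(["p1", "p2", "p3", "p4", "p5"], batch_size=2))
```

Each batch would then be sent in a single request wrapped in `call_with_backoff`, so that a rate-limit response delays the batch rather than failing the whole query.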

classmethod from_preset(modelname, preset='classification')[source]#

Factory method to create a Querier from a preset.

These presets set the max_tokens parameter to a value that is appropriate for the task.

query(df, temperature=0, logprobs=None)[source]#

Query the model for completions.

Parameters:

df (pd.DataFrame) – DataFrame containing a column named “prompt”
temperature (float) – Temperature of the softmax. Defaults to 0.
logprobs (Optional[int]) – The number of logprobs to return. For classification, set it to the number of classes. Defaults to None.

Raises:

ValueError – If df is not a pandas DataFrame
ValueError – If df does not have a column named “prompt”
AssertionError – If temperature is < 0

Returns:

Dictionary containing the completions and logprobs

Return type:

dict
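When logprobs are requested for classification, they can be turned into class probabilities. The sketch below assumes the log-probabilities of the first completion token are available as a mapping from token string to log-probability, as in the legacy OpenAI completions format. The exact layout of the dictionary returned by query is not specified here, so the input shape and the function name are assumptions.

```python
import math
from typing import Dict


def class_probabilities(token_logprobs: Dict[str, float]) -> Dict[int, float]:
    """Convert token log-probabilities into normalized class probabilities."""
    probs: Dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        label = token.strip()
        if label.isdigit():  # ignore tokens that are not integer class labels
            probs[int(label)] = probs.get(int(label), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no integer class tokens found")
    # Renormalize over the class tokens only.
    return {label: p / total for label, p in probs.items()}


first_token = {" 0": math.log(0.6), " 1": math.log(0.3), "\n": math.log(0.1)}
probabilities = class_probabilities(first_token)
```

Renormalizing over the class tokens discards probability mass the model placed on non-class tokens, which is a modeling choice rather than a requirement.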

Tuner#

class Tuner(base_model='ada', batch_size=None, n_epochs=4, learning_rate_multiplier=None, outdir=None, run_name=None, wandb_sync=True, write_summary=True)[source]#

Wrapper around the OpenAI API for fine tuning.

Initialize a Tuner.

Parameters:

base_model (str) – The base model to fine tune. Defaults to “ada”.
batch_size (Optional[int]) – The batch size to use for fine tuning. Defaults to None.
n_epochs (int) – The number of epochs to fine tune for. Defaults to 4.
learning_rate_multiplier (Optional[float]) – The learning rate multiplier to use for fine tuning. The OpenAI docs state “We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results.” Defaults to None.
outdir (Union[str, Path, None]) – The directory to save the fine tuning results to. If not specified, a directory will be created in BASE_OUTDIR
run_name (Optional[str]) – The name of the run. This is used to create the output directory.
wandb_sync (bool) – Whether to sync the results to Weights & Biases.
write_summary (bool) – Whether to write a summary of the fine tuning run to a file. Defaults to True.

tune(train_df, validation_df=None)[source]#

Fine tune a model on a dataset.

Parameters:

train_df (pd.DataFrame) – Training dataset.
validation_df (pd.DataFrame, optional) – Validation dataset. Defaults to None.

Returns:

Summary of the fine tuning run.

Return type:

dict

Raises:

ValueError – If no training dataset is provided.
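The legacy OpenAI fine-tuning endpoint takes training data as JSON Lines, one prompt/completion object per line. How Tuner serializes train_df is not described on this page, so the following is only a sketch of that file format using the standard library. The records are made-up examples and the helper name is hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonl(records: List[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


records = [
    {"prompt": "What is the solubility of CCO?###", "completion": " 1@@@"},
    {"prompt": "What is the solubility of c1ccccc1?###", "completion": " 0@@@"},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)

# Read the file back to confirm that each line is an independent JSON object.
with open(path, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]
```

For a DataFrame with prompt and completion columns, pandas `df.to_json(path, orient="records", lines=True)` produces the same layout.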

Extractor#

class ClassificationExtractor[source]#: Extract integers from completions of classification tasks.

class FewShotClassificationExtractor[source]#: Extract integers from completions of few-shot classification tasks.

class FewShotRegressionExtractor[source]#: Extract floats from completions of few-shot regression tasks.

class RegressionExtractor[source]#: Extract floats from completions of regression tasks.

class InverseExtractor[source]#: Extract strings from completions of inverse tasks.

class SolventExtractor[source]#: Extract solvent name and composition from completions of solvent tasks.
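What such extractors do can be approximated with regular expressions. The sketch below is not the library's implementation: the function names are hypothetical, and returning None for unparseable completions is an assumption about how failures might be signaled.

```python
import re
from typing import Optional

INT_PATTERN = re.compile(r"-?\d+")
FLOAT_PATTERN = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_int(completion: str) -> Optional[int]:
    """Return the first integer in the completion, or None if there is none."""
    match = INT_PATTERN.search(completion)
    return int(match.group()) if match else None


def extract_float(completion: str) -> Optional[float]:
    """Return the first decimal number in the completion, or None."""
    match = FLOAT_PATTERN.search(completion)
    return float(match.group()) if match else None


predicted_class = extract_int(" 2@@@")
predicted_value = extract_float(" 3.14@@@")
```

Because a fine-tuned model can still emit malformed text, callers would need to handle the None case, for example by dropping or flagging those predictions.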

Evaluator#

Data#

get_photoswitch_data()[source]#

Return the photoswitch data as a pandas DataFrame.

Return type:: DataFrame

References

[GriffithsPhotoSwitches] Griffiths, K.; Halcovitch, N. R.; Griffin, J. M. Efficient Solid-State Photoswitching of Methoxyazobenzene in a Metal–Organic Framework for Thermal Energy Storage. Chemical Science 2022, 13 (10), 3014–3019.

get_polymer_data()[source]#

Return the dataset reported in [JablonkaAL].

Return type:: DataFrame

get_moosavi_mof_data()[source]#

Return the data and features used in [MoosaviDiversity].

You can find the original datasets on MaterialsCloud archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF.

Return type:: DataFrame

get_moosavi_cv_data()[source]#

Return the gravimetric heat capacity used in [MoosaviCp].

You can find the original datasets on MaterialsCloud archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.

Return type:: DataFrame

get_moosavi_pcv_data()[source]#

Return the site-projected heat capacity and features used in [MoosaviCp].

You can find the original datasets on MaterialsCloud archive.

We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.

Return type:: DataFrame

get_qmug_data()[source]#

Return the data and features used in [QMUG].

We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.

Return type:: DataFrame

get_qmug_small_data()[source]#

Return the data and features used in [QMUG], restricted to the subset of short SMILES.

We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.

Return type:: DataFrame

get_hea_phase_data()[source]#

Return the dataset reported in [Pei].

Return type:: DataFrame

get_opv_data()[source]#

Return the dataset reported in [NagasawaOPV]

Return type:: DataFrame

get_esol_data()[source]#

Return the dataset reported in [ESOL]

Return type:: DataFrame

get_solubility_test_data()[source]#

Return the dataset reported in [soltest]

Return type:: DataFrame

get_doyle_rxn_data()[source]#

Return the reaction dataset reported in [Doyle]

Return type:: DataFrame

get_suzuki_rxn_data()[source]#

Return the reaction dataset reported in [Suzuki]

Return type:: DataFrame

get_freesolv_data()[source]#

Return the FreeSolv data [freesolv]

Return type:: DataFrame

get_lipophilicity_data()[source]#

Return the Lipophilicity data parsed from ChEMBL [chembl]

Return type:: DataFrame

get_mof_solvent_data()[source]#

Return the MOF reaction data []

Return type:: DataFrame

get_matbench_glass()[source]#: Return the glass formation ability dataset from matbench

get_matbench_is_metal()[source]#: Return the is metal dataset from matbench [matbench]

get_matbench_expt_gap()[source]#: Return the experimental band gap dataset from matbench [matbench]

get_matbench_steels()[source]#: Return the steel yield strength dataset from matbench [matbench]

get_water_stability()[source]#: Return the water stability dataset used in [waterStability]