API Documentation#
Classifier#
- class GPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, save_valid_file=False)[source]#
Wrapper around GPT-3 fine tuning in the style of a scikit-learn classifier.
Initialize a GPTClassifier.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
save_valid_file (bool, optional) – Whether to save the validation file. Defaults to False.
- class NGramGPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, count_vectorizer=None, ngram_model=None)[source]#
Add the predictions of an N-gram model to the prompt. Empirically, this tends to degrade performance.
Initialize a GPTClassifier.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
- class DifficultNGramClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, count_vectorizer=None, ngram_model=None)[source]#
Highlight cases an N-Gram model struggles with.
Initialize a GPTClassifier.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
- class MultiRepGPTClassifier(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.ClassificationExtractor>, rep_names=None)[source]#
GPT classifier trained on multiple representations.
Initialize a GPTClassifier.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (ClassificationExtractor, optional) – Callable object that can extract integers from the completions produced by the querier. Defaults to ClassificationExtractor().
save_valid_file (bool, optional) – Whether to save the validation file. Defaults to False.
Regressor#
- class GPTRegressor(property_name, tuner, querier_settings=None, extractor=<gptchem.extractor.RegressionExtractor>)[source]#
Wrapper around GPT-3 fine tuning in the style of a scikit-learn regressor.
Initialize a GPTRegressor.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
extractor (RegressionExtractor, optional) – Callable object that can extract floats from the completions produced by the querier. Defaults to RegressionExtractor().
- class BinnedGPTRegressor(property_name, tuner, querier_settings=None, desired_accuracy=0.1, equal_bin_sizes=False, extractor=<gptchem.extractor.ClassificationExtractor>)[source]#
Wrapper around GPT-3 for “regression” by binning the property values into sufficiently many bins.
The predicted property values are the bin centers.
Initialize a BinnedGPTRegressor.
- Parameters:
property_name (str) – Name of the property to be predicted. This will be part of the prompt.
tuner (Tuner) – Tuner object to be used for fine tuning. This specifies the model to be used and the fine-tuning settings.
querier_settings (Optional[dict], optional) – Settings for the querier. Defaults to None.
desired_accuracy (float, optional) – Desired accuracy of the binning. Defaults to 0.1.
equal_bin_sizes (bool, optional) – Whether to use equal bin sizes. If False, the bin sizes are chosen such that the number of samples in each bin is approximately equal. Defaults to False.
extractor (ClassificationExtractor, optional) – Callable object that can extract floats from the completions produced by the querier. Defaults to ClassificationExtractor().
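The binning idea described above can be illustrated with a minimal, standard-library sketch. It is not gptchem's implementation: the helper names (`make_bin_edges`, `to_bin`, `bin_center`) and the toy values are hypothetical, and the number of bins is passed in directly because the way it is derived from desired_accuracy is not described here. The sketch only contrasts equal-width bins with roughly equal-frequency bins and decodes a bin index back to its bin center.

```python
import bisect


def make_bin_edges(values, n_bins, equal_bin_sizes):
    """Return n_bins + 1 increasing edges covering the values (assumes distinct values)."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    if equal_bin_sizes:
        # Equal-width bins: split the value range into n_bins equal intervals.
        width = (hi - lo) / n_bins
        return [lo + i * width for i in range(n_bins + 1)]
    # Roughly equal-frequency bins: take the values found at evenly spaced ranks.
    last = len(ordered) - 1
    return [ordered[round(i * last / n_bins)] for i in range(n_bins + 1)]


def to_bin(value, edges):
    """Map a value to the index of the bin containing it (the class label)."""
    index = bisect.bisect_right(edges, value) - 1
    return min(max(index, 0), len(edges) - 2)


def bin_center(index, edges):
    """Decode a predicted bin index into a property value (the bin center)."""
    return (edges[index] + edges[index + 1]) / 2


values = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.0, 10.0]
equal_width_edges = make_bin_edges(values, n_bins=2, equal_bin_sizes=True)
equal_freq_edges = make_bin_edges(values, n_bins=2, equal_bin_sizes=False)

print(equal_width_edges, [to_bin(v, equal_width_edges) for v in values])
print(equal_freq_edges, [to_bin(v, equal_freq_edges) for v in values])
print(bin_center(0, equal_width_edges), bin_center(1, equal_width_edges))
```

With skewed data such as the toy values, equal-width bins leave most samples in one class, while equal-frequency bins balance the classes at the cost of bins with very different widths.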
- fit(X, y)[source]#
Fine tune a GPT-3 model on a dataset.
- Parameters:
X (ArrayLike) – Array of molecular representations.
y (ArrayLike) – Array of property values.
- Return type:
Formatter#
From the OpenAI Docs:
To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.
Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
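The formatting rules quoted above can be sketched in a few lines of plain Python. The helper names (`make_example`, `truncate_at_stop`) and the example strings are hypothetical and belong to neither gptchem nor the OpenAI client; the separator and stop sequence are the ones suggested in the quoted text.

```python
SEPARATOR = "\n\n###\n\n"  # fixed separator marking the end of every prompt
STOP = "\n"                # fixed stop sequence marking the end of every completion


def make_example(text: str, label: str) -> dict:
    """Build one training record that follows the quoted guidelines."""
    if SEPARATOR in text:
        raise ValueError("the separator must not appear inside the prompt text")
    if STOP in label:
        raise ValueError("the stop sequence must not appear inside the completion")
    return {
        "prompt": text + SEPARATOR,
        # Leading whitespace first, stop sequence last.
        "completion": " " + label + STOP,
    }


def truncate_at_stop(raw_completion: str) -> str:
    """Cut a raw model output at the first stop sequence and strip the leading space."""
    return raw_completion.split(STOP, 1)[0].strip()


record = make_example("What is the solubility of CCO?", "high")
print(record)
print(truncate_at_stop(" high\nextra tokens"))
```

At inference time the same separator is appended to the prompt, and everything after the first stop sequence in the output is discarded.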
- sanitize_smiles(smi)[source]#
Return the canonical SMILES representation of smi.
- Parameters:
smi (str) – SMILES string to be canonicalized.
- Returns:
mol (rdkit.Chem.rdchem.Mol) – RDKit mol object (None if smi is an invalid SMILES string).
smi_canon (str) – Canonicalized SMILES representation of smi (None if smi is an invalid SMILES string).
conversion_successful (bool) – True if the conversion was successful, False otherwise.
- mutate_selfie(selfie, max_molecules_len, write_fail_cases=False)[source]#
Return a mutated SELFIES string (only one mutation is performed on the input).
Mutations are attempted until a valid molecule is obtained. Rules of mutation: with 50% probability, either:
Add a random SELFIES character to the string
Replace a random SELFIES character with another
- Parameters:
selfie (str) – SELFIES string to be mutated.
max_molecules_len (int) – Mutations of the SELFIES string are allowed up to this length.
write_fail_cases (bool) – If True, failed mutations are recorded in “selfie_failure_cases.txt”.
- Returns:
selfie_mutated (str) – Mutated SELFIES string.
smiles_canon (str) – Canonical SMILES of the mutated SELFIES string.
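The mutation rule described for mutate_selfie can be sketched without any chemistry dependencies. This is a simplified illustration, not the library's code: the function name `mutate_once` and the toy alphabet are hypothetical, and the sketch skips both the validity check (which needs a cheminformatics toolkit) and the length limit.

```python
import random
import re

# Hypothetical toy alphabet; a real implementation would draw from the SELFIES alphabet.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[F]"]


def mutate_once(selfie: str, rng: random.Random) -> str:
    """Apply a single random insertion or replacement to a SELFIES string."""
    tokens = re.findall(r"\[[^\[\]]*\]", selfie)
    if not tokens or rng.random() < 0.5:
        # Insertion: add a random token at a random position.
        position = rng.randint(0, len(tokens))
        tokens.insert(position, rng.choice(ALPHABET))
    else:
        # Replacement: swap one token for a different one.
        position = rng.randrange(len(tokens))
        candidates = [t for t in ALPHABET if t != tokens[position]]
        tokens[position] = rng.choice(candidates)
    return "".join(tokens)


rng = random.Random(0)
print(mutate_once("[C][C][O]", rng))
```

Passing a seeded `random.Random` keeps the sketch reproducible; a full version would retry until the decoded molecule is valid.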
- get_selfie_chars(selfie)[source]#
Obtain a list of all SELFIES characters in the string selfie.
- Parameters:
selfie (str) – A SELFIES string representing a molecule.
- Returns:
chars_selfie – List of SELFIES characters present in the molecule selfie.
Example
>>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']
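One way to obtain such a token list is a regular expression that matches each bracketed symbol. The sketch below is an assumption about how the splitting could be done, not the library's source; the function name `split_selfie` is hypothetical.

```python
import re


def split_selfie(selfie: str) -> list:
    """Split a SELFIES string into its bracketed symbols, in order."""
    return re.findall(r"\[[^\[\]]*\]", selfie)


chars = split_selfie("[C][=C][C][=C][C][=C][Ring1][Branch1_1]")
print(chars)
```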
- class ForwardFormatter[source]#
Convert a dataframe to a dataframe of prompts and completions for classification or regression.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ “
stop_sequence -> “@@@”
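To make the default templates concrete, the following sketch fills them with Python's `str.format`. The templates and replacement strings are copied from the description above; the function name `fill_templates` and the example inputs are hypothetical.

```python
PROMPT_TEMPLATE = "{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}"
COMPLETION_TEMPLATE = "{start_completion}{label}{stop_sequence}"

# Default replacements listed in the documentation above.
DEFAULTS = {
    "prefix": "",
    "suffix": "?",
    "end_prompt": "###",
    "start_completion": " ",
    "stop_sequence": "@@@",
}


def fill_templates(propertyname: str, representation: str, label) -> tuple:
    """Return the (prompt, completion) pair for one sample."""
    prompt = PROMPT_TEMPLATE.format(
        propertyname=propertyname, representation=representation, **DEFAULTS
    )
    completion = COMPLETION_TEMPLATE.format(label=label, **DEFAULTS)
    return prompt, completion


prompt, completion = fill_templates("solubility class", "CCO", 1)
print(repr(prompt))      # 'What is the solubility class of CCO?###'
print(repr(completion))  # ' 1@@@'
```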
- class ClassificationFormatter(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#
Convert a dataframe to a dataframe of prompts and completions for classification.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ “
stop_sequence -> “@@@”
We map classes to integers, following this advice from the OpenAI docs:
“Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.”
Initialize a ClassificationFormatter.
- Parameters:
representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – Name of the representation (e.g. "SMILES").
- format_many(df)[source]#
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
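The behaviour described for format_many (drop incomplete rows, then emit one prompt and one completion per remaining row) can be sketched with plain Python data structures. A list of dictionaries stands in for the DataFrame here, and the names `is_missing` and `format_rows` are hypothetical; this is not the library's implementation.

```python
import math


def is_missing(value) -> bool:
    """Treat None and NaN as missing, mirroring how missing cells usually appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_rows(rows, representation_column, label_column, property_name):
    """Turn rows into prompt/completion records, skipping incomplete rows."""
    formatted = []
    for row in rows:
        representation = row.get(representation_column)
        label = row.get(label_column)
        if is_missing(representation) or is_missing(label):
            continue  # drop rows with a missing representation or label
        formatted.append(
            {
                "prompt": f"What is the {property_name} of {representation}?###",
                "completion": f" {label}@@@",
            }
        )
    return formatted


rows = [
    {"smiles": "CCO", "label": 1},
    {"smiles": None, "label": 0},
    {"smiles": "CCN", "label": float("nan")},
]
records = format_rows(rows, "smiles", "label", "solubility class")
print(records)
```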
- class ClassifictionFormatterWithExamples(representation_column, label_column, property_name, num_classes=None, qcut=True, representation_name='')[source]#
Initialize a ClassificationFormatter.
- Parameters:
representation_column (str) – The column name of the representation.
label_column (str) – The column name of the label.
property_name (str) – The name of the property.
num_classes (int, optional) – The number of classes.
qcut (bool) – Whether to use qcut to split the label into classes. Otherwise, cut is used.
representation_name (str) – Name of the representation (e.g. "SMILES").
- format_many(df)[source]#
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
- class RegressionFormatter(representation_column, label_column, property_name, num_digits=2)[source]#
Convert a dataframe to a dataframe of prompts and completions for regression.
- The default prompt template is:
{prefix}What is the {propertyname} of {representation}{suffix}{end_prompt}
- The default completion template is:
{start_completion}{label}{stop_sequence}
- By default, the following string replacements are made:
prefix -> “”
suffix -> “?”
end_prompt -> “###”
start_completion -> “ “
stop_sequence -> “@@@”
Initialize a ClassificationFormatter.
- Parameters:
- format_many(df)[source]#
Format a dataframe of representations and labels into a dataframe of prompts and completions.
This function will drop rows with missing values in the representation or label columns.
- Parameters:
df (pd.DataFrame) – A dataframe with a representation column and a label column.
- Returns:
A dataframe with a prompt column and a completion column.
- Return type:
pd.DataFrame
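The RegressionFormatter signature has a num_digits argument that is not described above. Assuming it sets the number of decimal places written into the completion (an assumption, not something the documentation confirms), a regression label could be formatted and parsed back as in this sketch. The helper names `format_label` and `parse_label` are hypothetical.

```python
STOP_SEQUENCE = "@@@"  # default stop sequence from the templates above


def format_label(value: float, num_digits: int = 2) -> str:
    """Write a numeric label with a fixed number of decimal places (assumed behaviour)."""
    return f" {value:.{num_digits}f}{STOP_SEQUENCE}"


def parse_label(completion: str) -> float:
    """Recover the number from a completion by cutting at the stop sequence."""
    return float(completion.split(STOP_SEQUENCE, 1)[0].strip())


completion = format_label(1.23456)
print(repr(completion))
print(parse_label(completion))
```

Fixing the number of digits keeps completions short and uniform, at the cost of rounding the training targets.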
Querier#
- class Querier(modelname, max_tokens=10)[source]#
Wrapper around the OpenAI API for querying a model for completions.
This class tries to be as efficient as possible by querying the API in batches. It also handles the rate limiting of the API.
Example
>>> querier = Querier("ada")
>>> df = pd.DataFrame({"prompt": ["This is a test", "This is another test"]})
>>> completions = querier.query(df)
>>> assert len(completions) == 2
>>> assert all([isinstance(c, str) for c in completions])
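The class description mentions batched queries and rate-limit handling. A generic pattern for both is sketched below; it is not the Querier implementation. `RateLimitError`, `query_in_batches`, and the `send_batch` callable are hypothetical stand-ins, and the backoff schedule is an arbitrary choice.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises when throttled."""


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def query_in_batches(prompts, send_batch, batch_size=20, max_retries=5,
                     base_delay=1.0, sleep=time.sleep):
    """Send prompts batch by batch, retrying with exponential backoff when throttled."""
    completions = []
    for batch in batched(prompts, batch_size):
        for attempt in range(max_retries):
            try:
                completions.extend(send_batch(batch))
                break
            except RateLimitError:
                sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay
        else:
            raise RuntimeError("rate limit persisted after all retries")
    return completions


# Demo with a fake backend that is throttled on its first call only.
calls = {"count": 0}


def fake_send(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RateLimitError("slow down")
    return [prompt.upper() for prompt in batch]


delays = []  # record the requested waits instead of actually sleeping
result = query_in_batches(["a", "b", "c"], fake_send, batch_size=2, sleep=delays.append)
print(result, delays)
```

Injecting the `sleep` function keeps the retry logic testable without real waiting.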
- classmethod from_preset(modelname, preset='classification')[source]#
Factory method to create a Querier from a preset.
These presets set the max_tokens parameter to a value that is appropriate for the task.
- query(df, temperature=0, logprobs=None)[source]#
Query the model for completions.
- Parameters:
- Raises:
ValueError – If df is not a pandas DataFrame
ValueError – If df does not have a column named “prompt”
AssertionError – If temperature is < 0
- Returns:
Dictionary containing the completions and logprobs
- Return type:
Tuner#
- class Tuner(base_model='ada', batch_size=None, n_epochs=4, learning_rate_multiplier=None, outdir=None, run_name=None, wandb_sync=True, write_summary=True)[source]#
Wrapper around the OpenAI API for fine tuning.
Initialize a Tuner.
- Parameters:
base_model (str) – The base model to fine tune. Defaults to “ada”.
batch_size (Optional[int]) – The batch size to use for fine tuning. Defaults to None.
n_epochs (int) – The number of epochs to fine tune for. Defaults to 4.
learning_rate_multiplier (Optional[float]) – The learning rate multiplier to use for fine tuning. The OpenAI docs state “We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results.” Defaults to None.
outdir (Union[str, Path, None]) – The directory to save the fine tuning results to. If not specified, a directory will be created in BASE_OUTDIR.
run_name (Optional[str]) – The name of the run. This is used to create the output directory.
wandb_sync (bool) – Whether to sync the results to Weights & Biases.
write_summary (bool) – Whether to write a summary of the fine tuning run to a file. Defaults to True.
- tune(train_df, validation_df=None)[source]#
Fine tune a model on a dataset.
- Parameters:
train_df (pd.DataFrame) – Training dataset.
validation_df (pd.DataFrame, optional) – Validation dataset. Defaults to None.
- Returns:
Summary of the fine tuning run.
- Return type:
- Raises:
ValueError – If no training dataset is provided.
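The legacy OpenAI fine-tuning endpoint accepted training data as a JSON Lines file in which every line is an object with a "prompt" and a "completion" key. The sketch below serializes prompt/completion records in that layout; whether Tuner writes its files exactly this way is an assumption, and the helper name `write_jsonl` is hypothetical.

```python
import io
import json


def write_jsonl(records, handle) -> None:
    """Write one JSON object per line, keeping only the prompt and completion fields."""
    for record in records:
        line = json.dumps({"prompt": record["prompt"], "completion": record["completion"]})
        handle.write(line + "\n")


records = [
    {"prompt": "What is the solubility class of CCO?###", "completion": " 1@@@"},
    {"prompt": "What is the solubility class of CCCCCCCC?###", "completion": " 0@@@"},
]

buffer = io.StringIO()  # an in-memory file; a real run would open a path on disk
write_jsonl(records, buffer)
print(buffer.getvalue())
```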
Extractor#
- class FewShotClassificationExtractor[source]#
Extract integers from completions of few-shot classification tasks.
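Extracting an integer class label from a raw completion can be done with a small regular expression. The sketch below illustrates the general idea and is not the extractor's source code; the function name `extract_int` is hypothetical, and the "@@@" stop sequence is the formatter default listed earlier.

```python
import re
from typing import Optional


def extract_int(completion: str, stop_sequence: str = "@@@") -> Optional[int]:
    """Return the first integer before the stop sequence, or None if there is none."""
    text = completion.split(stop_sequence, 1)[0]
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


print(extract_int(" 1@@@"))
print(extract_int(" 3@@@ 7"))
print(extract_int("no label here"))
```

Returning None for unparseable completions lets the caller decide whether to drop or flag those predictions.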
Evaluator#
Data#
- get_photoswitch_data()[source]#
Return the photoswitch data as a pandas DataFrame.
- Return type:
References
[GriffithsPhotoSwitches] Griffiths, K.; Halcovitch, N. R.; Griffin, J. M. Efficient Solid-State Photoswitching of Methoxyazobenzene in a Metal–Organic Framework for Thermal Energy Storage. Chemical Science 2022, 13 (10), 3014–3019.
- get_polymer_data()[source]#
Return the dataset reported in [JablonkaAL].
- Return type:
- get_moosavi_mof_data()[source]#
Return the data and features used in [MoosaviDiversity].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF.
- Return type:
- get_moosavi_cv_data()[source]#
Return the gravimetric heat capacity used in [MoosaviCp].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.
- Return type:
- get_moosavi_pcv_data()[source]#
Return the site-projected heat capacity and features used in [MoosaviCp].
You can find the original datasets on MaterialsCloud archive.
We additionally computed the MOFid [BuciorMOFid] for each MOF and dropped entries for which we could not compute the MOFid.
- Return type:
- get_qmug_data()[source]#
Return the data and features used in [QMUG].
We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.
- Return type:
- get_qmug_small_data()[source]#
Return the data and features used in [QMUG], restricted to the subset of short SMILES.
We mean-aggregated the numerical data per SMILES and additionally computed SELFIES and InChI.
- Return type:
- get_opv_data()[source]#
Return the dataset reported in [NagasawaOPV].
- Return type:
- get_freesolv_data()[source]#
Return the FreeSolv data [freesolv].
- Return type:
- get_lipophilicity_data()[source]#
Return the lipophilicity data parsed from ChEMBL [chembl].
- Return type:
- get_matbench_is_metal()[source]#
Return the “is metal” dataset from matbench [matbench].
- get_matbench_expt_gap()[source]#
Return the experimental band gap dataset from matbench [matbench].
- get_matbench_steels()[source]#
Return the steel yield strength dataset from matbench [matbench].
- get_water_stability()[source]#
Return the water stability dataset used in [waterStability].