{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The code below will:\n", "\n", "- Clean descriptor data\n", "- Create the final experimental matrix\n", "- Normalization and apply dimension reduction (PCA)\n", "- Set up machine learning data frame and randomize\n", "- Set up model and execute cross-validation\n", "- Hyperparameter tuning\n", "\n", "*All for the first aim (neat grinding experiments)*\n", "\n", "**Let's start**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Load external data\n", "\n", "1.1. Raw molecular descriptor data of all components" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 127 entries, 2-Pyrrolidinone to 3-Hydroxybenzamide\n", "Columns: 1824 entries, ABC to mZagreb2\n", "dtypes: float64(1458), int64(366)\n", "memory usage: 1.8+ MB\n", "None\n" ] } ], "source": [ "import pandas as pd\n", "\n", "descriptors = pd.read_csv('Chem.csv', index_col=0)\n", "print(descriptors.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.2. Cocrystal combinations and characterisation results" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Component1 1000 non-null object\n", " 1 Component2 1000 non-null object\n", " 2 Outcome 1000 non-null int64 \n", "dtypes: int64(1), object(2)\n", "memory usage: 23.6+ KB\n", "None\n" ] } ], "source": [ "results = pd.read_csv('Results.csv')\n", "print(results.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Descriptor data cleaning\n", "\n", "2.1. Create final data frame and drop columns with missing values" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "descriptor_set = descriptors.dropna(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.2. Drop columns with only a single unique value over all rows" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 127 entries, 2-Pyrrolidinone to 3-Hydroxybenzamide\n", "Columns: 1003 entries, ABC to mZagreb2\n", "dtypes: float64(855), int64(148)\n", "memory usage: 996.2+ KB\n", "None\n" ] } ], "source": [ "non_unique = descriptor_set.apply(pd.Series.nunique)\n", "cols_to_drop = non_unique[non_unique == 1].index\n", "descriptor_set = descriptor_set.drop(cols_to_drop, axis=1)\n", "print(descriptor_set.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new number of descriptors is reduced from 1824 to 1041.\n", "\n", "2.3. Exporting the actual descriptor_set (virtual library in csv)\n", "#descriptor_set.to_csv('descriptor_set.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create final experimental matrix containing descriptors and results\n", "\n", "3.1. Column descriptors CC_5 for final matrix" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "CC_1 = descriptor_set.columns + '_1'\n", "CC_2 = descriptor_set.columns + '_2'\n", "CC_3 = ['Result']\n", "CC_4 = CC_1.union(CC_2, sort=False)\n", "CC_5 = CC_4.union(CC_3, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "3.2. Initialize an empty list and set the random seed so that the shuffle is the same every time" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "CC_e = []\n", "np.random.seed(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.3. Fill the list with the Component 1 and Component 2 descriptors and the result (one entry per row/experiment)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils import shuffle\n", "\n", "for i in range(len(results)):\n", "    # Look up the descriptor vectors of the two components of this experiment\n", "    A = np.array(descriptor_set.loc[results.iloc[i][0]])\n", "    B = np.array(descriptor_set.loc[results.iloc[i][1]])\n", "    RS = [A, B]\n", "    # Shuffle the pair so neither component systematically occupies the first position\n", "    C = np.concatenate(shuffle(RS), axis=0)\n", "    # Append the experimental outcome as the last entry\n", "    D = [results.iloc[i][-1]]\n", "    E = np.concatenate((C, D), axis=0)\n", "    CC_e.append(E)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.4. Transform the list into a data frame to obtain the final matrix" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 1000 entries, 0 to 999\n", "Columns: 2007 entries, ABC_1 to Result\n", "dtypes: float64(2007)\n", "memory usage: 15.3 MB\n", "None\n" ] } ], "source": [ "CC = pd.DataFrame(CC_e, columns=CC_5)\n", "print(CC.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Normalize data and apply PCA\n", "\n", "4.1. Normalization of the descriptors" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "x = CC.loc[:, CC_4].values\n", "x = StandardScaler().fit_transform(x)\n", "# Split the scaled matrix back into the Component 1 and Component 2 halves\n", "sep = int((x.shape[1]*0.5))\n", "x1 = (x[:, :sep])\n", "x2 = (x[:, sep:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.2. Check that the normalized data have a mean of 0 and a standard deviation close to 1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean of the normalized data is 2.2811042962587465e-18 and the standard deviation is 0.9849313330418408.\n" ] } ], "source": [ "print('The mean of the normalized data is {} and the standard deviation is {}.'.format(np.mean(x), np.std(x)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3. Apply PCA\n", "\n", "4.3.1. For component 1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca_components = 25\n", "CC_pca1 = PCA(n_components=pca_components)\n", "CC_princomp1 = CC_pca1.fit_transform(x1)\n", "CC_principal_df1 = pd.DataFrame(data=CC_princomp1, columns=['PC_1_' + str(i + 1) for i in range(pca_components)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3.2. For component 2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "CC_pca2 = PCA(n_components=pca_components)\n", "CC_princomp2 = CC_pca2.fit_transform(x2)\n", "CC_principal_df2 = pd.DataFrame(data=CC_princomp2, columns=['PC_2_' + str(i + 1) for i in range(pca_components)])" ] },
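{ "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the loss figures below, one can optionally inspect how the retained variance accumulates over the 25 components; this is only an illustrative sketch that prints the cumulative explained variance of the already-fitted PCA objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: cumulative explained variance per principal component (sketch)\n", "cum1 = np.cumsum(CC_pca1.explained_variance_ratio_)\n", "cum2 = np.cumsum(CC_pca2.explained_variance_ratio_)\n", "for k in (5, 10, 15, 20, 25):\n", "    print('First {:2d} PCs: {:.1%} (component 1) / {:.1%} (component 2)'.format(k, cum1[k-1], cum2[k-1]))" ] },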
{ "cell_type": "markdown", "metadata": {}, "source": [ "4.3.3. Print dimension reduction results" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.888515326421647% of the information is lost by reducing 1003 features to 25.\n", "1.611298821390983% of the information is lost by reducing 1003 features to 25.\n" ] } ], "source": [ "Lost1 = (100*(1-sum(CC_pca1.explained_variance_ratio_)))\n", "nb_des1 = x1.shape[1]\n", "nb_com1 = len(CC_pca1.explained_variance_ratio_)\n", "print('{}% of the information is lost by reducing {} features to {}.'.format(Lost1, nb_des1, nb_com1))\n", "\n", "Lost2 = (100*(1-sum(CC_pca2.explained_variance_ratio_)))\n", "nb_des2 = x2.shape[1]\n", "nb_com2 = len(CC_pca2.explained_variance_ratio_)\n", "print('{}% of the information is lost by reducing {} features to {}.'.format(Lost2, nb_des2, nb_com2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Set up machine learning data frame and randomize\n", "\n", "5.1. Create the final data frame" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "CC_principal_together = pd.concat([CC_principal_df1, CC_principal_df2], axis=1)\n", "CC_df = pd.concat([CC_principal_together, CC.iloc[:, -1]], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5.2. Randomize the row order: set a random seed, permute the index, and reindex the data frame with the permuted index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "np.random.seed(2)\n", "CC_df = CC_df.reindex(np.random.permutation(CC_df.index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Model evaluation\n", "\n", "6.1. Set up classifiers and choose one from the list" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: xgboost in /opt/anaconda3/lib/python3.8/site-packages (1.3.3)\r\n", "Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.19.2)\r\n", "Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.5.2)\r\n" ] } ], "source": [ "from sklearn import svm\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "! pip install xgboost\n", "import xgboost as xgb\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn import metrics\n", "from sklearn.metrics import accuracy_score, precision_score\n", "from sklearn.metrics import confusion_matrix\n", "\n", "clf = RandomForestClassifier()\n", "# clf = ExtraTreesClassifier()\n", "# clf = AdaBoostClassifier()\n", "# clf = GradientBoostingClassifier()\n", "# clf = svm.SVC()\n", "# clf = xgb.XGBClassifier()" ] },
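{ "cell_type": "markdown", "metadata": {}, "source": [ "Before committing to a single classifier, one could optionally score all of the candidates with the same five-fold cross-validation used in 6.2; a minimal sketch (the candidate list and default settings are illustrative only, not part of the original workflow):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: compare the candidate classifiers with identical 5-fold CV\n", "candidates = {\n", "    'RandomForest': RandomForestClassifier(),\n", "    'ExtraTrees': ExtraTreesClassifier(),\n", "    'AdaBoost': AdaBoostClassifier(),\n", "    'GradientBoosting': GradientBoostingClassifier(),\n", "    'SVC': svm.SVC(),\n", "    'XGBoost': xgb.XGBClassifier()\n", "}\n", "for name, model in candidates.items():\n", "    cv = cross_val_score(model, CC_df.iloc[:, :-1], CC_df.iloc[:, -1], cv=5)\n", "    print('{}: {:.3f} (+/- {:.3f})'.format(name, cv.mean(), cv.std() * 2))" ] },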
{ "cell_type": "markdown", "metadata": {}, "source": [ "6.2. Five-fold cross-validation: mean accuracy and standard deviation" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.708 (+/- 0.0427083130081252)\n", "[0.735 0.715 0.67 0.705 0.715]\n" ] } ], "source": [ "scores = cross_val_score(clf, CC_df.iloc[:, :-1], CC_df.iloc[:, -1], cv=5)\n", "print(\"Accuracy: {} (+/- {})\".format(scores.mean(), scores.std() * 2))\n", "print(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Hyperparameter tuning\n", "\n", "7.1. Prepare everything for tuning" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: hyperopt in /opt/anaconda3/lib/python3.8/site-packages (0.2.5)\n", "Requirement already satisfied: cloudpickle in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.6.0)\n", "Requirement already satisfied: networkx>=2.2 in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (2.5)\n", "Requirement already satisfied: six in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.15.0)\n", "Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.19.2)\n", "Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.5.2)\n", "Requirement already satisfied: future in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (0.18.2)\n", "Requirement already satisfied: tqdm in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (4.50.2)\n", "Requirement already satisfied: decorator>=4.3.0 in /opt/anaconda3/lib/python3.8/site-packages (from networkx>=2.2->hyperopt) (4.4.2)\n" ] } ], "source": [ "! pip install hyperopt\n", "# Note: hyperopt is imported as hp, so the search-space helpers below are reached via hp.hp\n", "import hyperopt as hp\n", "from hyperopt import Trials, fmin, STATUS_OK\n", "\n", "list_mean = list()\n", "list_std = list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.2. Define the objective function" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def objective(params, n_folds=5):\n", "    # Build the DMatrix from the PCA features and the result column\n", "    d_train = xgb.DMatrix(CC_df.iloc[:, :-1], CC_df.iloc[:, -1])\n", "\n", "    cv_results = xgb.cv(params, d_train, nfold=n_folds, num_boost_round=500,\n", "                        early_stopping_rounds=25, metrics='auc', seed=0)\n", "\n", "    # Best mean test AUC over the boosting rounds\n", "    auc = max(cv_results['test-auc-mean'])\n", "    list_mean.append(auc)\n", "    # Standard deviation of the test AUC at the last retained round\n", "    std = cv_results.iloc[-1][-1]\n", "    list_std.append(std)\n", "    # hyperopt minimises the returned value, so use 1 - AUC as the loss\n", "    return 1-auc" ] },
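{ "cell_type": "markdown", "metadata": {}, "source": [ "If desired, the objective can be smoke-tested once with an arbitrary parameter set before launching the full search; the values below are illustrative only and not part of the original tuning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional smoke test of the objective; the returned loss is 1 - best mean test AUC\n", "example_params = {'verbosity': 0, 'max_depth': 6, 'subsample': 0.8,\n", "                  'colsample_bytree': 0.8, 'min_child_weight': 10,\n", "                  'reg_alpha': 0.5, 'reg_lambda': 0.5}\n", "print(objective(example_params))\n", "# Reset the bookkeeping lists so this test run does not mix into the real search results\n", "list_mean.clear()\n", "list_std.clear()" ] },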
{ "cell_type": "markdown", "metadata": {}, "source": [ "7.3. Search space\n", "\n", "- verbosity: 0 suppresses warnings (e.g. about parameters not used by xgb.cv)\n", "- max_depth: maximum depth allowed for each tree\n", "- subsample: fraction of rows sampled for each tree\n", "- n_estimators: number of trees to build\n", "- colsample_bytree: fraction of features sampled for each tree\n", "- min_child_weight: minimum sum of instance weight required in a child node\n", "- reg_alpha: L1 regularisation term on weights\n", "- reg_lambda: L2 regularisation term on weights" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "xgb_space = {\n", "    'verbosity': 0,\n", "    'max_depth': hp.hp.choice('max_depth', np.arange(1, 21, 1, dtype=int)),\n", "    'subsample': hp.hp.quniform('subsample', 0.5, 1.0, 0.05),\n", "    'n_estimators': hp.hp.quniform('n_estimators', 2, 201, 1),\n", "    'colsample_bytree': hp.hp.quniform('colsample_bytree', 0.5, 1.0, 0.05),\n", "    'min_child_weight': hp.hp.uniform('min_child_weight', 1, 50),\n", "    'reg_alpha': hp.hp.uniform('reg_alpha', 0.0, 1.0),\n", "    'reg_lambda': hp.hp.uniform('reg_lambda', 0.0, 1.0)\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.4. Store the results of every iteration and optimize" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100%|██████████| 200/200 [01:29<00:00, 2.23trial/s, best loss: 0.2009476] \n" ] } ], "source": [ "trials = Trials()\n", "MAX_EVALS = 200\n", "best = fmin(fn=objective, space=xgb_space, algo=hp.tpe.suggest,\n", "            max_evals=MAX_EVALS, trials=trials)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.5. Print the best parameter set together with the corresponding mean cross-validated AUC and its standard deviation (labelled 'Accuracy' in the print-out)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'colsample_bytree': 0.7000000000000001, 'max_depth': 11, 'min_child_weight': 45.06100407599634, 'n_estimators': 110.0, 'reg_alpha': 0.676410817353635, 'reg_lambda': 0.2314032507189823, 'subsample': 1.0}\n", "Accuracy: 0.7990524 (+/- 0.017233905820794076)\n" ] } ], "source": [ "print(best)\n", "\n", "print(\"Accuracy: {} (+/- {})\".format(max(list_mean), list_std[list_mean.index(max(list_mean))]))" ] },
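{ "cell_type": "markdown", "metadata": {}, "source": [ "One hyperopt detail worth noting (optional sketch, not part of the original workflow): fmin returns hp.choice parameters as indices into their value lists, so the 'max_depth': 11 above is an index into np.arange(1, 21) rather than the depth itself; hyperopt.space_eval can map the result back to the actual parameter values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: recover the actual parameter values from the fmin result\n", "from hyperopt import space_eval\n", "\n", "best_params = space_eval(xgb_space, best)\n", "print(best_params)\n", "# These values could then be unpacked into a classifier, e.g. xgb.XGBClassifier(**best_params)" ] },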
{ "cell_type": "markdown", "metadata": {}, "source": [ "8. Testing on a paracetamol set: the original data frame has to be replaced with the Paratest set, and then the following (currently commented out) can be executed to obtain the ROC AUC, accuracy and precision scores and the confusion matrix" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Unpack the tuned hyperparameters into the classifier,\n", "# fit on rows 36 onward and evaluate on the first 35 rows of the Paratest set\n", "#clf = xgb.XGBClassifier(**best)\n", "#clf.fit(Paratest_df.iloc[36:, 0:-1], Paratest_df.iloc[36:, -1])\n", "#predictions = clf.predict(Paratest_df.iloc[0:35, 0:-1])\n", "#auc1 = roc_auc_score(Paratest_df.iloc[0:35, -1], predictions)\n", "#print(auc1)\n", "\n", "#print('Accuracy Score: ' + str(accuracy_score(Paratest_df.iloc[0:35, -1], predictions)))\n", "#print('Precision Score: ' + str(precision_score(Paratest_df.iloc[0:35, -1], predictions)))\n", "#print('Confusion Matrix: \\n' + str(confusion_matrix(Paratest_df.iloc[0:35, -1], predictions)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }