{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The code below will:\n", "\n", "- Clean descriptor data\n", "- Create the final experimental matrix\n", "- Normalization and apply dimension reduction (PCA)\n", "- Set up machine learning data frame and randomize\n", "- Set up model and execute cross-validation\n", "- Hyperparameter tuning\n", "\n", "*All for the first aim (neat grinding experiments)*\n", "\n", "**Let's start**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Load external data\n", "\n", "1.1. Raw molecular descriptor data of all components" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 127 entries, 2-Pyrrolidinone to 3-Hydroxybenzamide\n", "Columns: 1824 entries, ABC to mZagreb2\n", "dtypes: float64(1458), int64(366)\n", "memory usage: 1.8+ MB\n", "None\n" ] } ], "source": [ "import pandas as pd\n", "\n", "descriptors = pd.read_csv('Chem.csv', index_col=0)\n", "print(descriptors.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.2. Cocrystal combinations and characterisation results" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Component1 1000 non-null object\n", " 1 Component2 1000 non-null object\n", " 2 Outcome 1000 non-null int64 \n", "dtypes: int64(1), object(2)\n", "memory usage: 23.6+ KB\n", "None\n" ] } ], "source": [ "results = pd.read_csv('Results.csv')\n", "print(results.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Descriptor data cleaning\n", "\n", "2.1. Create final data frame and drop columns with missing values" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "descriptor_set = descriptors.dropna(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.2. Drop columns with only a single unique value over all rows" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 127 entries, 2-Pyrrolidinone to 3-Hydroxybenzamide\n", "Columns: 1003 entries, ABC to mZagreb2\n", "dtypes: float64(855), int64(148)\n", "memory usage: 996.2+ KB\n", "None\n" ] } ], "source": [ "non_unique = descriptor_set.apply(pd.Series.nunique)\n", "cols_to_drop = non_unique[non_unique == 1].index\n", "descriptor_set = descriptor_set.drop(cols_to_drop, axis=1)\n", "print(descriptor_set.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new number of descriptors is reduced from 1824 to 1041.\n", "\n", "2.3. Exporting the actual descriptor_set (virtual library in csv)\n", "#descriptor_set.to_csv('descriptor_set.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create final experimental matrix containing descriptors and results\n", "\n", "3.1. Column descriptors CC_5 for final matrix" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "CC_1 = descriptor_set.columns + '_1'\n", "CC_2 = descriptor_set.columns + '_2'\n", "CC_3 = ['Result']\n", "CC_4 = CC_1.union(CC_2, sort=False)\n", "CC_5 = CC_4.union(CC_3, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "3.2. Initialize an empty list and set the random seed so that the shuffle is the same every time" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "CC_e = []\n", "np.random.seed(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.3. Fill the list with the Component 1 and Component 2 descriptors and the result (one entry per row/experiment)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils import shuffle\n", "\n", "for i in range(len(results)):\n", "    # Look up the descriptor vectors of the two components of this experiment\n", "    A = np.array(descriptor_set.loc[results.iloc[i][0]])\n", "    B = np.array(descriptor_set.loc[results.iloc[i][1]])\n", "    RS = [A, B]\n", "    # Shuffle the pair so neither component systematically occupies the first position\n", "    C = np.concatenate(shuffle(RS), axis=0)\n", "    # Append the experimental outcome as the last entry\n", "    D = [results.iloc[i][-1]]\n", "    E = np.concatenate((C, D), axis=0)\n", "    CC_e.append(E)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3.4. Transform the list into a data frame to obtain the final matrix" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 1000 entries, 0 to 999\n", "Columns: 2007 entries, ABC_1 to Result\n", "dtypes: float64(2007)\n", "memory usage: 15.3 MB\n", "None\n" ] } ], "source": [ "CC = pd.DataFrame(CC_e, columns=CC_5)\n", "print(CC.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Normalize data and apply PCA\n", "\n", "4.1. Normalization of the descriptors" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "x = CC.loc[:, CC_4].values\n", "x = StandardScaler().fit_transform(x)\n", "# Split the scaled matrix back into the Component 1 and Component 2 halves\n", "sep = int((x.shape[1]*0.5))\n", "x1 = (x[:, :sep])\n", "x2 = (x[:, sep:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.2. Check that the normalized data have a mean of 0 and a standard deviation close to 1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean of the normalized data is 2.2811042962587465e-18 and the standard deviation is 0.9849313330418408.\n" ] } ], "source": [ "print('The mean of the normalized data is {} and the standard deviation is {}.'.format(np.mean(x), np.std(x)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3. Apply PCA\n", "\n", "4.3.1. For component 1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca_components = 25\n", "CC_pca1 = PCA(n_components=pca_components)\n", "CC_princomp1 = CC_pca1.fit_transform(x1)\n", "CC_principal_df1 = pd.DataFrame(data=CC_princomp1, columns=['PC_1_' + str(i + 1) for i in range(pca_components)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4.3.2. For component 2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "CC_pca2 = PCA(n_components=pca_components)\n", "CC_princomp2 = CC_pca2.fit_transform(x2)\n", "CC_principal_df2 = pd.DataFrame(data=CC_princomp2, columns=['PC_2_' + str(i + 1) for i in range(pca_components)])" ] },
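{ "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the loss figures below, one can optionally inspect how the retained variance accumulates over the 25 components; this is only an illustrative sketch that prints the cumulative explained variance of the already-fitted PCA objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: cumulative explained variance per principal component (sketch)\n", "cum1 = np.cumsum(CC_pca1.explained_variance_ratio_)\n", "cum2 = np.cumsum(CC_pca2.explained_variance_ratio_)\n", "for k in (5, 10, 15, 20, 25):\n", "    print('First {:2d} PCs: {:.1%} (component 1) / {:.1%} (component 2)'.format(k, cum1[k-1], cum2[k-1]))" ] },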
{ "cell_type": "markdown", "metadata": {}, "source": [ "4.3.3. Print dimension reduction results" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.888515326421647% of the information is lost by reducing 1003 features to 25.\n", "1.611298821390983% of the information is lost by reducing 1003 features to 25.\n" ] } ], "source": [ "Lost1 = (100*(1-sum(CC_pca1.explained_variance_ratio_)))\n", "nb_des1 = x1.shape[1]\n", "nb_com1 = len(CC_pca1.explained_variance_ratio_)\n", "print('{}% of the information is lost by reducing {} features to {}.'.format(Lost1, nb_des1, nb_com1))\n", "\n", "Lost2 = (100*(1-sum(CC_pca2.explained_variance_ratio_)))\n", "nb_des2 = x2.shape[1]\n", "nb_com2 = len(CC_pca2.explained_variance_ratio_)\n", "print('{}% of the information is lost by reducing {} features to {}.'.format(Lost2, nb_des2, nb_com2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Set up machine learning data frame and randomize\n", "\n", "5.1. Create the final data frame" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "CC_principal_together = pd.concat([CC_principal_df1, CC_principal_df2], axis=1)\n", "CC_df = pd.concat([CC_principal_together, CC.iloc[:, -1]], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5.2. Randomize the row order: set a random seed, permute the index, and reindex the data frame with the permuted index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "np.random.seed(2)\n", "CC_df = CC_df.reindex(np.random.permutation(CC_df.index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Model evaluation\n", "\n", "6.1. Set up classifiers and choose one from the list" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: xgboost in /opt/anaconda3/lib/python3.8/site-packages (1.3.3)\r\n", "Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.19.2)\r\n", "Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.5.2)\r\n" ] } ], "source": [ "from sklearn import svm\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "! pip install xgboost\n", "import xgboost as xgb\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn import metrics\n", "from sklearn.metrics import accuracy_score, precision_score\n", "from sklearn.metrics import confusion_matrix\n", "\n", "clf = RandomForestClassifier()\n", "# clf = ExtraTreesClassifier()\n", "# clf = AdaBoostClassifier()\n", "# clf = GradientBoostingClassifier()\n", "# clf = svm.SVC()\n", "# clf = xgb.XGBClassifier()" ] },
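{ "cell_type": "markdown", "metadata": {}, "source": [ "Before committing to a single classifier, one could optionally score all of the candidates with the same five-fold cross-validation used in 6.2; a minimal sketch (the candidate list and default settings are illustrative only, not part of the original workflow):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: compare the candidate classifiers with identical 5-fold CV\n", "candidates = {\n", "    'RandomForest': RandomForestClassifier(),\n", "    'ExtraTrees': ExtraTreesClassifier(),\n", "    'AdaBoost': AdaBoostClassifier(),\n", "    'GradientBoosting': GradientBoostingClassifier(),\n", "    'SVC': svm.SVC(),\n", "    'XGBoost': xgb.XGBClassifier()\n", "}\n", "for name, model in candidates.items():\n", "    cv = cross_val_score(model, CC_df.iloc[:, :-1], CC_df.iloc[:, -1], cv=5)\n", "    print('{}: {:.3f} (+/- {:.3f})'.format(name, cv.mean(), cv.std() * 2))" ] },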
{ "cell_type": "markdown", "metadata": {}, "source": [ "6.2. Five-fold cross-validation: mean accuracy and standard deviation" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.708 (+/- 0.0427083130081252)\n", "[0.735 0.715 0.67 0.705 0.715]\n" ] } ], "source": [ "scores = cross_val_score(clf, CC_df.iloc[:, :-1], CC_df.iloc[:, -1], cv=5)\n", "print(\"Accuracy: {} (+/- {})\".format(scores.mean(), scores.std() * 2))\n", "print(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Hyperparameter tuning\n", "\n", "7.1. Prepare everything for tuning" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: hyperopt in /opt/anaconda3/lib/python3.8/site-packages (0.2.5)\n", "Requirement already satisfied: cloudpickle in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.6.0)\n", "Requirement already satisfied: networkx>=2.2 in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (2.5)\n", "Requirement already satisfied: six in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.15.0)\n", "Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.19.2)\n", "Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.5.2)\n", "Requirement already satisfied: future in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (0.18.2)\n", "Requirement already satisfied: tqdm in /opt/anaconda3/lib/python3.8/site-packages (from hyperopt) (4.50.2)\n", "Requirement already satisfied: decorator>=4.3.0 in /opt/anaconda3/lib/python3.8/site-packages (from networkx>=2.2->hyperopt) (4.4.2)\n" ] } ], "source": [ "! pip install hyperopt\n", "# Note: hyperopt is imported as hp, so the search-space helpers below are reached via hp.hp\n", "import hyperopt as hp\n", "from hyperopt import Trials, fmin, STATUS_OK\n", "\n", "list_mean = list()\n", "list_std = list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.2. Define the objective function" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def objective(params, n_folds=5):\n", "    # Build the DMatrix from the PCA features and the result column\n", "    d_train = xgb.DMatrix(CC_df.iloc[:, :-1], CC_df.iloc[:, -1])\n", "\n", "    cv_results = xgb.cv(params, d_train, nfold=n_folds, num_boost_round=500,\n", "                        early_stopping_rounds=25, metrics='auc', seed=0)\n", "\n", "    # Best mean test AUC over the boosting rounds\n", "    auc = max(cv_results['test-auc-mean'])\n", "    list_mean.append(auc)\n", "    # Standard deviation of the test AUC at the last retained round\n", "    std = cv_results.iloc[-1][-1]\n", "    list_std.append(std)\n", "    # hyperopt minimises the returned value, so use 1 - AUC as the loss\n", "    return 1-auc" ] },
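{ "cell_type": "markdown", "metadata": {}, "source": [ "If desired, the objective can be smoke-tested once with an arbitrary parameter set before launching the full search; the values below are illustrative only and not part of the original tuning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional smoke test of the objective; the returned loss is 1 - best mean test AUC\n", "example_params = {'verbosity': 0, 'max_depth': 6, 'subsample': 0.8,\n", "                  'colsample_bytree': 0.8, 'min_child_weight': 10,\n", "                  'reg_alpha': 0.5, 'reg_lambda': 0.5}\n", "print(objective(example_params))\n", "# Reset the bookkeeping lists so this test run does not mix into the real search results\n", "list_mean.clear()\n", "list_std.clear()" ] },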
{ "cell_type": "markdown", "metadata": {}, "source": [ "7.3. Search space\n", "\n", "- verbosity: 0 suppresses warnings (e.g. about parameters not used by xgb.cv)\n", "- max_depth: maximum depth allowed for each tree\n", "- subsample: fraction of rows sampled for each tree\n", "- n_estimators: number of trees to build\n", "- colsample_bytree: fraction of features sampled for each tree\n", "- min_child_weight: minimum sum of instance weight required in a child node\n", "- reg_alpha: L1 regularisation term on weights\n", "- reg_lambda: L2 regularisation term on weights" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "xgb_space = {\n", "    'verbosity': 0,\n", "    'max_depth': hp.hp.choice('max_depth', np.arange(1, 21, 1, dtype=int)),\n", "    'subsample': hp.hp.quniform('subsample', 0.5, 1.0, 0.05),\n", "    'n_estimators': hp.hp.quniform('n_estimators', 2, 201, 1),\n", "    'colsample_bytree': hp.hp.quniform('colsample_bytree', 0.5, 1.0, 0.05),\n", "    'min_child_weight': hp.hp.uniform('min_child_weight', 1, 50),\n", "    'reg_alpha': hp.hp.uniform('reg_alpha', 0.0, 1.0),\n", "    'reg_lambda': hp.hp.uniform('reg_lambda', 0.0, 1.0)\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.4. Store the results of every iteration and optimize" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100%|██████████| 200/200 [01:29<00:00, 2.23trial/s, best loss: 0.2009476] \n" ] } ], "source": [ "trials = Trials()\n", "MAX_EVALS = 200\n", "best = fmin(fn=objective, space=xgb_space, algo=hp.tpe.suggest,\n", "            max_evals=MAX_EVALS, trials=trials)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7.5. Print the best parameter set together with the corresponding mean cross-validated AUC and its standard deviation (labelled 'Accuracy' in the print-out)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'colsample_bytree': 0.7000000000000001, 'max_depth': 11, 'min_child_weight': 45.06100407599634, 'n_estimators': 110.0, 'reg_alpha': 0.676410817353635, 'reg_lambda': 0.2314032507189823, 'subsample': 1.0}\n", "Accuracy: 0.7990524 (+/- 0.017233905820794076)\n" ] } ], "source": [ "print(best)\n", "\n", "print(\"Accuracy: {} (+/- {})\".format(max(list_mean), list_std[list_mean.index(max(list_mean))]))" ] },
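{ "cell_type": "markdown", "metadata": {}, "source": [ "One hyperopt detail worth noting (optional sketch, not part of the original workflow): fmin returns hp.choice parameters as indices into their value lists, so the 'max_depth': 11 above is an index into np.arange(1, 21) rather than the depth itself; hyperopt.space_eval can map the result back to the actual parameter values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: recover the actual parameter values from the fmin result\n", "from hyperopt import space_eval\n", "\n", "best_params = space_eval(xgb_space, best)\n", "print(best_params)\n", "# These values could then be unpacked into a classifier, e.g. xgb.XGBClassifier(**best_params)" ] },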
{ "cell_type": "markdown", "metadata": {}, "source": [ "8. Testing on a paracetamol set: the original data frame has to be replaced with the Paratest set, and then the following (currently commented out) can be executed to obtain the ROC AUC, accuracy and precision scores and the confusion matrix" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Unpack the tuned hyperparameters into the classifier,\n", "# fit on rows 36 onward and evaluate on the first 35 rows of the Paratest set\n", "#clf = xgb.XGBClassifier(**best)\n", "#clf.fit(Paratest_df.iloc[36:, 0:-1], Paratest_df.iloc[36:, -1])\n", "#predictions = clf.predict(Paratest_df.iloc[0:35, 0:-1])\n", "#auc1 = roc_auc_score(Paratest_df.iloc[0:35, -1], predictions)\n", "#print(auc1)\n", "\n", "#print('Accuracy Score: ' + str(accuracy_score(Paratest_df.iloc[0:35, -1], predictions)))\n", "#print('Precision Score: ' + str(precision_score(Paratest_df.iloc[0:35, -1], predictions)))\n", "#print('Confusion Matrix: \\n' + str(confusion_matrix(Paratest_df.iloc[0:35, -1], predictions)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }