Commit 1f324723
Changed files (17)
examples/embeddings/Classification.ipynb
@@ -0,0 +1,130 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Classification using the embeddings\n",
+ "\n",
+ "In the classification task we predict one of the predefined categories given an input. We will predict the score based on the embedding of the review's text, where the algorithm is correct only if it guesses the exact number of stars. We split the dataset into a training and a testing set for all the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
+ "\n",
+ "In the following example we're predicting the number of stars in a review, from 1 to 5."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "              precision    recall  f1-score   support\n",
+ "\n",
+ "           1       0.82      0.67      0.74        21\n",
+ "           2       0.50      0.50      0.50         6\n",
+ "           3       1.00      0.46      0.63        13\n",
+ "           4       0.75      0.35      0.48        17\n",
+ "           5       0.88      1.00      0.93       143\n",
+ "\n",
+ "    accuracy                           0.86       200\n",
+ "   macro avg       0.79      0.60      0.66       200\n",
+ "weighted avg       0.86      0.86      0.84       200\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import classification_report, accuracy_score\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size = 0.2, random_state=42)\n",
+ "\n",
+ "clf = RandomForestClassifier(n_estimators=100)\n",
+ "clf.fit(X_train, y_train)\n",
+ "preds = clf.predict(X_test)\n",
+ "probas = clf.predict_proba(X_test)\n",
+ "\n",
+ "report = classification_report(y_test, preds)\n",
+ "print(report)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the model has learned to distinguish between the categories reasonably well. 5-star reviews show the best performance overall, which is not too surprising, since they are the most common in the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "RandomForestClassifier() - Average precision score over all classes: 0.93\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 648x720 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from utils import plot_multiclass_precision_recall\n",
+ "\n",
+ "plot_multiclass_precision_recall(probas, y_test, [1,2,3,4,5], clf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Unsurprisingly, 5-star and 1-star reviews seem to be easier to predict. Perhaps with more data, the nuances between 2-4 stars could be better predicted, but there's also probably more subjectivity in how people use the in-between scores."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
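Review note on Classification.ipynb: the support column of the report shows that 143 of the 200 test reviews are 5-star, so the accuracy figure is easier to judge next to a majority-class baseline. The sketch below is not part of the commit; `majority_baseline_accuracy` is a hypothetical helper, and the counts are copied from the classification report in the notebook output.

```python
def majority_baseline_accuracy(support_by_class):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(support_by_class.values())
    if total == 0:
        raise ValueError("support counts must not all be zero")
    return max(support_by_class.values()) / total


# Support counts per star rating, copied from the classification report.
support = {1: 21, 2: 6, 3: 13, 4: 17, 5: 143}
baseline = majority_baseline_accuracy(support)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

Under these counts, always guessing 5 stars would already be right for 143 of 200 reviews, so the reported 0.86 accuracy is best read as an improvement over that baseline rather than over chance.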
examples/embeddings/Clustering.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Clustering\n",
+ "\n",
+ "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(1000, 2048)"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
+ "matrix = np.vstack(df.babbage_similarity.values)\n",
+ "matrix.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Find the clusters using K-means"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Cluster\n",
+ "2 2.543478\n",
+ "3 4.374046\n",
+ "0 4.709402\n",
+ "1 4.832099\n",
+ "Name: Score, dtype: float64"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from sklearn.cluster import KMeans\n",
+ "\n",
+ "n_clusters = 4\n",
+ "\n",
+ "kmeans = KMeans(n_clusters = n_clusters,init='k-means++',random_state=42)\n",
+ "kmeans.fit(matrix)\n",
+ "labels = kmeans.labels_\n",
+ "df['Cluster'] = labels\n",
+ "\n",
+ "df.groupby('Cluster').Score.mean().sort_values()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It looks like cluster 2 focused on negative reviews, while clusters 0 and 1 focused on positive reviews."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from sklearn.manifold import TSNE\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)\n",
+ "vis_dims2 = tsne.fit_transform(matrix)\n",
+ "\n",
+ "x = [x for x,y in vis_dims2]\n",
+ "y = [y for x,y in vis_dims2]\n",
+ "\n",
+ "for category, color in enumerate(['purple', 'green', 'red', 'blue']):\n",
+ "    xs = np.array(x)[df.Cluster==category]\n",
+ "    ys = np.array(y)[df.Cluster==category]\n",
+ "    plt.scatter(xs, ys, color=color, alpha=0.3)\n",
+ "\n",
+ "    avg_x = xs.mean()\n",
+ "    avg_y = ys.mean()\n",
+ "    \n",
+ "    plt.scatter(avg_x, avg_y, marker='x', color=color, s=100)\n",
+ "plt.title(\"Clusters identified visualized in language 2d using t-SNE\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Visualization of clusters in a 2d projection. The red cluster clearly represents negative reviews. The blue cluster seems quite different from the others. Let's see a few samples from each cluster."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Text samples in the clusters & naming the clusters\n",
+ "\n",
+ "Let's show random samples from each cluster. We'll use davinci-instruct-beta-v3 to name the clusters, based on a random sample of 3 reviews from that cluster."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Cluster 0 Theme: All of the customer reviews mention the great flavor of the product.\n",
+ "5, French Vanilla Cappuccino: Great price. Really love the the flavor. No need to add anything to \n",
+ "5, great coffee: A bit pricey once you add the S & H but this is one of the best flavor\n",
+ "5, Love It: First let me say I'm new to drinking tea. So you're not getting a well\n",
+ "----------------------------------------------------------------------------------------------------\n",
+ "Cluster 1 Theme: All three reviews mention the quality of the product.\n",
+ "5, Beautiful: I don't plan to grind these, have plenty other peppers for that. I go\n",
+ "5, Awesome: I can't find this in the stores and thought I would like it. So I bou\n",
+ "5, Came as expected: It was tasty and fresh. The other one I bought was old and tasted mold\n",
+ "----------------------------------------------------------------------------------------------------\n",
+ "Cluster 2 Theme: All reviews are about customer's disappointment.\n",
+ "1, Disappointed...: I should read the fine print, I guess. I mostly went by the picture a\n",
+ "5, Excellent but Price?: I first heard about this on America's Test Kitchen where it won a blin\n",
+ "1, Disappointed: I received the offer from Amazon and had never tried this brand before\n",
+ "----------------------------------------------------------------------------------------------------\n",
+ "Cluster 3 Theme: The reviews for these products have in common that the customers' dogs love them.\n",
+ "5, My Dog's Favorite Snack!: I was first introduced to this snack at my dog's training classes at p\n",
+ "4, Fruitables Crunchy Dog Treats: My lab goes wild for these and I am almost tempted to have a go at som\n",
+ "5, Happy with the product: My dog was suffering with itchy skin. He had been eating Natural Choi\n",
+ "----------------------------------------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "import openai\n",
+ "\n",
+ "# Read a few reviews that belong to each cluster.\n",
+ "rev_per_cluster = 3\n",
+ "\n",
+ "for i in range(n_clusters):\n",
+ "    print(f\"Cluster {i} Theme:\", end=\" \")\n",
+ "    \n",
+ "    reviews = \"\\n\".join(df[df.Cluster == i].combined.str.replace(\"Title: \", \"\").str.replace(\"; Content: \", \": \").sample(rev_per_cluster, random_state=42).values)\n",
+ "    response = openai.Completion.create(\n",
+ "        engine=\"davinci-instruct-beta-v3\",\n",
+ "        prompt=f\"What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\\\"\\\"\\\"\\n{reviews}\\n\\\"\\\"\\\"\\n\\nTheme:\",\n",
+ "        temperature=0,\n",
+ "        max_tokens=64,\n",
+ "        top_p=1,\n",
+ "        frequency_penalty=0,\n",
+ "        presence_penalty=0\n",
+ "    )\n",
+ "    print(response[\"choices\"][0][\"text\"].replace('\\n',''))\n",
+ "\n",
+ "    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42) \n",
+ "    for j in range(rev_per_cluster):\n",
+ "        print(sample_cluster_rows.Score.values[j], end=\", \")\n",
+ "        print(sample_cluster_rows.Summary.values[j], end=\": \")\n",
+ "        print(sample_cluster_rows.Text.str[:70].values[j])\n",
+ "    \n",
+ "    print(\"-\" * 100)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Based on the average rating per cluster, we can see that Cluster 2 contains mostly negative reviews. Clusters 0 and 1 contain mostly positive reviews, whilst Cluster 3 appears to contain reviews about dog products."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It's important to note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on the largest discrepancies in the data."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
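Review note on Clustering.ipynb: the cluster-naming cell builds its prompt inline, which makes the string handling hard to check without calling the API. A minimal sketch of the same prompt construction as a plain function follows. `clean_review` and `build_theme_prompt` are hypothetical names, and the `Title: ...; Content: ...` layout is assumed from the `combined` column built in Obtain_dataset.ipynb.

```python
def clean_review(combined):
    """Turn 'Title: T; Content: C' into 'T: C' (layout assumed, see note above)."""
    return combined.replace("Title: ", "", 1).replace("; Content: ", ": ", 1)


def build_theme_prompt(combined_reviews):
    """Build the 'what do these reviews have in common' prompt for one cluster."""
    reviews = "\n".join(clean_review(r) for r in combined_reviews)
    return (
        "What do the following customer reviews have in common?\n\n"
        'Customer reviews:\n"""\n'
        f"{reviews}\n"
        '"""\n\nTheme:'
    )


prompt = build_theme_prompt(
    ["Title: Good; Content: Tasty", "Title: Bad; Content: Stale"]
)
print(prompt)
```

Keeping the prompt in a pure function lets the separator handling be checked on two or three strings before any completion request is sent.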
examples/embeddings/Code_search.ipynb
@@ -0,0 +1,396 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Code search\n",
+ "\n",
+ "We index our own openai-python code repository and show how it can be searched. We implement a simple version of file parsing and function extraction for Python files."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Total number of py files: 40\n",
+ "Total number of functions extracted: 64\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "from glob import glob\n",
+ "import pandas as pd\n",
+ "\n",
+ "def get_function_name(code):\n",
+ "    \"\"\"\n",
+ "    Extract function name from a line beginning with \"def \"\n",
+ "    \"\"\"\n",
+ "    assert code.startswith(\"def \")\n",
+ "    return code[len(\"def \"): code.index(\"(\")]\n",
+ "\n",
+ "def get_until_no_space(all_lines, i) -> str:\n",
+ "    \"\"\"\n",
+ "    Get all lines until a line outside the function definition is found.\n",
+ "    \"\"\"\n",
+ "    ret = [all_lines[i]]\n",
+ "    for j in range(i + 1, i + 10000):\n",
+ "        if j < len(all_lines):\n",
+ "            if len(all_lines[j]) == 0 or all_lines[j][0] in [\" \", \"\\t\", \")\"]:\n",
+ "                ret.append(all_lines[j])\n",
+ "            else:\n",
+ "                break\n",
+ "    return \"\\n\".join(ret)\n",
+ "\n",
+ "def get_functions(filepath):\n",
+ "    \"\"\"\n",
+ "    Get all functions in a Python file.\n",
+ "    \"\"\"\n",
+ "    whole_code = open(filepath).read().replace(\"\\r\", \"\\n\")\n",
+ "    all_lines = whole_code.split(\"\\n\")\n",
+ "    for i, l in enumerate(all_lines):\n",
+ "        if l.startswith(\"def \"):\n",
+ "            code = get_until_no_space(all_lines, i)\n",
+ "            function_name = get_function_name(code)\n",
+ "            yield {\"code\": code, \"function_name\": function_name, \"filepath\": filepath}\n",
+ "\n",
+ "\n",
+ "# get user root directory\n",
+ "root_dir = os.path.expanduser(\"~\")\n",
+ "\n",
+ "# path to code repository directory\n",
+ "code_root = root_dir + \"/openai-python\"\n",
+ "code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
+ "print(\"Total number of py files:\", len(code_files))\n",
+ "all_funcs = []\n",
+ "for code_file in code_files:\n",
+ "    funcs = list(get_functions(code_file))\n",
+ "    for func in funcs:\n",
+ "        all_funcs.append(func)\n",
+ "\n",
+ "print(\"Total number of functions extracted:\", len(all_funcs))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For code search we use babbage-code-search-code to obtain embeddings for code snippets, and babbage-code-search-text to embed natural language queries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>code</th>\n",
+ " <th>function_name</th>\n",
+ " <th>filepath</th>\n",
+ " <th>code_embedding</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>def semantic_search(engine, query, documents):...</td>\n",
+ " <td>semantic_search</td>\n",
+ " <td>/examples/semanticsearch/semanticsearch.py</td>\n",
+ " <td>[-0.038976121693849564, -0.0031428150832653046...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>def main():\\n parser = argparse.ArgumentPar...</td>\n",
+ " <td>main</td>\n",
+ " <td>/examples/semanticsearch/semanticsearch.py</td>\n",
+ " <td>[-0.024289356544613838, -0.017748363316059113,...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>def get_candidates(\\n prompt: str,\\n sto...</td>\n",
+ " <td>get_candidates</td>\n",
+ " <td>/examples/codex/backtranslation.py</td>\n",
+ " <td>[-0.04161201789975166, -0.0169310811907053, 0....</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>def rindex(lst: List, value: str) -> int:\\n ...</td>\n",
+ " <td>rindex</td>\n",
+ " <td>/examples/codex/backtranslation.py</td>\n",
+ " <td>[-0.027255680412054062, -0.007931121625006199,...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>def eval_candidate(\\n candidate_answer: str...</td>\n",
+ " <td>eval_candidate</td>\n",
+ " <td>/examples/codex/backtranslation.py</td>\n",
+ " <td>[-0.00999179296195507, -0.01640152558684349, 0...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " code function_name \\\n",
+ "0 def semantic_search(engine, query, documents):... semantic_search \n",
+ "1 def main():\\n parser = argparse.ArgumentPar... main \n",
+ "2 def get_candidates(\\n prompt: str,\\n sto... get_candidates \n",
+ "3 def rindex(lst: List, value: str) -> int:\\n ... rindex \n",
+ "4 def eval_candidate(\\n candidate_answer: str... eval_candidate \n",
+ "\n",
+ " filepath \\\n",
+ "0 /examples/semanticsearch/semanticsearch.py \n",
+ "1 /examples/semanticsearch/semanticsearch.py \n",
+ "2 /examples/codex/backtranslation.py \n",
+ "3 /examples/codex/backtranslation.py \n",
+ "4 /examples/codex/backtranslation.py \n",
+ "\n",
+ " code_embedding \n",
+ "0 [-0.038976121693849564, -0.0031428150832653046... \n",
+ "1 [-0.024289356544613838, -0.017748363316059113,... \n",
+ "2 [-0.04161201789975166, -0.0169310811907053, 0.... \n",
+ "3 [-0.027255680412054062, -0.007931121625006199,... \n",
+ "4 [-0.00999179296195507, -0.01640152558684349, 0... "
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from utils import get_embedding\n",
+ "\n",
+ "df = pd.DataFrame(all_funcs)\n",
+ "df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='babbage-code-search-code'))\n",
+ "df['filepath'] = df['filepath'].apply(lambda x: x.replace(code_root, \"\"))\n",
+ "df.to_csv(\"output/code_search_openai-python.csv\", index=False)\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.681\n",
+ "def test_completions_multiple_prompts():\n",
+ " result = openai.Completion.create(\n",
+ " prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
+ " )\n",
+ " assert len(result.choices) == 10\n",
+ "\n",
+ "----------------------------------------------------------------------\n",
+ "/openai/tests/test_endpoints.py:test_completions score=0.675\n",
+ "def test_completions():\n",
+ " result = openai.Completion.create(prompt=\"This was a test\", n=5, engine=\"ada\")\n",
+ " assert len(result.choices) == 5\n",
+ "\n",
+ "\n",
+ "----------------------------------------------------------------------\n",
+ "/openai/tests/test_api_requestor.py:test_requestor_sets_request_id score=0.635\n",
+ "def test_requestor_sets_request_id(mocker: MockerFixture) -> None:\n",
+ " # Fake out 'requests' and confirm that the X-Request-Id header is set.\n",
+ "\n",
+ " got_headers = {}\n",
+ "\n",
+ " def fake_request(self, *args, **kwargs):\n",
+ " nonlocal got_headers\n",
+ "----------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "from utils import cosine_similarity\n",
+ "\n",
+ "def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n",
+ "    embedding = get_embedding(code_query, engine='babbage-code-search-text')\n",
+ "    df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n",
+ "\n",
+ "    res = df.sort_values('similarities', ascending=False).head(n)\n",
+ "    if pprint:\n",
+ "        for r in res.iterrows():\n",
+ "            print(r[1].filepath+\":\"+r[1].function_name + \"  score=\" + str(round(r[1].similarities, 3)))\n",
+ "            print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n",
+ "            print('-'*70)\n",
+ "    return res\n",
+ "res = search_functions(df, 'Completions API tests', n=3)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/openai/validators.py:format_inferrer_validator score=0.655\n",
+ "def format_inferrer_validator(df):\n",
+ " \"\"\"\n",
+ " This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
+ " It will also suggest to use ada, --no_packing and explain train/validation split benefits.\n",
+ " \"\"\"\n",
+ " ft_type = infer_task_type(df)\n",
+ " immediate_msg = None\n",
+ "----------------------------------------------------------------------\n",
+ "/openai/validators.py:long_examples_validator score=0.649\n",
+ "def long_examples_validator(df):\n",
+ " \"\"\"\n",
+ " This validator will suggest to the user to remove examples that are too long.\n",
+ " \"\"\"\n",
+ " immediate_msg = None\n",
+ " optional_msg = None\n",
+ " optional_fn = None\n",
+ "----------------------------------------------------------------------\n",
+ "/openai/validators.py:non_empty_completion_validator score=0.646\n",
+ "def non_empty_completion_validator(df):\n",
+ " \"\"\"\n",
+ " This validator will ensure that no completion is empty.\n",
+ " \"\"\"\n",
+ " necessary_msg = None\n",
+ " necessary_fn = None\n",
+ " immediate_msg = None\n",
+ "----------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_functions(df, 'fine-tuning input data validation logic', n=3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/openai/validators.py:common_completion_suffix_validator score=0.665\n",
+ "def common_completion_suffix_validator(df):\n",
+ " \"\"\"\n",
+ " This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
+ " \"\"\"\n",
+ " error_msg = None\n",
+ " immediate_msg = None\n",
+ " optional_msg = None\n",
+ " optional_fn = None\n",
+ "\n",
+ " ft_type = infer_task_type(df)\n",
+ "----------------------------------------------------------------------\n",
+ "/openai/validators.py:get_outfnames score=0.66\n",
+ "def get_outfnames(fname, split):\n",
+ " suffixes = [\"_train\", \"_valid\"] if split else [\"\"]\n",
+ " i = 0\n",
+ " while True:\n",
+ " index_suffix = f\" ({i})\" if i > 0 else \"\"\n",
+ " candidate_fnames = [\n",
+ " fname.split(\".\")[0] + \"_prepared\" + suffix + index_suffix + \".jsonl\"\n",
+ " for suffix in suffixes\n",
+ " ]\n",
+ " if not any(os.path.isfile(f) for f in candidate_fnames):\n",
+ "----------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_functions(df, 'find common suffix', n=2, n_lines=10)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/openai/cli.py:tools_register score=0.651\n",
+ "def tools_register(parser):\n",
+ " subparsers = parser.add_subparsers(\n",
+ " title=\"Tools\", help=\"Convenience client side tools\"\n",
+ " )\n",
+ "\n",
+ " def help(args):\n",
+ " parser.print_help()\n",
+ "\n",
+ " parser.set_defaults(func=help)\n",
+ "\n",
+ " sub = subparsers.add_parser(\"fine_tunes.prepare_data\")\n",
+ " sub.add_argument(\n",
+ " \"-f\",\n",
+ " \"--file\",\n",
+ " required=True,\n",
+ " help=\"JSONL, JSON, CSV, TSV, TXT or XLSX file containing prompt-completion examples to be analyzed.\"\n",
+ " \"This should be the local file path.\",\n",
+ " )\n",
+ " sub.add_argument(\n",
+ " \"-q\",\n",
+ "----------------------------------------------------------------------\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_functions(df, 'Command line interface for fine-tuning', n=1, n_lines=20)"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
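Review note on Code_search.ipynb: the notebook extracts functions by scanning for lines that start with `def ` and collecting the indented lines that follow. As an alternative technique (not what the commit does), the standard-library `ast` module can locate top-level functions from the parsed syntax tree. The sketch below assumes Python 3.8 or newer for `ast.get_source_segment`, whereas the notebook metadata lists 3.7.3; `extract_functions` is a hypothetical name.

```python
import ast


def extract_functions(source, filepath="<memory>"):
    """Return top-level function definitions found in Python source text."""
    tree = ast.parse(source)
    functions = []
    for node in tree.body:  # top-level statements only, like the line scan
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(
                {
                    # Source text from the def line to the end of the body.
                    "code": ast.get_source_segment(source, node),
                    "function_name": node.name,
                    "filepath": filepath,
                }
            )
    return functions


sample = (
    "import os\n\n"
    "def add(a, b):\n    return a + b\n\n"
    "class C:\n    def method(self):\n        pass\n\n"
    "def main():\n    print(add(1, 2))\n"
)
for func in extract_functions(sample):
    print(func["function_name"])
```

Like the line-based scan, this only picks up module-level functions and skips methods. Unlike it, parsing fails with a `SyntaxError` on files that are not valid Python for the running interpreter, so a caller would likely want to catch that per file.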
examples/embeddings/Get_embeddings.ipynb
@@ -0,0 +1,107 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Get embeddings\n",
+ "\n",
+ "The function `get_embedding` will give us an embedding for an input text."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "12288"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import openai\n",
+ "\n",
+ "embedding = openai.Engine(id=\"davinci-similarity\").embeddings(input=\"Sample document text goes here\")['data'][0]['embedding']\n",
+ "len(embedding)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1024\n"
+ ]
+ }
+ ],
+ "source": [
+ "import openai\n",
+ "from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
+ "\n",
+ "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
+ "def get_embedding(text, engine=\"davinci-similarity\"):\n",
+ "\n",
+ "    # replace newlines, which can negatively affect performance.\n",
+ "    text = text.replace(\"\\n\", \" \")\n",
+ "\n",
+ "    return openai.Engine(id=engine).embeddings(input = [text])['data'][0]['embedding']\n",
+ "\n",
+ "embedding = get_embedding(\"Sample query text goes here\", engine=\"ada-search-query\")\n",
+ "print(len(embedding))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1024\n"
+ ]
+ }
+ ],
+ "source": [
+ "embedding = get_embedding(\"Sample document text goes here\", engine=\"ada-search-document\")\n",
+ "print(len(embedding))"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
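Review note on Get_embeddings.ipynb: `get_embedding` is wrapped in a `tenacity` retry decorator with randomized exponential waits and a cap of six attempts. To illustrate the general idea only, here is a dependency-free sketch of retrying with randomized exponential backoff. It is not tenacity's implementation, and `retry_with_backoff` with its parameters is hypothetical; the `sleep` argument is injectable so the behavior can be checked without actually waiting.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=20.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or max_attempts calls have failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles on every failure, capped at max_wait.
            upper = min(max_wait, min_wait * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))


calls = {"n": 0}


def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, "after", calls["n"], "calls")
```

The random component spreads out retries from many concurrent callers, which is the usual motivation for jittered waits when an API is rate limited.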
examples/embeddings/Obtain_dataset.ipynb
@@ -0,0 +1,192 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Load the dataset\n",
+ "\n",
+ "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. For illustration purposes, we will use a subset of this dataset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",
+ "\n",
+ "We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Time</th>\n",
+ " <th>ProductId</th>\n",
+ " <th>UserId</th>\n",
+ " <th>Score</th>\n",
+ " <th>Summary</th>\n",
+ " <th>Text</th>\n",
+ " <th>combined</th>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Id</th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " <th></th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>1303862400</td>\n",
+ " <td>B001E4KFG0</td>\n",
+ " <td>A3SGXH7AUHU8GW</td>\n",
+ " <td>5</td>\n",
+ " <td>Good Quality Dog Food</td>\n",
+ " <td>I have bought several of the Vitality canned d...</td>\n",
+ " <td>Title: Good Quality Dog Food; Content: I have ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>1346976000</td>\n",
+ " <td>B00813GRG4</td>\n",
+ " <td>A1D87F6ZCVE5NK</td>\n",
+ " <td>1</td>\n",
+ " <td>Not as Advertised</td>\n",
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
+ " <td>Title: Not as Advertised; Content: Product arr...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Time ProductId UserId Score Summary \\\n",
+ "Id \n",
+ "1 1303862400 B001E4KFG0 A3SGXH7AUHU8GW 5 Good Quality Dog Food \n",
+ "2 1346976000 B00813GRG4 A1D87F6ZCVE5NK 1 Not as Advertised \n",
+ "\n",
+ " Text \\\n",
+ "Id \n",
+ "1 I have bought several of the Vitality canned d... \n",
+ "2 Product arrived labeled as Jumbo Salted Peanut... \n",
+ "\n",
+ " combined \n",
+ "Id \n",
+ "1 Title: Good Quality Dog Food; Content: I have ... \n",
+ "2 Title: Not as Advertised; Content: Product arr... "
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "df = pd.read_csv('input/Reviews.csv', index_col=0)\n",
+ "df = df[['Time', 'ProductId', 'UserId', 'Score', 'Summary', 'Text']]\n",
+ "df = df.dropna()\n",
+ "df['combined'] = \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n",
+ "df.head(2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1000"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# subsample to 1k most recent reviews and remove samples that are too long\n",
+ "df = df.sort_values('Time').tail(1_100)\n",
+ "df.drop('Time', axis=1, inplace=True)\n",
+ "\n",
+ "from transformers import GPT2TokenizerFast\n",
+ "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
+ "\n",
+ "# remove reviews that are too long\n",
+ "df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
+ "df = df[df.n_tokens<2000].tail(1_000)\n",
+ "len(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Get embeddings and save them for future reuse"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from utils import get_embedding\n",
+ "\n",
+ "# This will take just under 10 minutes\n",
+ "df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='babbage-similarity'))\n",
+ "df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='babbage-search-document'))\n",
+ "df.to_csv('output/embedded_1k_reviews.csv')"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
examples/embeddings/Regression.ipynb
@@ -0,0 +1,109 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Regression using the embeddings\n",
+ "\n",
+ "Regression means predicting a number rather than one of a fixed set of categories. We will predict the score based on the embedding of the review's text. We split the dataset into a training and a testing set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
+ "\n",
+ "We're predicting the score of the review, which is a number between 1 and 5 (1-star being negative and 5-star positive)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Babbage similarity embedding performance on 1k Amazon reviews: mse=0.38, mae=0.39\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size = 0.2, random_state=42)\n",
+ "\n",
+ "rfr = RandomForestRegressor(n_estimators=100)\n",
+ "rfr.fit(X_train, y_train)\n",
+ "preds = rfr.predict(X_test)\n",
+ "\n",
+ "\n",
+ "mse = mean_squared_error(y_test, preds)\n",
+ "mae = mean_absolute_error(y_test, preds)\n",
+ "\n",
+ "print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dummy mean prediction performance on Amazon reviews: mse=1.77, mae=1.04\n"
+ ]
+ }
+ ],
+ "source": [
+ "bmse = mean_squared_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
+ "bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
+ "print(f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the embeddings are able to predict the scores with a mean absolute error of 0.39 per score prediction. This is roughly equivalent to predicting 2 out of 3 reviews perfectly and being off by one star on the remaining 1 out of 3."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You could also train a classifier to predict the label, or use the embeddings within an existing ML model to encode free text features."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
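To make the baseline comparison in the regression notebook above concrete, here is a minimal, dependency-free sketch of the two metrics applied to a constant mean prediction. The helper names mirror scikit-learn's `mean_squared_error` and `mean_absolute_error` but are re-implemented here purely for illustration, and the four scores are made up rather than taken from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences between truth and prediction.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between truth and prediction.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative star ratings (assumed values, not from the reviews dataset).
scores = [5, 5, 4, 1]

# The "dummy" baseline predicts the mean score for every review.
mean_score = sum(scores) / len(scores)
baseline = [mean_score] * len(scores)

baseline_mae = mean_absolute_error(scores, baseline)
baseline_mse = mean_squared_error(scores, baseline)
print(f"baseline: mse={baseline_mse:.2f}, mae={baseline_mae:.2f}")
```

An embedding-based model is only informative if its errors sit clearly below those of a constant baseline like this one.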
examples/embeddings/Semantic_text_search_using_embeddings.ipynb
@@ -0,0 +1,185 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Semantic text search using embeddings\n",
+ "\n",
+ "We can search through all our reviews semantically in a very efficient manner and at very low cost, by simply embedding our search query, and then finding the most similar reviews. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "df['babbage_search'] = df.babbage_search.apply(eval).apply(np.array)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Remember to use the document embedding engine for documents (in this case reviews) and the query embedding engine for queries. Note that here we simply compute the cosine similarity between the query embedding and each document embedding, and show the top `n` matches."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
+ "\n",
+ "Good Buy: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!\n",
+ "\n",
+ "Fantastic Instant Refried beans: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "from utils import get_embedding, cosine_similarity\n",
+ "\n",
+ "# search through the reviews for a specific product\n",
+ "def search_reviews(df, product_description, n=3, pprint=True):\n",
+ "    embedding = get_embedding(product_description, engine='babbage-search-query')\n",
+ "    df['similarities'] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))\n",
+ "\n",
+ "    res = df.sort_values('similarities', ascending=False).head(n).combined.str.replace('Title: ','').str.replace('; Content:', ': ')\n",
+ "    if pprint:\n",
+ "        for r in res:\n",
+ "            print(r[:200])\n",
+ "            print()\n",
+ "    return res\n",
+ "res = search_reviews(df, 'delicious beans', n=3)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Rustichella ROCKS!: Anything this company makes is worthwhile eating! My favorite is their Trenne.<br />Their whole wheat pasta is the best I have ever had.\n",
+ "\n",
+ "sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
+ "\n",
+ "Wonderful: Came quickly. Was plentiful and delicious and cheaper than in the store. You will enjoy it if you like thick pasta.\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_reviews(df, 'whole wheat pasta', n=3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can search through these reviews easily. To speed up computation on larger datasets, we can use an approximate nearest-neighbor search algorithm, which is designed for faster search through embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "great product, poor delivery: The coffee is excellent and I am a repeat buyer. Problem this time was with the UPS delivery. They left the box in front of my garage door in the middle of the drivewa\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_reviews(df, 'bad delivery', n=1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As we can see, this can immediately deliver a lot of value. In this example, we were able to quickly find examples of delivery failures."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Extremely dissapointed: Hi,<br />I am very disappointed with the past shipment I received of the ONE coconut water. 3 of the boxes were leaking and the coconut water was spoiled.<br /><br />Thanks.<b\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_reviews(df, 'spoilt', n=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
+ "\n",
+ "A great deal on Greenies: Paid only $22 with free shipping for 96 teenies compared to about $35 at the pet store. How can you go wrong with a deal like that? The dog begs for his daily Greenie. Got \n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "res = search_reviews(df, 'pet food', n=2)"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
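The ranking step inside `search_reviews` above can be sketched without pandas or the API. The snippet below is a simplified illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for real embeddings, and `top_n` is a hypothetical helper name that does not appear in the notebook.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_embedding, documents, n=3):
    """Return the texts of the n documents most similar to the query.

    documents is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(doc[1], query_embedding),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]


# Toy document embeddings (assumed values, for illustration only).
docs = [
    ("coffee beans", [0.9, 0.1, 0.0]),
    ("pasta", [0.1, 0.9, 0.0]),
    ("cat food", [0.0, 0.1, 0.9]),
]

best = top_n([1.0, 0.2, 0.0], docs, n=2)
print(best)
```

This is an exhaustive comparison against every document, which is fine for 1k reviews; the cost grows linearly with the number of documents.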
examples/embeddings/User_and_product_embeddings.ipynb
@@ -0,0 +1,177 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## User and product embeddings\n",
+ "\n",
+ "We calculate user and product embeddings based on the training set, and evaluate the results on the unseen test set. We will evaluate the results by plotting the user and product similarity versus the review score. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Calculate user and product embeddings\n",
+ "\n",
+ "We calculate these embeddings simply by averaging the embeddings of all the reviews about the same product or written by the same user within the training set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(24502, 19035)"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_babbage_similarity_50k.csv', index_col=0)\n",
+ "df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
+ "X_train, X_test, y_train, y_test = train_test_split(df, df.Score, test_size = 0.2, random_state=42)\n",
+ "\n",
+ "user_embeddings = X_train.groupby('UserId').babbage_similarity.apply(np.mean)\n",
+ "prod_embeddings = X_train.groupby('ProductId').babbage_similarity.apply(np.mean)\n",
+ "len(user_embeddings), len(prod_embeddings)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that most of the users and products appear within the 50k examples only once."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Evaluate the embeddings\n",
+ "\n",
+ "To evaluate the recommendations, we look at the similarity of the user and product embeddings amongst the reviews in the unseen test set. We calculate the cosine similarity between the user and product embeddings, which gives us a score of at most 1, where higher values mean more similar embeddings. We then normalize the scores to be evenly spread between 0 and 1 by calculating the percentile of each similarity score amongst all predicted scores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from utils import cosine_similarity\n",
+ "\n",
+ "# evaluate embeddings as recommendations on X_test\n",
+ "def evaluate_single_match(row):\n",
+ "    user_id = row.UserId\n",
+ "    product_id = row.ProductId\n",
+ "    try:\n",
+ "        user_embedding = user_embeddings[user_id]\n",
+ "        product_embedding = prod_embeddings[product_id]\n",
+ "        similarity = cosine_similarity(user_embedding, product_embedding)\n",
+ "        return similarity\n",
+ "    except KeyError:\n",
+ "        # the user or the product has no reviews in the training set\n",
+ "        return np.nan\n",
+ "\n",
+ "X_test['cosine_similarity'] = X_test.apply(evaluate_single_match, axis=1)\n",
+ "X_test['percentile_cosine_similarity'] = X_test.cosine_similarity.rank(pct=True)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 2.1 Visualize cosine similarity by review score\n",
+ "\n",
+ "We group the cosine similarity scores by the review score, and plot the distribution of cosine similarity scores for each review score."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Correlation between user&vector similarity percentile metric and review number of stars (score): 22.11%\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import statsmodels.api as sm\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "correlation = X_test[['percentile_cosine_similarity', 'Score']].corr().values[0,1]\n",
+ "print('Correlation between user&vector similarity percentile metric and review number of stars (score): %.2f%%' % (100*correlation))\n",
+ "\n",
+ "\n",
+ "# boxplot of cosine similarity for each score\n",
+ "X_test.boxplot(column='percentile_cosine_similarity', by='Score')\n",
+ "plt.title('')\n",
+ "plt.show()\n",
+ "plt.close()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can observe a weak trend, showing that the higher the similarity score between the user and the product embedding, the higher the review score. Therefore, the user and product embeddings can weakly predict the review score - even before the user receives the product!\n",
+ "\n",
+ "Because this signal works in a different way than the more commonly used collaborative filtering, it can act as an additional feature to slightly improve the performance on existing problems."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
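The averaging step used for the user and product embeddings above can be illustrated with plain Python. This is a minimal sketch, assuming two-dimensional toy vectors and a hypothetical helper `mean_embeddings`; the notebook itself does this with `groupby(...).apply(np.mean)`.

```python
from collections import defaultdict


def mean_embeddings(rows, key):
    """Average the 'embedding' vectors of all rows that share rows[key]."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row["embedding"])
    return {
        group: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for group, vectors in grouped.items()
    }


# Toy review rows (assumed values, not real embeddings).
reviews = [
    {"UserId": "u1", "ProductId": "p1", "embedding": [1.0, 0.0]},
    {"UserId": "u1", "ProductId": "p2", "embedding": [0.0, 1.0]},
    {"UserId": "u2", "ProductId": "p1", "embedding": [0.0, 2.0]},
]

user_embeddings = mean_embeddings(reviews, "UserId")
prod_embeddings = mean_embeddings(reviews, "ProductId")
print(user_embeddings)
print(prod_embeddings)
```

A user or product that appears only once simply keeps the embedding of its single review, which is the common case in the 50k sample.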
examples/embeddings/utils.py
@@ -0,0 +1,94 @@
+import openai
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+
+from tenacity import retry, wait_random_exponential, stop_after_attempt
+from sklearn.metrics import precision_recall_curve
+from sklearn.metrics import average_precision_score
+
+
+@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
+def get_embedding(text, engine="davinci-similarity"):
+
+    # replace newlines, which can negatively affect performance.
+    text = text.replace("\n", " ")
+
+    return openai.Engine(id=engine).embeddings(input = [text])['data'][0]['embedding']
+
+
+def cosine_similarity(a, b):
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+
+def plot_multiclass_precision_recall(
+    y_score, y_true_untransformed, class_list, classifier_name
+):
+    """
+    Precision-Recall plotting for a multiclass problem. It plots average precision-recall, per class precision recall and reference f1 contours.
+
+    Code slightly modified, but heavily based on https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
+    """
+    n_classes = len(class_list)
+    y_true = pd.concat(
+        [(y_true_untransformed == class_list[i]) for i in range(n_classes)], axis=1
+    ).values
+
+    # For each class
+    precision = dict()
+    recall = dict()
+    average_precision = dict()
+    for i in range(n_classes):
+        precision[i], recall[i], _ = precision_recall_curve(y_true[:, i], y_score[:, i])
+        average_precision[i] = average_precision_score(y_true[:, i], y_score[:, i])
+
+    # A "micro-average": quantifying score on all classes jointly
+    precision["micro"], recall["micro"], _ = precision_recall_curve(
+        y_true.ravel(), y_score.ravel()
+    )
+    average_precision["micro"] = average_precision_score(
+        y_true, y_score, average="micro"
+    )
+    print(
+        str(classifier_name)
+        + " - Average precision score over all classes: {0:0.2f}".format(
+            average_precision["micro"]
+        )
+    )
+
+    # setup plot details
+    plt.figure(figsize=(9, 10))
+    f_scores = np.linspace(0.2, 0.8, num=4)
+    lines = []
+    labels = []
+    for f_score in f_scores:
+        x = np.linspace(0.01, 1)
+        y = f_score * x / (2 * x - f_score)
+        (l,) = plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.2)
+        plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y[45] + 0.02))
+
+    lines.append(l)
+    labels.append("iso-f1 curves")
+    (l,) = plt.plot(recall["micro"], precision["micro"], color="gold", lw=2)
+    lines.append(l)
+    labels.append(
+        "average Precision-recall (auprc = {0:0.2f})"
+        "".format(average_precision["micro"])
+    )
+
+    for i in range(n_classes):
+        (l,) = plt.plot(recall[i], precision[i], lw=2)
+        lines.append(l)
+        labels.append(
+            "Precision-recall for class `{0}` (auprc = {1:0.2f})"
+            "".format(class_list[i], average_precision[i])
+        )
+
+    fig = plt.gcf()
+    fig.subplots_adjust(bottom=0.25)
+    plt.xlim([0.0, 1.0])
+    plt.ylim([0.0, 1.05])
+    plt.xlabel("Recall")
+    plt.ylabel("Precision")
+    plt.title(f"{classifier_name}: Precision-Recall curve for each class")
+    plt.legend(lines, labels)
\ No newline at end of file
examples/embeddings/Visualize_in_2d.ipynb
@@ -0,0 +1,142 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Visualizing the embeddings in 2D\n",
+ "\n",
+ "We will use t-SNE to reduce the dimensionality of the embeddings from 2048 to 2. Once the embeddings are reduced to two dimensions, we can plot them in a 2D scatter plot. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Reduce dimensionality\n",
+ "\n",
+ "We reduce the dimensionality to 2 dimensions using t-SNE decomposition."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(1000, 2)"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.manifold import TSNE\n",
+ "\n",
+ "# Load the embeddings\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "\n",
+ "# Convert to a list of lists of floats\n",
+ "matrix = df.babbage_similarity.apply(eval).to_list()\n",
+ "\n",
+ "# Create a t-SNE model and transform the data\n",
+ "tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)\n",
+ "vis_dims = tsne.fit_transform(matrix)\n",
+ "vis_dims.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Plotting the embeddings\n",
+ "\n",
+ "We colour each review by its star rating, ranging from red to green."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can observe a decent data separation even in the reduced 2 dimensions. There seems to be a cluster of mostly negative reviews."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'Amazon ratings visualized in language using t-SNE')"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import matplotlib\n",
+ "import numpy as np\n",
+ "\n",
+ "colors = [\"red\", \"darkorange\", \"gold\", \"turquoise\", \"darkgreen\"]\n",
+ "x = [x for x,y in vis_dims]\n",
+ "y = [y for x,y in vis_dims]\n",
+ "color_indices = df.Score.values - 1\n",
+ "\n",
+ "colormap = matplotlib.colors.ListedColormap(colors)\n",
+ "plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)\n",
+ "for score in [0,1,2,3,4]:\n",
+ "    avg_x = np.array(x)[df.Score-1==score].mean()\n",
+ "    avg_y = np.array(y)[df.Score-1==score].mean()\n",
+ "    color = colors[score]\n",
+ "    plt.scatter(avg_x, avg_y, marker='x', color=color, s=100)\n",
+ "plt.title(\"Amazon ratings visualized in language using t-SNE\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
examples/embeddings/Zero-shot_classification.ipynb
@@ -0,0 +1,226 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Zero-shot classification using the embeddings\n",
+ "\n",
+ "In this notebook we will classify the sentiment of reviews using embeddings and zero labeled data! The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
+ "\n",
+ "We'll define positive sentiment to be 4 and 5-star reviews, and negative sentiment to be 1 and 2-star reviews. 3-star reviews are considered neutral and we won't use them for this example.\n",
+ "\n",
+ "We will perform zero-shot classification by embedding descriptions of each class and then comparing new samples to those class embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import classification_report, accuracy_score\n",
+ "\n",
+ "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
+ "df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
+ "df['babbage_search'] = df.babbage_search.apply(eval).apply(np.array)\n",
+ "\n",
+ "df= df[df.Score!=3]\n",
+ "df['sentiment'] = df.Score.replace({1:'negative', 2:'negative', 4:'positive', 5:'positive'})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Zero-Shot Classification\n",
+ "To perform zero-shot classification, we want to predict labels for our samples without any training. To do this, we can simply embed short descriptions of each label, such as positive and negative, and then compare the cosine similarity between the embeddings of samples and label descriptions.\n",
+ "\n",
+ "The label with the highest similarity to the sample input is the predicted label. We can also define a prediction score as the difference between the cosine similarity to the positive label and the cosine similarity to the negative label. This score can be used to plot a precision-recall curve, from which we can choose a different tradeoff between precision and recall by selecting a different threshold."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " negative 0.67 0.88 0.76 136\n",
+ " positive 0.98 0.93 0.95 789\n",
+ "\n",
+ " accuracy 0.92 925\n",
+ " macro avg 0.82 0.90 0.86 925\n",
+ "weighted avg 0.93 0.92 0.92 925\n",
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from utils import cosine_similarity, get_embedding\n",
+ "from sklearn.metrics import PrecisionRecallDisplay\n",
+ "\n",
+ "def evaluate_emeddings_approach(\n",
+ "    labels = ['negative', 'positive'], \n",
+ "    engine = 'babbage-similarity',\n",
+ "):\n",
+ "    label_embeddings = [get_embedding(label, engine=engine) for label in labels]\n",
+ "\n",
+ "    def label_score(review_embedding, label_embeddings):\n",
+ "        return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])\n",
+ "\n",
+ "    engine_col_name = engine.replace('-','_').replace('_query','')\n",
+ "    probas = df[engine_col_name].apply(lambda x: label_score(x, label_embeddings))\n",
+ "    preds = probas.apply(lambda x: 'positive' if x>0 else 'negative')\n",
+ "\n",
+ "    report = classification_report(df.sentiment, preds)\n",
+ "    print(report)\n",
+ "\n",
+ "    display = PrecisionRecallDisplay.from_predictions(df.sentiment, probas, pos_label='positive')\n",
+ "    _ = display.ax_.set_title(\"2-class Precision-Recall curve\")\n",
+ "\n",
+ "evaluate_emeddings_approach(labels=['negative', 'positive'], engine='babbage-similarity')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that this classifier already performs extremely well. We used similarity embeddings, and the simplest possible label name. Let's try to improve on this by using more descriptive label names, and search embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " negative 0.65 0.93 0.76 136\n",
+ " positive 0.99 0.91 0.95 789\n",
+ "\n",
+ " accuracy 0.92 925\n",
+ " macro avg 0.82 0.92 0.86 925\n",
+ "weighted avg 0.94 0.92 0.92 925\n",
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "evaluate_emeddings_approach(labels=['An Amazon review with a negative sentiment.', 'An Amazon review with a positive sentiment.'], engine='babbage-similarity')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using the search embeddings and descriptive names leads to an additional improvement in performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " negative 0.77 0.79 0.78 136\n",
+ " positive 0.96 0.96 0.96 789\n",
+ "\n",
+ " accuracy 0.94 925\n",
+ " macro avg 0.87 0.88 0.87 925\n",
+ "weighted avg 0.94 0.94 0.94 925\n",
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "evaluate_emeddings_approach(labels=['An Amazon review with a negative sentiment.', 'An Amazon review with a positive sentiment.'], engine='babbage-search-query')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As shown above, zero-shot classification with embeddings can lead to great results, especially when the labels are more descriptive than just simple words."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
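The zero-shot decision rule from the notebook above (similarity to the positive label minus similarity to the negative label) can be sketched with toy vectors. This is an illustration under stated assumptions: the two-dimensional vectors stand in for real label and review embeddings, and `zero_shot_label` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(review_embedding, negative_embedding, positive_embedding):
    """Return (label, score); a score above 0 means closer to the positive label."""
    score = cosine_similarity(review_embedding, positive_embedding) - cosine_similarity(
        review_embedding, negative_embedding
    )
    return ("positive" if score > 0 else "negative"), score


# Toy label embeddings (assumed values, for illustration only).
negative, positive = [1.0, 0.0], [0.0, 1.0]

label_a, score_a = zero_shot_label([0.2, 0.9], negative, positive)
label_b, score_b = zero_shot_label([0.8, 0.3], negative, positive)
print(label_a, round(score_a, 2))
print(label_b, round(score_b, 2))
```

Because the score is continuous, moving the decision threshold away from 0 trades precision against recall, which is what the precision-recall curves in the notebook visualize.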
examples/finetuning/answers-with-ft.py → examples/finetuning/answers_with_ft.py
@@ -67,8 +67,14 @@ def answer_question(
print("Context:\n" + context)
print("\n\n")
try:
+ # fine-tuned models require the model parameter, whereas other models require the engine parameter
+ model_param = (
+ {"model": fine_tuned_qa_model}
+ if ":" in fine_tuned_qa_model
+ and fine_tuned_qa_model.split(":")[1].startswith("ft")
+ else {"engine": fine_tuned_qa_model}
+ )
response = openai.Completion.create(
- model=fine_tuned_qa_model,
prompt=f"Answer the question based on the context below\n\nText: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
temperature=0,
max_tokens=max_tokens,
@@ -76,6 +82,7 @@ def answer_question(
frequency_penalty=0,
presence_penalty=0,
stop=stop_sequence,
+ **model_param,
)
return response["choices"][0]["text"]
except Exception as e:
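The parameter selection added in this hunk can be checked in isolation. The sketch below wraps the same condition in a hypothetical helper, `completion_model_param`, which is not part of the commit; the fine-tuned model id is made up for illustration.

```python
def completion_model_param(model_name):
    """Pick the keyword argument for the completion call (hypothetical helper).

    Names whose second colon-separated part starts with "ft" are treated as
    fine-tuned models and passed as `model`; everything else is passed as `engine`.
    """
    is_fine_tuned = ":" in model_name and model_name.split(":")[1].startswith("ft")
    return {"model": model_name} if is_fine_tuned else {"engine": model_name}


# Made-up fine-tuned model id and a base engine name.
fine_tuned = completion_model_param("curie:ft-example-org-2021-12-01")
base = completion_model_param("davinci")
print(fine_tuned)
print(base)
```

The resulting dictionary can then be unpacked into the call with `**`, as the hunk does with `**model_param`.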
examples/finetuning/olympics-1-collect-data.ipynb
@@ -0,0 +1,513 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 1. Collect Wikipedia data about Olympic Games 2020\n",
+ "\n",
+ "The idea of this project is to create a question answering model, based on a few paragraphs of provided text. Base GPT-3 models do a good job at answering questions when the answer is contained within the paragraph; however, if the answer isn't contained, the base models tend to try their best to answer anyway, often leading to confabulated answers.\n",
+ "\n",
+ "To create a model which answers questions only if there is sufficient context for doing so, we first create a dataset of questions and answers based on paragraphs of text. In order to train the model to answer only when the answer is present, we also add adversarial examples, where the question doesn't match the context. In those cases, we ask the model to output \"No sufficient context for answering the question\". \n",
+ "\n",
+ "We will perform this task in three notebooks:\n",
+ "1. The first (this) notebook focuses on collecting recent data, which GPT-3 didn't see during its pre-training. We picked the topic of Olympic Games 2020 (which actually took place in the summer of 2021), and downloaded 713 unique pages. We organized the dataset by individual sections, which will serve as context for asking and answering the questions.\n",
+ "2. The [second notebook](olympics-2-create-qa.ipynb) will utilize Davinci-instruct to ask a few questions based on a Wikipedia section, as well as answer those questions, based on that section.\n",
+ "3. The [third notebook](olympics-3-train-qa.ipynb) will utilize the dataset of context, question and answer pairs to additionally create adversarial questions and context pairs, where the question was not generated on that context. In those cases the model will be prompted to answer \"No sufficient context for answering the question\". We will also train a discriminator model, which predicts whether the question can be answered based on the context or not."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1.1 Data extraction using the wikipedia API\n",
+ "Extracting the data will take about half an hour, and processing will likely take about as much."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "909"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import wikipedia\n",
+ "\n",
+ "\n",
+ "def filter_olympic_2020_titles(titles):\n",
+ "    \"\"\"\n",
+ "    Get the titles which are related to Olympic games hosted in 2020, given a list of titles\n",
+ "    \"\"\"\n",
+ "    titles = [title for title in titles if '2020' in title and 'olympi' in title.lower()]\n",
+ "\n",
+ "    return titles\n",
+ "\n",
+ "def get_wiki_page(title):\n",
+ "    \"\"\"\n",
+ "    Get the wikipedia page given a title\n",
+ "    \"\"\"\n",
+ "    try:\n",
+ "        return wikipedia.page(title)\n",
+ "    except wikipedia.exceptions.DisambiguationError as e:\n",
+ "        # fall back to the first option offered by the disambiguation page\n",
+ "        return wikipedia.page(e.options[0])\n",
+ "    except wikipedia.exceptions.PageError as e:\n",
+ "        return None\n",
+ "\n",
+ "def recursively_find_all_pages(titles, titles_so_far=None):\n",
+ "    \"\"\"\n",
+ "    Recursively find all the pages that are linked to the Wikipedia titles in the list\n",
+ "    \"\"\"\n",
+ "    # avoid a mutable default argument, which would be shared between separate calls\n",
+ "    if titles_so_far is None:\n",
+ "        titles_so_far = set()\n",
+ "    all_pages = []\n",
+ "\n",
+ "    titles = list(set(titles) - titles_so_far)\n",
+ "    titles = filter_olympic_2020_titles(titles)\n",
+ "    titles_so_far.update(titles)\n",
+ "    for title in titles:\n",
+ "        page = get_wiki_page(title)\n",
+ "        if page is None:\n",
+ "            continue\n",
+ "        all_pages.append(page)\n",
+ "\n",
+ "        new_pages = recursively_find_all_pages(page.links, titles_so_far)\n",
+ "        for pg in new_pages:\n",
+ "            if pg.title not in [p.title for p in all_pages]:\n",
+ "                all_pages.append(pg)\n",
+ "        titles_so_far.update(page.links)\n",
+ "    return all_pages\n",
+ "\n",
+ "\n",
+ "pages = recursively_find_all_pages([\"2020 Summer Olympics\"])\n",
+ "len(pages)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1.2 Filtering the Wikipedia pages and splitting them into sections by headings\n",
+ "We remove sections unlikely to contain textual information, and ensure that each section is not longer than the token limit"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "('Bermuda at the 2020 Summer Olympics',\n",
+ " 'Equestrian',\n",
+ " \"Bermuda entered one dressage rider into the Olympic competition by finishing in the top four, outside the group selection, of the individual FEI Olympic Rankings for Groups D and E (North, Central, and South America), marking the country's recurrence to the sport after an eight-year absence. The quota was later withdrawn, following an injury of Annabelle Collins' main horse Joyero and a failure to obtain minimum eligibility requirements (MER) aboard a new horse Chuppy Checker.\",\n",
+ " 104)"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\n",
+ "import re\n",
+ "from typing import Set\n",
+ "from transformers import GPT2TokenizerFast\n",
+ "\n",
+ "import numpy as np\n",
+ "from nltk.tokenize import sent_tokenize\n",
+ "\n",
+ "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
+ "\n",
+ "def count_tokens(text: str) -> int:\n",
+ "    \"\"\"count the number of tokens in a string\"\"\"\n",
+ "    return len(tokenizer.encode(text))\n",
+ "\n",
+ "def reduce_long(\n",
+ "    long_text: str, long_text_tokens: int = 0, max_len: int = 590\n",
+ ") -> str:\n",
+ "    \"\"\"\n",
+ "    Reduce a long text to a maximum of `max_len` tokens by potentially cutting at a sentence end\n",
+ "    \"\"\"\n",
+ "    # count the tokens only if the caller did not supply a precomputed count\n",
+ "    if not long_text_tokens:\n",
+ "        long_text_tokens = count_tokens(long_text)\n",
+ "    if long_text_tokens > max_len:\n",
+ "        sentences = sent_tokenize(long_text.replace(\"\\n\", \" \"))\n",
+ "        ntokens = 0\n",
+ "        for i, sentence in enumerate(sentences):\n",
+ "            ntokens += 1 + count_tokens(sentence)\n",
+ "            if ntokens > max_len:\n",
+ "                return \". \".join(sentences[:i][:-1]) + \".\"\n",
+ "\n",
+ "    return long_text\n",
+ "\n",
+ "discard_categories = ['See also', 'References', 'External links', 'Further reading', \"Footnotes\",\n",
+ "    \"Bibliography\", \"Sources\", \"Citations\", \"Literature\", \"Footnotes\", \"Notes and references\",\n",
+ "    \"Photo gallery\", \"Works cited\", \"Photos\", \"Gallery\", \"Notes\", \"References and sources\",\n",
+ "    \"References and notes\",]\n",
+ "\n",
+ "\n",
+ "def extract_sections(\n",
+ "    wiki_text: str,\n",
+ "    title: str,\n",
+ "    max_len: int = 1500,\n",
+ "    discard_categories: Set[str] = discard_categories,\n",
+ ") -> list:\n",
+ "    \"\"\"\n",
+ "    Extract the sections of a Wikipedia page, discarding the references and other low information sections\n",
+ "    \"\"\"\n",
+ "    if len(wiki_text) == 0:\n",
+ "        return []\n",
+ "\n",
+ "    # find all headings and the corresponding contents\n",
+ "    headings = re.findall(\"==+ .* ==+\", wiki_text)\n",
+ "    for heading in headings:\n",
+ "        wiki_text = wiki_text.replace(heading, \"==+ !! ==+\")\n",
+ "    contents = wiki_text.split(\"==+ !! ==+\")\n",
+ "    contents = [c.strip() for c in contents]\n",
+ "    assert len(headings) == len(contents) - 1\n",
+ "\n",
+ "    cont = contents.pop(0).strip()\n",
+ "    outputs = [(title, \"Summary\", cont, count_tokens(cont)+4)]\n",
+ "\n",
+ "    # discard the discard categories, accounting for a tree structure\n",
+ "    max_level = 100\n",
+ "    keep_group_level = max_level\n",
+ "    remove_group_level = max_level\n",
+ "    nheadings, ncontents = [], []\n",
+ "    for heading, content in zip(headings, contents):\n",
+ "        plain_heading = \" \".join(heading.split(\" \")[1:-1])\n",
+ "        num_equals = len(heading.split(\" \")[0])\n",
+ "        if num_equals <= keep_group_level:\n",
+ "            keep_group_level = max_level\n",
+ "\n",
+ "        if num_equals > remove_group_level:\n",
+ "            if (\n",
+ "                num_equals <= keep_group_level\n",
+ "            ):\n",
+ "                continue\n",
+ "        keep_group_level = max_level\n",
+ "        if plain_heading in discard_categories:\n",
+ "            remove_group_level = num_equals\n",
+ "            keep_group_level = max_level\n",
+ "            continue\n",
+ "        nheadings.append(heading.replace(\"=\", \"\").strip())\n",
+ "        ncontents.append(content)\n",
+ "        remove_group_level = max_level\n",
+ "\n",
+ "    # count the tokens of each section\n",
+ "    ncontent_ntokens = [\n",
+ "        count_tokens(c)\n",
+ "        + 3\n",
+ "        + count_tokens(\" \".join(h.split(\" \")[1:-1]))\n",
+ "        - (1 if len(c) == 0 else 0)\n",
+ "        for h, c in zip(nheadings, ncontents)\n",
+ "    ]\n",
+ "\n",
+ "    # Create a tuple of (title, section_name, content, number of tokens)\n",
+ "    # max_len is passed by keyword so it is not mistaken for `long_text_tokens`\n",
+ "    outputs += [(title, h, c, t) if t<max_len \n",
+ "                else (title, h, reduce_long(c, max_len=max_len), count_tokens(reduce_long(c, max_len=max_len))) \n",
+ "                    for h, c, t in zip(nheadings, ncontents, ncontent_ntokens)]\n",
+ "\n",
+ "    return outputs\n",
+ "\n",
+ "# Example page being processed into sections\n",
+ "bermuda_page = get_wiki_page('Bermuda at the 2020 Summer Olympics')\n",
+ "ber = extract_sections(bermuda_page.content, bermuda_page.title)\n",
+ "\n",
+ "# Example section\n",
+ "ber[-1]\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.2.1 We create a dataset and filter out any sections with fewer than 40 tokens, as those are unlikely to contain enough context to ask a good question."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Token indices sequence length is longer than the specified maximum sequence length for this model (1060 > 1024). Running this sequence through the model will result in indexing errors\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>title</th>\n",
+ " <th>heading</th>\n",
+ " <th>content</th>\n",
+ " <th>tokens</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Summary</td>\n",
+ " <td>The 2020 Summer Olympics (Japanese: 2020年夏季オリン...</td>\n",
+ " <td>713</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Host city selection</td>\n",
+ " <td>The International Olympic Committee (IOC) vote...</td>\n",
+ " <td>126</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Impact of the COVID-19 pandemic</td>\n",
+ " <td>In January 2020, concerns were raised about th...</td>\n",
+ " <td>369</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Qualifying event cancellation and postponement</td>\n",
+ " <td>Concerns about the pandemic began to affect qu...</td>\n",
+ " <td>298</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Effect on doping tests</td>\n",
+ " <td>Mandatory doping tests were being severely res...</td>\n",
+ " <td>163</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " title heading \\\n",
+ "0 2020 Summer Olympics Summary \n",
+ "1 2020 Summer Olympics Host city selection \n",
+ "2 2020 Summer Olympics Impact of the COVID-19 pandemic \n",
+ "3 2020 Summer Olympics Qualifying event cancellation and postponement \n",
+ "4 2020 Summer Olympics Effect on doping tests \n",
+ "\n",
+ " content tokens \n",
+ "0 The 2020 Summer Olympics (Japanese: 2020年夏季オリン... 713 \n",
+ "1 The International Olympic Committee (IOC) vote... 126 \n",
+ "2 In January 2020, concerns were raised about th... 369 \n",
+ "3 Concerns about the pandemic began to affect qu... 298 \n",
+ "4 Mandatory doping tests were being severely res... 163 "
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "res = []\n",
+ "for page in pages:\n",
+ " res += extract_sections(page.content, page.title)\n",
+ "df = pd.DataFrame(res, columns=[\"title\", \"heading\", \"content\", \"tokens\"])\n",
+ "df = df[df.tokens>40]\n",
+ "df = df.drop_duplicates(['title','heading'])\n",
+ "df = df.reset_index().drop('index',axis=1) # reset index\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Save the section dataset\n",
+ "We will save the section dataset, for the [next notebook](olympics-2-create-qa.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.to_csv('olympics-data/olympics_sections.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1.3 (Optional) Exploring the data "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Concerns and controversies at the 2020 Summer Olympics 51\n",
+ "United States at the 2020 Summer Olympics 46\n",
+ "Great Britain at the 2020 Summer Olympics 42\n",
+ "Canada at the 2020 Summer Olympics 39\n",
+ "Olympic Games 39\n",
+ "Name: title, dtype: int64"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.title.value_counts().head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The dataset appears to contain pages about both Winter and Summer Olympics. We chose to leave a little ambiguity and noise in the dataset, even though we were only interested in the 2020 Summer Olympics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True 3567\n",
+ "False 305\n",
+ "Name: title, dtype: int64"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.title.str.contains('Summer').value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "False 3774\n",
+ "True 98\n",
+ "Name: title, dtype: int64"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.title.str.contains('Winter').value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "from matplotlib import pyplot as plt\n",
+ "\n",
+ "df = pd.read_csv('olympics-data/olympics_sections.csv')\n",
+ "df[['tokens']].hist()\n",
+ "# add axis descriptions and title\n",
+ "plt.xlabel('Number of tokens')\n",
+ "plt.ylabel('Number of Wikipedia sections')\n",
+ "plt.title('Distribution of number of tokens in Wikipedia sections')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the majority of sections are fairly short (fewer than 500 tokens)."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
examples/finetuning/olympics-2-create-qa.ipynb
@@ -0,0 +1,751 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 2. Creating a synthetic Q&A dataset\n",
+ "We use [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta), a model specialized in following instructions, to create questions based on the given context. Then we also use [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta) to answer those questions, given the same context. \n",
+ "\n",
+ "This is expensive, and will also take a long time, as we call the davinci engine for each section. You can simply download the final dataset instead.\n",
+ "\n",
+ "We're using the dataset created using the [previous notebook](olympics-1-collect-data.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.1 Read in the data, and create a context\n",
+ "Create a context by concatenating the title, the heading and the content of that section"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>title</th>\n",
+ " <th>heading</th>\n",
+ " <th>content</th>\n",
+ " <th>tokens</th>\n",
+ " <th>context</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Summary</td>\n",
+ " <td>The 2020 Summer Olympics (Japanese: 2020年夏季オリン...</td>\n",
+ " <td>713</td>\n",
+ " <td>2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Host city selection</td>\n",
+ " <td>The International Olympic Committee (IOC) vote...</td>\n",
+ " <td>126</td>\n",
+ " <td>2020 Summer Olympics\\nHost city selection\\n\\nT...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Impact of the COVID-19 pandemic</td>\n",
+ " <td>In January 2020, concerns were raised about th...</td>\n",
+ " <td>369</td>\n",
+ " <td>2020 Summer Olympics\\nImpact of the COVID-19 p...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Qualifying event cancellation and postponement</td>\n",
+ " <td>Concerns about the pandemic began to affect qu...</td>\n",
+ " <td>298</td>\n",
+ " <td>2020 Summer Olympics\\nQualifying event cancell...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Effect on doping tests</td>\n",
+ " <td>Mandatory doping tests were being severely res...</td>\n",
+ " <td>163</td>\n",
+ " <td>2020 Summer Olympics\\nEffect on doping tests\\n...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " title heading \\\n",
+ "0 2020 Summer Olympics Summary \n",
+ "1 2020 Summer Olympics Host city selection \n",
+ "2 2020 Summer Olympics Impact of the COVID-19 pandemic \n",
+ "3 2020 Summer Olympics Qualifying event cancellation and postponement \n",
+ "4 2020 Summer Olympics Effect on doping tests \n",
+ "\n",
+ " content tokens \\\n",
+ "0 The 2020 Summer Olympics (Japanese: 2020年夏季オリン... 713 \n",
+ "1 The International Olympic Committee (IOC) vote... 126 \n",
+ "2 In January 2020, concerns were raised about th... 369 \n",
+ "3 Concerns about the pandemic began to affect qu... 298 \n",
+ "4 Mandatory doping tests were being severely res... 163 \n",
+ "\n",
+ " context \n",
+ "0 2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ... \n",
+ "1 2020 Summer Olympics\\nHost city selection\\n\\nT... \n",
+ "2 2020 Summer Olympics\\nImpact of the COVID-19 p... \n",
+ "3 2020 Summer Olympics\\nQualifying event cancell... \n",
+ "4 2020 Summer Olympics\\nEffect on doping tests\\n... "
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "df = pd.read_csv('olympics-data/olympics_sections.csv')\n",
+ "df['context'] = df.title + \"\\n\" + df.heading + \"\\n\\n\" + df.content\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.2 Create questions based on the context\n",
+ "Use davinci-instruct to generate a number of plausible questions relating to the Wikipedia section contents.\n",
+ "\n",
+ "Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get a higher diversity of questions.\n",
+ "\n",
+ "<span style=\"color:orange\">**WARNING: This step will last a long time, and consume a lot of tokens, as it calls davinci-instruct for every section to generate a number of questions.**</span>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1. What is the 2020 Summer Olympics?\n",
+ "2. When did the 2020 Summer Olympics take place?\n",
+ "3. Who won the most medals at the 2020 Summer Olympics?\n",
+ "4. Who won the most gold medals at the 2020 Summer Olympics?\n",
+ "5. Who won the most medals at the 2020 Summer Olympics?\n"
+ ]
+ }
+ ],
+ "source": [
+ "import openai\n",
+ "\n",
+ "def get_questions(context):\n",
+ "    try:\n",
+ "        response = openai.Completion.create(\n",
+ "            engine=\"davinci-instruct-beta-v2\",\n",
+ "            prompt=f\"Write questions based on the text below\\n\\nText: {context}\\n\\nQuestions:\\n1.\",\n",
+ "            temperature=0,\n",
+ "            max_tokens=257,\n",
+ "            top_p=1,\n",
+ "            frequency_penalty=0,\n",
+ "            presence_penalty=0,\n",
+ "            stop=[\"\\n\\n\"]\n",
+ "        )\n",
+ "        return response['choices'][0]['text']\n",
+ "    except Exception:\n",
+ "        # return an empty string if the API call fails for this section\n",
+ "        return \"\"\n",
+ "\n",
+ "\n",
+ "df['questions']= df.context.apply(get_questions)\n",
+ "df['questions'] = \"1.\" + df.questions\n",
+ "print(df[['questions']].values[0][0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The prompt is designed to generate a number of questions. Example questions above were generated based on the summary section of the 2020 Summer Olympics page.\n",
+ "\n",
+ "We can observe that questions 3 and 5 above are identical. Sometimes the generated questions could be ambiguous without the context. We will show that, despite these limitations, we can create a successful model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.\n",
+ "Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 name for marketing and branding purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic. The Summer Paralympics were held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.The 2020 Games were the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter) and Nagano 1998 (Winter) games. Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games were the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea and preceding the 2022 Winter Olympics in Beijing, China.\n",
+ "New events were introduced in existing sports for 2020, including 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allowed the host organizing committee to add new sports to the Olympic program for just one Games. The disciplines added by the Japanese Olympic Committee were baseball and softball, karate, sport climbing, surfing and skateboarding, the last four of which made their Olympic debuts, and the last three of which will remain on the Olympic program.The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88). Host nation Japan finished third, setting a record for the most gold medals and total medals ever won by their delegation at an Olympic Games with 27 and 58. Great Britain finished fourth, with a total of 22 gold and 65 medals, becoming the first nation at the Summer Olympics to increase or equal their total medals won in the two Games subsequent to hosting them. The Russian delegation competing as the ROC (not to be confused with the Republic of China (Taiwan) which competed as Chinese Taipei, not ROC) finished fifth with 20 gold medals and third in the overall medal count, with 71 medals. Bermuda, the Philippines and Qatar won their first-ever Olympic gold medals. Burkina Faso, San Marino and Turkmenistan won their first-ever Olympic medals.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(df.content.values[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.3 Create answers based on the context\n",
+ "Use davinci-instruct to answer the questions given the relevant Wikipedia section contents\n",
+ "\n",
+ "Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get a higher diversity of answers.\n",
+ "\n",
+ "<span style=\"color:orange\">**WARNING: This step will last a long time, and consume a lot of tokens, as it calls davinci-instruct for every section to answer all the questions.**</span>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.\n",
+ "2. The 2020 Summer Olympics took place from 23 July to 8 August 2021.\n",
+ "3. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n",
+ "4. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n",
+ "5. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n"
+ ]
+ }
+ ],
+ "source": [
+ "def get_answers(row):\n",
+ "    try:\n",
+ "        response = openai.Completion.create(\n",
+ "            engine=\"davinci-instruct-beta-v2\",\n",
+ "            prompt=f\"Write questions based on the text below\\n\\nText: {row.context}\\n\\nQuestions:\\n{row.questions}\\n\\nAnswers:\\n1.\",\n",
+ "            temperature=0,\n",
+ "            max_tokens=257,\n",
+ "            top_p=1,\n",
+ "            frequency_penalty=0,\n",
+ "            presence_penalty=0\n",
+ "        )\n",
+ "        return response['choices'][0]['text']\n",
+ "    except Exception as e:\n",
+ "        # report the error and return an empty string for this row\n",
+ "        print(e)\n",
+ "        return \"\"\n",
+ "\n",
+ "\n",
+ "df['answers']= df.apply(get_answers, axis=1)\n",
+ "df['answers'] = \"1.\" + df.answers\n",
+ "df = df.dropna().reset_index().drop('index',axis=1)\n",
+ "print(df[['answers']].values[0][0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These are the answers to the questions above, based on the summary section of the 2020 Summer Olympics page. \n",
+ "\n",
+ "We can see that answers 3-5 contain the correct answer, but instead of answering the question directly, each one is a verbatim extraction from the context. Despite these occasional lower quality answers, we will show that the model can learn the task reasonably well, given a high number of examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.4 Save the Olympics Q&A dataset based on Wikipedia sections\n",
+ "We save the file for use in the [next notebook](olympics-3-train-qa.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.to_csv('olympics-data/olympics_qa.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.5 Search file\n",
+ "We create a search file ([API reference](https://beta.openai.com/docs/api-reference/files/list)), which can be used to retrieve the relevant context when a question is asked.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = df[df.tokens<2000]\n",
+ "df[['context', 'tokens']].rename(columns={'context':'text','tokens':'metadata'}).to_json('olympics-data/olympics_search.jsonl', orient='records', lines=True)\n",
+ "\n",
+ "search_file = openai.File.create(\n",
+ " file=open(\"olympics-data/olympics_search.jsonl\"),\n",
+ " purpose='search'\n",
+ ")\n",
+ "olympics_search_fileid = search_file['id']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.6 Answer questions based on the context provided\n",
+ "\n",
+ "We will use a simple implementation of the answers endpoint. This works by using the [/search endpoint](https://beta.openai.com/docs/api-reference/searches), which searches over an indexed file to obtain the relevant sections that can be included in the context, followed by a question-answering prompt given to a specified model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay\n",
+ "Summary\n",
+ "\n",
+ "The women's 4 × 100 metres relay event at the 2020 Summer Olympics took place on 5 and 6 August 2021 at the Japan National Stadium. There were 16 competing relay teams, with each team having 5 members from which 4 were selected in each round.\n",
+ "\n",
+ "###\n",
+ "\n",
+ "Athletics at the 2020 Summer Olympics – Men's 4 × 100 metres relay\n",
+ "Qualification\n",
+ "\n",
+ "National Olympic Committees (NOCs) could qualify one relay team in one of three following ways:\n",
+ "The top 8 NOCs at the 2019 World Athletics Championships qualified a relay team.\n",
+ "The top 8 NOCs at the 2021 World Athletics Relays qualified a relay team.\n",
+ "Where an NOC placed in the top 8 at both the 2019 World Championships and the 2021 World Relays, the quota place was allocated to the world top list as of 29 June 2021. In this case, 4 teams did so, so there are 4 places available through the world rankings.A total of five athletes may be entered for a relay team. Should a NOC have also entered individual athletes in the corresponding individual event (100 m), the entered individual athletes must be included in the total of five (5) athletes entered for the relay event. In addition of five, NOCs can nominate a maximum of one alternate athlete for each team.\n",
+ "The qualifying period was originally from 1 May 2019 to 29 June 2020. Due to the COVID-19 pandemic, the period was suspended from 6 April 2020 to 30 November 2020, with the end date extended to 29 June 2021. The qualifying time standards could be obtained in various meets during the given period that have the approval of the IAAF. Both indoor and outdoor meets are eligible. The most recent Area Championships may be counted in the ranking, even if not during the qualifying period.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from answers_with_ft import create_context, answer_question\n",
+ "print(create_context(\"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?\", olympics_search_fileid, max_len=400))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' Japan National Stadium'"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "answer_question(olympics_search_fileid, \"davinci-instruct-beta-v2\", \n",
+ " \"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "After we fine-tune the model for Q&A we'll be able to use it instead of [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta), to obtain better answers when the question can't be answered based on the context. We see a downside of [`davinci-instruct-beta-v2`](https://beta.openai.com/docs/engines/instruct-series-beta), which always attempts to answer the question, regardless of whether the relevant context is present. (Note the second question is asking about a future event, set in 2048.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' Japan National Stadium'"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "answer_question(olympics_search_fileid, \"davinci-instruct-beta-v2\", \n",
+ " \"Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?\", max_len=1000)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that davinci tends to answer the question even when it can't be answered from the context provided. Note that the question asks about the 2048 Summer Olympics, which have not happened yet, while the search only returned content about the 2020 Games."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2.7 (Optional) Investigation into how likely the search endpoint is to return the relevant context"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(0, 58)\n"
+ ]
+ }
+ ],
+ "source": [
+ "def check_context(title, heading, question, max_len=1800, search_model='ada', max_rerank=10):\n",
+ "    \"\"\"\n",
+ "    Evaluate the performance of the search model in retrieving the correct context\n",
+ "\n",
+ "    Parameters\n",
+ "    ----------\n",
+ "    title: str\n",
+ "        The title of the Wikipedia page\n",
+ "    heading: str\n",
+ "        The heading of the Wikipedia section\n",
+ "    question: str\n",
+ "        The question\n",
+ "    max_len: int\n",
+ "        The maximum length of the context\n",
+ "    search_model: str\n",
+ "        The search model to use - `ada` is most cost effective\n",
+ "    max_rerank: int\n",
+ "        The maximum number of reranking documents to use the search model on\n",
+ "\n",
+ "    Returns\n",
+ "    -------\n",
+ "    rank: int\n",
+ "        The rank of the correct context\n",
+ "    token_length: int\n",
+ "        The number of tokens needed to obtain the correct context\n",
+ "    \"\"\"\n",
+ "    \n",
+ "    try:\n",
+ "        results = openai.Engine(search_model).search(\n",
+ "            search_model=search_model, \n",
+ "            query=question, \n",
+ "            max_rerank=max_rerank,\n",
+ "            file=olympics_search_fileid,\n",
+ "            return_metadata=True\n",
+ "        )\n",
+ "        index=-1\n",
+ "        returns = []\n",
+ "        cur_len = 0\n",
+ "        for result in results['data']:\n",
+ "            cur_len += int(result['metadata']) + 4 # we add 4 tokens for the separator `\\n\\n###\\n\\n`\n",
+ "            if cur_len > max_len:\n",
+ "                break\n",
+ "            returns.append(result['text'])\n",
+ "            res = result['text'].split('\\n')\n",
+ "            if res[0] == title and res[1] == heading:\n",
+ "                index = len(returns) - 1\n",
+ "                break\n",
+ "        return index, cur_len\n",
+ "    except Exception as e:\n",
+ "        #print (e)\n",
+ "        return []  # an empty list marks a processing error (mapped to -2 later on)\n",
+ "print(check_context(\"Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay\", \"Summary\", \"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?\", max_len=10000))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We utilize the generated questions based on context to estimate how often we can retrieve the original context. These questions are noisy, so this is not a perfect estimate.\n",
+ "\n",
+ "Our questions and answers are prefixed with numbered bullet points, however due to the way they were generated, they are missing the first number, hence we add \"1.\" to the list of questions (and answers).\n",
+ "\n",
+ "We calculate the rank of the section retrieved using ada search, and the number of tokens in the context needed to retrieve the relevant section in full."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 [(132, 27104), (-1, 22939), (8, 2151), (2, 121...\n",
+ "1 [(4, 1737), (0, 130), (8, 744), (96, 17208), (...\n",
+ "2 [(0, 373), (0, 373), (-1, 40610), (1, 570)]\n",
+ "3 [(0, 302), (0, 302), (5, 968), (8, 1425)]\n",
+ "4 [(0, 167), (0, 167), (2, 1442)]\n",
+ "Name: ada, dtype: object"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ada_results = df.apply(lambda x: [\n",
+ " check_context( x.title, \n",
+ " x.heading, \n",
+ " q[3:], # remove the number prefix\n",
+ " max_len=1000000, # set a large number to get the full context \n",
+ " search_model='ada', \n",
+ " max_rerank=200,\n",
+ " ) \n",
+ " for q in (x.questions).split('\\n') # split the questions\n",
+ " if len(q) >10 # remove the empty questions\n",
+ " ], axis=1)\n",
+ "ada_results.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "out = pd.concat([ada_results], axis=1)\n",
+ "out.columns = ['ada']\n",
+ "out.to_csv('olympics-data/search_engine_results.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def expand_lists(out):\n",
+ "    \"\"\"\n",
+ "    Expand a pandas series containing lists into a series, where each list element becomes a value on its own\n",
+ "\n",
+ "    Input is a row per paragraph, which has multiple questions\n",
+ "    Output is a row per question\n",
+ "    \"\"\"\n",
+ "    cols = [pd.DataFrame(out[name].tolist()).stack().reset_index(level=1, drop=True).rename(name) for name in out.columns]\n",
+ "    return pd.concat(cols, axis=1)\n",
+ "\n",
+ "out_expanded = expand_lists(out)\n",
+ "out_expanded['rank'] = out_expanded.ada.apply(lambda x: x[0] if x != [] else -2)\n",
+ "out_expanded['tokens'] = out_expanded.ada.apply(lambda x: x[1] if x != [] else -2)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "74.3% of relevant paragraphs are retrieved within the first 2k tokens\n"
+ ]
+ }
+ ],
+ "source": [
+ "within_2k = (out_expanded.tokens < 2000).mean()\n",
+ "print(f\"{within_2k*100:.1f}% of relevant paragraphs are retrieved within the first 2k tokens\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The relevant context is retrieved within the first 2k tokens 74% of the time on this dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "7.4% of relevant paragraphs are not retrieved within the first 200 results\n"
+ ]
+ }
+ ],
+ "source": [
+ "outside_200 = (out_expanded['rank'] == -1).mean()\n",
+ "print(f\"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the remaining 25.7% of cases the relevant context is not retrieved within the first 2k tokens:\n",
+ "- 7.4% of the time, the keyword search part of the search algorithm does not retrieve the relevant context within the first 200 results.\n",
+ "- 18.3% of the time, the semantic search does not place the relevant context within the first 2000 tokens."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# plot a histogram, and add axis descriptions and title\n",
+ "out_expanded[(out_expanded['rank'] >=0)&(out_expanded['rank'] <30)]['rank'].hist(bins=29)\n",
+ "plt.xlabel('rank')\n",
+ "plt.ylabel('count')\n",
+ "plt.title('Histogram of ranks of retrieved paragraphs')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "out_expanded[(out_expanded.tokens>=0)&(out_expanded.tokens < 2000)]['tokens'].hist(bins=29)\n",
+ "plt.xlabel('tokens')\n",
+ "plt.ylabel('count')\n",
+ "plt.title('Histogram of the number of minimum tokens needed')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can observe that the context is most likely to be returned as one of the first results, and most likely to be returned within the first 200-500 tokens."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-2 0.000063\n",
+ "-1 0.074428\n",
+ " 0 0.453420\n",
+ " 1 0.089515\n",
+ " 2 0.047146\n",
+ " 3 0.032437\n",
+ " 4 0.024139\n",
+ " 5 0.019676\n",
+ " 6 0.015967\n",
+ " 7 0.013452\n",
+ " 8 0.011189\n",
+ " 9 0.009869\n",
+ " 10 0.009178\n",
+ "Name: rank, dtype: float64"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# normalized value_counts\n",
+ "out_expanded['rank'].value_counts(normalize=True).sort_index()[:13]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Probabilities of the relevant context being returned at each rank (-2 means a processing error, -1 means the relevant context was not among the first 200 results)."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
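
The rank and token bookkeeping inside `check_context` can be tried without calling the search endpoint. The following is a minimal sketch, assuming the search results are already available as an ordered list of `(text, n_tokens)` tuples; the function name `rank_and_tokens`, the constant `SEPARATOR_TOKENS` and the mock data are hypothetical and not part of the notebook.

```python
SEPARATOR_TOKENS = 4  # the notebook budgets 4 tokens for each separator


def rank_and_tokens(results, title, heading, max_len):
    """Return (rank, tokens_used) for the section identified by title and heading.

    `results` is a list of (text, n_tokens) tuples ordered by relevance
    (an assumed stand-in for the search response). The rank is -1 when the
    section is not found within the `max_len` token budget.
    """
    cur_len = 0
    for position, (text, n_tokens) in enumerate(results):
        cur_len += n_tokens + SEPARATOR_TOKENS
        if cur_len > max_len:
            break  # budget exceeded before reaching the section
        lines = text.split("\n")
        if len(lines) >= 2 and lines[0] == title and lines[1] == heading:
            return position, cur_len
    return -1, cur_len


# Hypothetical mock results: the first two lines of each text are title and heading.
mock_results = [
    ("A\nIntro\n\nbody", 100),
    ("B\nSummary\n\nbody", 50),
    ("C\nHistory\n\nbody", 200),
]

print(rank_and_tokens(mock_results, "B", "Summary", max_len=1000))
print(rank_and_tokens(mock_results, "B", "Summary", max_len=120))
```

With a generous budget the section is found at rank 1; with a budget of 120 tokens the loop stops before reaching it and the rank is -1, which is the same convention the notebook uses for sections that are not retrieved.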
examples/finetuning/olympics-3-train-qa.ipynb
@@ -0,0 +1,637 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 3. Train a fine-tuning model specialized for Q&A\n",
+ "This notebook uses the dataset of context, question and answer triples to additionally create adversarial question and context pairs, where the question was not generated from that context. In those cases the model is trained to answer \"No appropriate context found to answer the question.\" We will also train a discriminator model, which predicts whether the question can be answered based on the context or not.\n",
+ "\n",
+ "We will add hard adversarial examples as well, which will be based either on semantically similar sections, or neighbouring sections, originating from the same article."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>title</th>\n",
+ " <th>heading</th>\n",
+ " <th>content</th>\n",
+ " <th>tokens</th>\n",
+ " <th>context</th>\n",
+ " <th>questions</th>\n",
+ " <th>answers</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Summary</td>\n",
+ " <td>The 2020 Summer Olympics (Japanese: 2020年夏季オリン...</td>\n",
+ " <td>713</td>\n",
+ " <td>2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ...</td>\n",
+ " <td>1. What is the 2020 Summer Olympics?\\n2. When ...</td>\n",
+ " <td>1. The 2020 Summer Olympics is an internationa...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Host city selection</td>\n",
+ " <td>The International Olympic Committee (IOC) vote...</td>\n",
+ " <td>126</td>\n",
+ " <td>2020 Summer Olympics\\nHost city selection\\n\\nT...</td>\n",
+ " <td>1. \\n2. \\n3. \\n4.</td>\n",
+ " <td>1. What is the International Olympic Committee...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Impact of the COVID-19 pandemic</td>\n",
+ " <td>In January 2020, concerns were raised about th...</td>\n",
+ " <td>369</td>\n",
+ " <td>2020 Summer Olympics\\nImpact of the COVID-19 p...</td>\n",
+ " <td>1. What was the COVID-19 pandemic?\\n2. How did...</td>\n",
+ " <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Qualifying event cancellation and postponement</td>\n",
+ " <td>Concerns about the pandemic began to affect qu...</td>\n",
+ " <td>298</td>\n",
+ " <td>2020 Summer Olympics\\nQualifying event cancell...</td>\n",
+ " <td>1. What was the original location of the Asia ...</td>\n",
+ " <td>1. The original location of the Asia & Oceania...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>2020 Summer Olympics</td>\n",
+ " <td>Effect on doping tests</td>\n",
+ " <td>Mandatory doping tests were being severely res...</td>\n",
+ " <td>163</td>\n",
+ " <td>2020 Summer Olympics\\nEffect on doping tests\\n...</td>\n",
+ " <td>1. What was the COVID-19 pandemic?\\n2. What di...</td>\n",
+ " <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " title heading \\\n",
+ "0 2020 Summer Olympics Summary \n",
+ "1 2020 Summer Olympics Host city selection \n",
+ "2 2020 Summer Olympics Impact of the COVID-19 pandemic \n",
+ "3 2020 Summer Olympics Qualifying event cancellation and postponement \n",
+ "4 2020 Summer Olympics Effect on doping tests \n",
+ "\n",
+ " content tokens \\\n",
+ "0 The 2020 Summer Olympics (Japanese: 2020年夏季オリン... 713 \n",
+ "1 The International Olympic Committee (IOC) vote... 126 \n",
+ "2 In January 2020, concerns were raised about th... 369 \n",
+ "3 Concerns about the pandemic began to affect qu... 298 \n",
+ "4 Mandatory doping tests were being severely res... 163 \n",
+ "\n",
+ " context \\\n",
+ "0 2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ... \n",
+ "1 2020 Summer Olympics\\nHost city selection\\n\\nT... \n",
+ "2 2020 Summer Olympics\\nImpact of the COVID-19 p... \n",
+ "3 2020 Summer Olympics\\nQualifying event cancell... \n",
+ "4 2020 Summer Olympics\\nEffect on doping tests\\n... \n",
+ "\n",
+ " questions \\\n",
+ "0 1. What is the 2020 Summer Olympics?\\n2. When ... \n",
+ "1 1. \\n2. \\n3. \\n4. \n",
+ "2 1. What was the COVID-19 pandemic?\\n2. How did... \n",
+ "3 1. What was the original location of the Asia ... \n",
+ "4 1. What was the COVID-19 pandemic?\\n2. What di... \n",
+ "\n",
+ " answers \n",
+ "0 1. The 2020 Summer Olympics is an internationa... \n",
+ "1 1. What is the International Olympic Committee... \n",
+ "2 1. The COVID-19 pandemic was a pandemic that o... \n",
+ "3 1. The original location of the Asia & Oceania... \n",
+ "4 1. The COVID-19 pandemic was a pandemic that o... "
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import openai\n",
+ "import pandas as pd\n",
+ "df = pd.read_csv('olympics-data/olympics_qa.csv')\n",
+ "olympics_search_fileid = \"file-c3shd8wqF3vSCKaukW4Jr1TT\"\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Split the sections into a training and testing set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(3014, 754)"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
+ "len(train_df), len(test_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We check that the separator we intend to use isn't present within the contexts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.context.str.contains('->').sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3.1 Create the fine-tuning datasets for Q&A and discriminator models\n",
+ "The fine-tuning dataset is created in the following way. For every corresponding question, answer and context triple we create:\n",
+ "- Positive example: correct question, answer, context pair\n",
+ "- Negative examples:\n",
+ " - random negative example, where the random context is paired with the question \n",
+ " - two hard negative examples\n",
+ " - one originating from the same wikipedia article\n",
+ " - another, which is most similar to the correct context\n",
+ "\n",
+ "This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the performance too much.\n",
+ "\n",
+ "We apply the same process of dataset creation for both the discriminator and the Q&A model. We apply the process separately for the training and testing set, to ensure that examples from the training set don't appear in the test set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import random\n",
+ "\n",
+ "def get_random_similar_contexts(question, context, file_id=olympics_search_fileid, search_model='ada', max_rerank=10):\n",
+ "    \"\"\"\n",
+ "    Find similar contexts to the given context using the search file\n",
+ "    \"\"\"\n",
+ "    try:\n",
+ "        results = openai.Engine(search_model).search(\n",
+ "            search_model=search_model, \n",
+ "            query=question, \n",
+ "            max_rerank=max_rerank,\n",
+ "            file=file_id\n",
+ "        )\n",
+ "        candidates = []\n",
+ "        for result in results['data'][:3]:\n",
+ "            if result['text'] == context:\n",
+ "                continue\n",
+ "            candidates.append(result['text'])\n",
+ "        random_candidate = random.choice(candidates)\n",
+ "        return random_candidate\n",
+ "    except Exception as e:\n",
+ "        print(e)\n",
+ "        return \"\"\n",
+ "\n",
+ "def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):\n",
+ "    \"\"\"\n",
+ "    Create a dataset for fine tuning the OpenAI model; either for a discriminator model, \n",
+ "    or a model specializing in Q&A, where it says if no relevant context is found.\n",
+ "\n",
+ "    Parameters\n",
+ "    ----------\n",
+ "    df: pd.DataFrame\n",
+ "        The dataframe containing the question, answer and context pairs\n",
+ "    discriminator: bool\n",
+ "        Whether to create a dataset for the discriminator\n",
+ "    n_negative: int\n",
+ "        The number of random negative samples to add (using a random context)\n",
+ "    add_related: bool\n",
+ "        Whether to add the related contexts to the correct context. These are hard negative examples\n",
+ "\n",
+ "    Returns\n",
+ "    -------\n",
+ "    pd.DataFrame\n",
+ "        The dataframe containing the prompts and completions, ready for fine-tuning\n",
+ "    \"\"\"\n",
+ "    rows = []\n",
+ "    for i, row in df.iterrows():\n",
+ "        for q, a in zip((\"1.\" + row.questions).split('\\n'), (\"1.\" + row.answers).split('\\n')):\n",
+ "            if len(q) >10 and len(a) >10:\n",
+ "                if discriminator:\n",
+ "                    rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" yes\"})\n",
+ "                else:\n",
+ "                    rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" {a[2:].strip()}\"})\n",
+ "\n",
+ "    for i, row in df.iterrows():\n",
+ "        for q in (\"1.\" + row.questions).split('\\n'):\n",
+ "            if len(q) >10:\n",
+ "                for j in range(n_negative + (2 if add_related else 0)):\n",
+ "                    random_context = \"\"\n",
+ "                    if j == 0 and add_related:\n",
+ "                        # add the related contexts based on originating from the same wikipedia page\n",
+ "                        subset = df[(df.title == row.title) & (df.context != row.context)]\n",
+ "                        \n",
+ "                        if len(subset) < 1:\n",
+ "                            continue\n",
+ "                        random_context = subset.sample(1).iloc[0].context\n",
+ "                    elif j == 1 and add_related:  # elif, so the same-article context from j == 0 is not overwritten by the else branch\n",
+ "                        # add the related contexts based on the most similar contexts according to the search\n",
+ "                        random_context = get_random_similar_contexts(q[2:].strip(), row.context, search_model='ada', max_rerank=10)\n",
+ "                    else:\n",
+ "                        while True:\n",
+ "                            # add random context, which isn't the correct context\n",
+ "                            random_context = df.sample(1).iloc[0].context\n",
+ "                            if random_context != row.context:\n",
+ "                                break\n",
+ "                    if discriminator:\n",
+ "                        rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" no\"})\n",
+ "                    else:\n",
+ "                        rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" No appropriate context found to answer the question.\"})\n",
+ "\n",
+ "    return pd.DataFrame(rows) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We apply the same process of dataset creation for both the discriminator and the Q&A model. We apply the process separately for the training and testing set, to ensure that examples from the training set don't appear in the test set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "for name, is_disc in [('discriminator', True), ('qa', False)]:\n",
+ "    for train_test, dt in [('train', train_df), ('test', test_df)]:\n",
+ "        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)\n",
+ "        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We formatted the data according to the recommendations from the fine-tuning tool, which is available using\n",
+ "> openai tools fine_tunes.prepare_data -f qa_train.jsonl\n",
+ "\n",
+ "We highly recommend that you use this tool, which suggests improvements in your data formatting for fine-tuning.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3.2 Submit the datasets for fine-tuning"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "!openai api fine_tunes.create -t \"olympics-data/discriminator_train.jsonl\" -v \"olympics-data/discriminator_test.jsonl\" --no_packing --batch_size 16 --compute_classification_metrics --classification_positive_class \" yes\" --model ada"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "!openai api fine_tunes.create -t \"olympics-data/qa_train.jsonl\" -v \"olympics-data/qa_test.jsonl\" --no_packing --batch_size 16"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3.3 Using the fine-tuned models\n",
+ "\n",
+ "We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a `yes` vs `no` answer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[<OpenAIObject at 0x7fe812e602b0> JSON: {\n",
+ " \" no\": -10.819577,\n",
+ " \" yes\": -2.045765e-05\n",
+ " }]"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ft_discriminator = \"curie:ft-openai-internal-2021-08-23-23-58-57\"\n",
+ "ft_qa = \"curie:ft-openai-internal-2021-08-23-17-54-10\"\n",
+ "\n",
+ "def apply_ft_discriminator(context, question, discriminator_model):\n",
+ "    \"\"\"\n",
+ "    Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.\n",
+ "    \"\"\"\n",
+ "    prompt = f\"{context}\\nQuestion: {question}\\n Related:\"\n",
+ "    result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)\n",
+ "    return result['choices'][0]['logprobs']['top_logprobs']\n",
+ "\n",
+ "apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
+ " 'What was the first human-made object in space?', ft_discriminator)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the model can generalize well to different contexts and questions. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def apply_ft_qa_answer(context, question, answering_model):\n",
+ "    \"\"\"\n",
+ "    Apply the fine tuned Q&A model to a question, answering it from the given context\n",
+ "    \"\"\"\n",
+ "    prompt = f\"{context}\\nQuestion: {question}\\nAnswer:\"\n",
+ "    result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\\n'])\n",
+ "    return result['choices'][0]['text']\n",
+ "\n",
+ "apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
+ " 'What was the first human-made object in space?', ft_qa)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the model can answer the question, when the context is appropriate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' The Soviet Union was the first country to successfully launch a satellite into space'"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
+ " 'What is impressive about the Soviet Union?', ft_qa)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' No appropriate context found to answer the question'"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
+ " 'How many cars were produced in the Soviet Union in 1970?', ft_qa)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the model knows when to answer the question, and when to say that insufficient context is present to answer the question."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also combine a discriminator with a base model or a fine-tuned Q&A model. The discriminator essentially decides whether the question can be answered given the context."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "' Weather could cause a sport event to have no crowd'"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):\n",
+ "    # top_logprobs is a list with one dict per generated token; we requested a single token, so take the first entry\n",
+ "    logprobs = apply_ft_discriminator(context, question, discriminator_model)[0]\n",
+ "    yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100\n",
+ "    no_logprob = logprobs[' no'] if ' no' in logprobs else -100\n",
+ "    if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:\n",
+ "        return \" No appropriate context found to answer the question based on the discriminator.\"\n",
+ "    return apply_ft_qa_answer(context, question, answering_model)\n",
+ "answer_question_conditionally(ft_qa, ft_discriminator, \n",
+ " \"Crowdless games are a rare although not unheard-of occurrence in sports. \\\n",
+ " When they do occur, it is usually the result of events beyond the control \\\n",
+ " of the teams or fans, such as weather-related concerns, public health concerns, \\\n",
+ " or wider civil disturbances unrelated to the game. For instance, \\\n",
+ " the COVID-19 pandemic caused many sports leagues around the world \\\n",
+ " to be played behind closed doors.\",\n",
+ " \"Could weather cause a sport event to have no crowd?\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The above function illustrates one way to combine a discriminator and a fine-tuned Q&A model. This gives more fine-grained control over how certain we want the model to be before it answers the question.\n",
+ "\n",
+ "We'll now look at how the answers endpoint works: it combines search to retrieve the relevant context from a knowledge base with the fine-tuned Q&A model to answer the question."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3.4 Answering the question based on a knowledge base\n",
+ "Finally, we can use logic similar to the [/answers](https://beta.openai.com/docs/api-reference/answers) endpoint, where we first search for the relevant context, and then ask a Q&A model to answer the question given that context. If you'd like to see the implementation details, check out the [`answers_with_ft.py`](answers_with_ft.py) file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\" Canada won the Women's football tournament at the 2020 Olympic games\""
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from answers_with_ft import answer_question\n",
+ "answer_question(olympics_search_fileid, ft_qa, \"Which country won the Women's football tournament at the 2020 Olympic games?\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
+ },
+ "kernelspec": {
+ "display_name": "Python 3.7.3 64-bit ('base': conda)",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
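
The gating decision in `answer_question_conditionally` depends only on the two log-probabilities for ` yes` and ` no`, so it can be explored offline. The following is a minimal sketch, assuming the log-probabilities for the first generated token are given as a plain dict; the helper names `should_answer` and `yes_probability` are hypothetical, and the first sample dict copies the values shown in the notebook output for the Sputnik example.

```python
import math

MISSING_LOGPROB = -100.0  # same fallback the notebook uses when a label is absent


def should_answer(top_logprobs, yes_modifier=0.0):
    """Return True when the discriminator favours ' yes' after applying the modifier."""
    yes_lp = top_logprobs.get(" yes", MISSING_LOGPROB)
    no_lp = top_logprobs.get(" no", MISSING_LOGPROB)
    return yes_lp + yes_modifier >= no_lp


def yes_probability(top_logprobs):
    """Convert the two log-probabilities into a normalised probability of ' yes'."""
    yes = math.exp(top_logprobs.get(" yes", MISSING_LOGPROB))
    no = math.exp(top_logprobs.get(" no", MISSING_LOGPROB))
    return yes / (yes + no)


confident = {" no": -10.819577, " yes": -2.045765e-05}  # values from the notebook output
doubtful = {" no": -1.0, " yes": -3.0}                   # made-up values for illustration

print(should_answer(confident), round(yes_probability(confident), 4))
print(should_answer(doubtful), should_answer(doubtful, yes_modifier=2.5))
```

A positive `yes_modifier` makes the pipeline more willing to answer and a negative one makes it more conservative, which is the fine-grained control the notebook describes.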
openai/validators.py
@@ -654,7 +654,8 @@ def get_batch_size_suggestion(df, no_packing):
batch_size = BATCH_SIZE_TO_N_EXAMPLES_RATIO * n_examples
else:
batch_size = BATCH_SIZE_TO_N_CHARACTERS_RATIO * n_characters
- batch_size = 2 ** int(np.log2(batch_size))
+
+ batch_size = max(1, int(2 ** np.ceil(np.log2(batch_size))))
batch_size_suggestion = f" --batch_size {batch_size}"
return batch_size_suggestion
@@ -694,7 +695,7 @@ def write_out_file(df, fname, any_remediations, auto_accept):
input_text = "\n\nYour data will be written to a new JSONL file. Proceed [Y/n]: "
- if not any_remediations:
+ if not any_remediations and not split:
sys.stdout.write(
f'\nYou can use your file for fine-tuning:\n> openai api fine_tunes.create -t "{fname}"{additional_params}\n\nAfter you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `{common_prompt_suffix_new_line_handled}` for the model to start generating completions, rather than continuing with the prompt.{optional_ending_string}\n'
)
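
The effect of the batch-size change in `get_batch_size_suggestion` can be seen by comparing the two rounding rules side by side. The following is a minimal sketch, assuming `math` as a stand-in for the `numpy` calls in the diff; the function names `old_suggestion` and `new_suggestion` are hypothetical and only mirror the removed and added lines.

```python
import math


def old_suggestion(batch_size):
    # Removed line: truncates the exponent toward zero, so inputs below 1
    # can yield a fractional batch size.
    return 2 ** int(math.log2(batch_size))


def new_suggestion(batch_size):
    # Added line: rounds the exponent up and never returns less than 1.
    return max(1, int(2 ** math.ceil(math.log2(batch_size))))


for raw in (0.3, 16, 20):
    print(raw, old_suggestion(raw), new_suggestion(raw))
```

For a raw value below 1 the old rule can suggest `--batch_size 0.5`, while the new rule clamps the suggestion to 1. For values that are not already a power of two, the new rule rounds up to the next power of two instead of down.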
openai/version.py
@@ -1,1 +1,1 @@
-VERSION = "0.11.0"
+VERSION = "0.11.1"