A new minor version of the openai-python package was released late on Friday 7 June 2024, only a couple of days after the last minor release. This release adds a chunking_strategy argument to the methods for adding files to vector stores.
What is Chunking?
Chunking is the process of breaking a large piece of text down into smaller segments (or “chunks”). Selecting an appropriate chunk size should ensure that LLM results are accurate and relevant. Ideally you want the size of the chunks to be neither too small nor too large. If the chunks are too small then the LLM might fail to understand the necessary context surrounding the chunk (although chunk overlap can help with this!). If the chunks are too large then the LLM might find it difficult to identify the relevant content within the chunk. As a general principle, if a chunk makes sense to a human without the surrounding context, then it should also make sense to the LLM. A good chunk size can be determined empirically, starting from large chunks and gradually decreasing size until there’s a marked deterioration in results.
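To make the idea concrete, here's a minimal sketch of static chunking with overlap. It uses a plain list of items as a stand-in for tokens (the API tokenizes properly; this helper is just illustrative and not part of the openai package):

```python
def chunk_tokens(tokens, max_chunk_size, overlap):
    """Split a token list into chunks of at most max_chunk_size tokens,
    with `overlap` tokens shared between consecutive chunks."""
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once a chunk reaches the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(words, max_chunk_size=4, overlap=2):
    print(" ".join(chunk))
```

Each chunk starts halfway through the previous one, so context that straddles a chunk boundary still appears intact in at least one chunk.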
Let’s take a quick look at how this works.
Create an OpenAI Client
import os
from openai import OpenAI
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)
Create a Vector Store
If you haven’t already created a vector store then do so now.
store = client.beta.vector_stores.create(name="Test")
This is what the resulting vector store object looks like:
VectorStore(
    id='vs_T7xl1a13glOcHLK7Xzjn08DT',
    created_at=1717821463,
    file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0),
    last_active_at=1717821463,
    metadata={},
    name='Test',
    object='vector_store',
    status='completed',
    usage_bytes=0,
    expires_after=None,
    expires_at=None
)
Assign the vector store ID to a variable.
VECTOR_STORE_ID = "vs_T7xl1a13glOcHLK7Xzjn08DT"
Upload File (Default Chunking)
First let’s upload a file using the default chunking strategy.
# Path of file to upload.
FILE_PATH = "pg844.txt"
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f
    )
The resulting file object looks like this:
VectorStoreFile(
    id='file-dxc8vvwhT5j3obhaaZw0WVdF',
    created_at=1717825354,
    last_error=None,
    object='vector_store.file',
    status='completed',
    usage_bytes=236235,
    vector_store_id='vs_T7xl1a13glOcHLK7Xzjn08DT',
    chunking_strategy=ChunkingStrategyStatic(
        static=ChunkingStrategyStaticStatic(
            chunk_overlap_tokens=400,
            max_chunk_size_tokens=800
        ),
        type='static'
    )
)
The default values for max_chunk_size_tokens and chunk_overlap_tokens mean that files are indexed by being split into 800-token chunks with a 400-token overlap between consecutive chunks. You'd get the same result if you set chunking_strategy to the default auto strategy.
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f,
        chunking_strategy={"type": "auto"}
    )
Upload File (Static Chunking)
Alternatively, you can specify static
chunking and provide particular values for chunk_overlap_tokens
and max_chunk_size_tokens
.
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f,
        chunking_strategy={
            "type": "static",
            "static": {
                "chunk_overlap_tokens": 200,
                "max_chunk_size_tokens": 400
            }
        }
    )
And you can see the changes in the returned object.
VectorStoreFile(
    id='file-rnEXDvko5UEfW1DSKwelWawQ',
    created_at=1717825361,
    last_error=None,
    object='vector_store.file',
    status='completed',
    usage_bytes=332491,
    vector_store_id='vs_T7xl1a13glOcHLK7Xzjn08DT',
    chunking_strategy=ChunkingStrategyStatic(
        static=ChunkingStrategyStaticStatic(
            chunk_overlap_tokens=200,
            max_chunk_size_tokens=400
        ),
        type='static'
    )
)
Note that chunk_overlap_tokens and max_chunk_size_tokens reflect the specified values, and usage_bytes has changed due to the different way that the file content has been chunked.
There are a couple of constraints on these parameters: max_chunk_size_tokens must be between 100 and 4096, while chunk_overlap_tokens must be non-negative (zero overlap is allowed) and not more than half of max_chunk_size_tokens.
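If you're building chunking parameters dynamically, it can be handy to check them before making the API call. Here's a small helper of my own (not part of the openai package) that enforces the constraints above:

```python
def validate_static_chunking(max_chunk_size_tokens, chunk_overlap_tokens):
    """Raise ValueError if static chunking parameters violate the documented constraints."""
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0:
        raise ValueError("chunk_overlap_tokens must be non-negative")
    if chunk_overlap_tokens > max_chunk_size_tokens // 2:
        raise ValueError("chunk_overlap_tokens must not exceed half of max_chunk_size_tokens")

# The values used above pass; e.g. validate_static_chunking(400, 200).
validate_static_chunking(400, 200)
```

Failing fast like this is cheaper than waiting for the API to reject the request after the file has already been uploaded.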
💡 You can use client.beta.vector_stores.file_batches.upload_and_poll to upload multiple files in a batch.
List Files in Vector Store
List the files in the vector store.
client.beta.vector_stores.files.list(vector_store_id=VECTOR_STORE_ID)
Delete Vector Store
Finally, delete the vector store.
client.beta.vector_stores.delete(vector_store_id=VECTOR_STORE_ID)