The Chunk Splitting Problem

April 8, 2024 | STATUS: CLOSED

For this experiment I compared three existing indexing methods, Rule-Based Chunking, Semantic Chunking, and Slide Retrieval, in terms of time, embedding cost, and retrieved response quality. Additionally, I designed a fourth, experimental method, Neural Network Estimation Retrieval (NNER), for indexing a text corpus. NNER uses a neural network to build a linear representation of a corpus and learns to extract information from a specific location in the corpus regardless of how the corpus is broken into chunks, which circumvents the "chunk splitting problem" and saves on embedding costs.
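To make the "chunk splitting problem" concrete, here is a minimal, self-contained sketch. The fixed-size splitter below is a hypothetical stand-in, not the implementation from this experiment; it simply shows how a fact that straddles a chunk boundary ends up fully contained in no single chunk, so no one chunk can answer a question about it.

```python
# Hypothetical fixed-size chunker, used only to illustrate the
# "chunk splitting problem" -- not the experiment's implementation.

def fixed_size_chunks(text, size):
    """Split text into consecutive chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus = "The capital of Australia is Canberra. Sydney is the largest city."
chunks = fixed_size_chunks(corpus, 30)

# The key fact ("... Australia is Canberra") is split across two chunks,
# so no single chunk pairs "Australia" with "Canberra".
assert not any("Australia" in c and "Canberra" in c for c in chunks)
```

Retrieval then fails even when the answer is literally in the corpus, because each chunk is embedded and retrieved in isolation; this is the failure mode NNER was meant to sidestep.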

Unfortunately, in my testing the NNER method severely underperformed, which could be due to a wide range of factors (listed in the attached report). I am currently not interested in continuing this project, although I still believe that with the proper adjustments NNER could be a viable alternative.


Motivations:

I had become aware of the "chunk splitting problem" while working on a RAG system during my co-op, and I wanted to experiment with an idea I had for a new indexing method that I believed had the potential to circumvent this problem. This project was a way for me to apply what I learned about RAG on my co-op, while also exercising my creative spirit by trying to innovate in the RAG space.

Implementation:

All four methods were implemented in Python using LlamaIndex, NLTK, and PyTorch. The dataset was created using OpenAI's GPT-4 and Wikipedia.
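As a rough sketch of the rule-based baseline: split on sentence boundaries, then greedily pack sentences into chunks under a size budget. The actual implementation used NLTK and LlamaIndex splitters; the regex sentence splitter and the `max_chars` budget below are simplified stand-ins for illustration.

```python
import re

def rule_based_chunks(text, max_chars=200):
    """Sketch of rule-based chunking: split text into sentences, then
    pack consecutive sentences into chunks of at most max_chars."""
    # Simplified sentence splitter (stand-in for NLTK's sent_tokenize):
    # break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("Rule-based chunking splits on surface features like sentence "
       "boundaries. It is cheap and fast. Semantic chunking instead groups "
       "sentences by embedding similarity, at a higher embedding cost.")
for chunk in rule_based_chunks(doc, max_chars=120):
    print(chunk)
```

Because chunk boundaries here depend only on surface features, not meaning, related sentences can still land in different chunks, which is exactly the failure mode the experiment set out to measure.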

Future Work:

  • Organize and document the repo
  • Test out various potential improvements to NNER
  • Optimize all methods for speed (using existing indexing libraries)
  • Use human validation for checking response quality
  • Try more varied datasets