Optimized version of sliding window for semantic chunking #5


Open
labdmitriy opened this issue Apr 23, 2024 · 3 comments

Comments


labdmitriy commented Apr 23, 2024

Hi Greg,

Thanks a lot for your work!

I want to share a more optimized version of your function combine_sentences from the tutorial about text splitting.
Instead of this function:

def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Then add the whole thing to your dict
        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

We can use generators and the Python standard library to generate windows more efficiently, building each window incrementally with a deque instead of re-indexing the list for every sentence:

from collections import deque
from itertools import islice

def sliding_window(sentences, buffer_size=1):
    # Pre-fill the window with the first `buffer_size` sentences.
    # maxlen bounds the window to the current sentence plus `buffer_size`
    # neighbours on each side, so old sentences fall off automatically.
    window = deque(islice(sentences, buffer_size), maxlen=2 * buffer_size + 1)

    # Slide the window forward: each appended sentence yields the window
    # for one more position.
    for sentence in sentences[buffer_size:]:
        window.append(sentence)
        yield tuple(window)

    # Drain the tail: the last positions have fewer than `buffer_size`
    # sentences after them, so yield the shrinking windows.
    while len(window) > buffer_size + 1:
        window.popleft()
        yield tuple(window)

for i, window_sentences in enumerate(sliding_window(single_sentences_list, buffer_size=2)):
    sentence_dicts[i]['combined_sentence'] = ' '.join(window_sentences)
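
For example, with buffer_size=1 the generator yields one window per sentence, truncated at both ends:

for window in sliding_window(["A.", "B.", "C.", "D."], buffer_size=1):
    print(window)
# ('A.', 'B.')
# ('A.', 'B.', 'C.')
# ('B.', 'C.', 'D.')
# ('C.', 'D.')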

By the way, I found that splitting by punctuation symbols does not work in the tutorial if there is no space after a punctuation symbol before the next sentence (because of the regex (?<=[.?!])\s+). Could you please tell me whether you explored more sophisticated methods of splitting text into sentences, and the influence of these methods on the overall quality of semantic chunking?
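
A minimal reproduction (the lookahead workaround below is only my assumption, not something from the tutorial):

import re

text = "First sentence.Second sentence. Third one."

# The tutorial's regex requires whitespace after the punctuation,
# so the two sentences without a space stay glued together:
print(re.split(r'(?<=[.?!])\s+', text))
# ['First sentence.Second sentence.', 'Third one.']

# Also splitting on zero whitespace when an uppercase letter follows
# (hypothetical workaround) separates all three:
print(re.split(r'(?<=[.?!])\s*(?=[A-Z])', text))
# ['First sentence.', 'Second sentence.', 'Third one.']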

Thank you.

@gkamradt
Contributor

Hey nice! That looks slick. Thank you for sharing.

I didn't explore a more sophisticated method (there are definitely other ways) because I was moving quickly for the tutorial MVP.

I'll keep this optimized method in mind for when I update the tutorial code.

@Pythonaire

Pythonaire commented Jul 2, 2024

One word ... amazing.

I made some quality checks on the input text before starting the chunking process in my RAG. Tested with PDF documents:

import re

def apply_substitutions(self, text, substitutions):
    # Apply each (pattern, replacement[, flags]) rule in order
    for pattern, replacement, *flags in substitutions:
        if flags:
            text = re.sub(pattern, replacement, text, flags=flags[0])
        else:
            text = re.sub(pattern, replacement, text)
    return text

# List of (pattern, replacement[, flags]) tuples
substitutions = [
    (r'\f|\n{2,}', '\n'),                   # remove page breaks ('\u000C') or multiple new lines (maybe "new page"?)
    (r'^[•◦]', '', re.MULTILINE),           # remove bullets
    (r'\. (?=[A-Z])|\.(?=[A-Z])', r'. '),   # normalize periods before uppercase letters to a period plus one space
    (r'\n\s*\n', '\n'),                     # remove blank lines
    (r'-[\f\n\s]+', ''),                    # German language: join words hyphenated across a line break
    (r'^\s+|\s+$', ''),                     # remove leading and trailing whitespace
    (r'\s+', ' '),                          # replace runs of whitespace with a single space
]
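
A quick illustration with my own sample input (calling the method standalone, since the surrounding class isn't shown and self is unused in the body):

# Hypothetical German PDF extract: a hyphenated line break, a bullet,
# and stray blank lines.
raw = 'Zusammen-\nfassung\n\n• erster Punkt\n\n\nzweiter  Punkt'

print(apply_substitutions(None, raw, substitutions))
# Zusammenfassung erster Punkt zweiter Punkt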

Using a plain TXT file, I got:

breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)

....
ollamapy/.venv/lib/python3.12/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
                                  ~~~^^^^^^^^^
IndexError: index -1 is out of bounds for axis 0 with size 0

I changed the script:

if len(distances):
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
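
For context (my reading of the traceback, not stated above): distances holds the distances between consecutive sentence embeddings, so a text that yields at most one sentence produces an empty array and np.percentile has nothing to index. A minimal sketch of the guard with an explicit fallback, reusing the tutorial's variable names:

import numpy as np

if len(distances):
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
else:
    # Hypothetical fallback: no distances means there is nothing to
    # split, so treat the whole text as a single chunk.
    breakpoint_distance_threshold = None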
 

@gkamradt
Contributor

gkamradt commented Jul 5, 2024

Thanks for bringing both of these up.

This repo isn't actively maintained and won't be updated for a bit. Apologies, but there are too many projects going on!

If anyone really wants to help develop it, please contact me.
