Optimized version of sliding window for semantic chunking #5


Open
labdmitriy opened this issue Apr 23, 2024 · 3 comments

Comments


labdmitriy commented Apr 23, 2024

Hi Greg,

Thanks a lot for your work!

I want to share a more optimized version of your function combine_sentences from the tutorial about text splitting.
Instead of this function:

def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Then add the whole thing to your dict
        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

We can use generators and the Python standard library to generate windows more efficiently, building each window incrementally with a deque instead of re-indexing the list for every sentence:

from collections import deque
from itertools import islice

def sliding_window(sentences, buffer_size=1):
    # Pre-fill the window with the first `buffer_size` sentences.
    # maxlen bounds the window to the current sentence plus `buffer_size`
    # neighbours on each side, so old sentences fall off automatically.
    window = deque(islice(sentences, buffer_size), maxlen=2 * buffer_size + 1)

    # Slide the window forward: each appended sentence yields the window
    # for one more position.
    for sentence in sentences[buffer_size:]:
        window.append(sentence)
        yield tuple(window)

    # Drain the tail: the last positions have fewer than `buffer_size`
    # sentences after them, so yield the shrinking windows.
    while len(window) > buffer_size + 1:
        window.popleft()
        yield tuple(window)

for i, window_sentences in enumerate(sliding_window(single_sentences_list, buffer_size=2)):
    sentence_dicts[i]['combined_sentence'] = ' '.join(window_sentences)
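
For example, with buffer_size=1 the generator yields one window per sentence, truncated at both ends:

for window in sliding_window(["A.", "B.", "C.", "D."], buffer_size=1):
    print(window)
# ('A.', 'B.')
# ('A.', 'B.', 'C.')
# ('B.', 'C.', 'D.')
# ('C.', 'D.')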

By the way, I found that splitting by punctuation symbols does not work in the tutorial if there is no space after a punctuation symbol before the next sentence (because of the regex (?<=[.?!])\s+). Could you please tell me whether you explored more sophisticated methods of splitting text into sentences, and the influence of these methods on the overall quality of semantic chunking?
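
A minimal reproduction (the lookahead workaround below is only my assumption, not something from the tutorial):

import re

text = "First sentence.Second sentence. Third one."

# The tutorial's regex requires whitespace after the punctuation,
# so the two sentences without a space stay glued together:
print(re.split(r'(?<=[.?!])\s+', text))
# ['First sentence.Second sentence.', 'Third one.']

# Also splitting on zero whitespace when an uppercase letter follows
# (hypothetical workaround) separates all three:
print(re.split(r'(?<=[.?!])\s*(?=[A-Z])', text))
# ['First sentence.', 'Second sentence.', 'Third one.']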

Thank you.

@gkamradt
Contributor

Hey nice! That looks slick. Thank you for sharing.

I didn't explore a more sophisticated method (there are definitely other ways) because I was moving quickly for the tutorial MVP.

I'll keep this optimized method in mind for when I update the tutorial code.

@Pythonaire

Pythonaire commented Jul 2, 2024

One word ... amazing.

I made some quality checks on the input text before starting the chunking process in my RAG. Tested with PDF documents:

import re

def apply_substitutions(self, text, substitutions):
    # Apply each (pattern, replacement[, flags]) rule in order
    for pattern, replacement, *flags in substitutions:
        if flags:
            text = re.sub(pattern, replacement, text, flags=flags[0])
        else:
            text = re.sub(pattern, replacement, text)
    return text

# List of (pattern, replacement[, flags]) tuples
substitutions = [
    (r'\f|\n{2,}', '\n'),                   # remove page breaks ('\u000C') or multiple new lines (maybe "new page"?)
    (r'^[•◦]', '', re.MULTILINE),           # remove bullets
    (r'\. (?=[A-Z])|\.(?=[A-Z])', r'. '),   # normalize periods before uppercase letters to a period plus one space
    (r'\n\s*\n', '\n'),                     # remove blank lines
    (r'-[\f\n\s]+', ''),                    # German language: join words hyphenated across a line break
    (r'^\s+|\s+$', ''),                     # remove leading and trailing whitespace
    (r'\s+', ' '),                          # replace runs of whitespace with a single space
]
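
A quick illustration with my own sample input (calling the method standalone, since the surrounding class isn't shown and self is unused in the body):

# Hypothetical German PDF extract: a hyphenated line break, a bullet,
# and stray blank lines.
raw = 'Zusammen-\nfassung\n\n• erster Punkt\n\n\nzweiter  Punkt'

print(apply_substitutions(None, raw, substitutions))
# Zusammenfassung erster Punkt zweiter Punkt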

Using a plain TXT file, I got:

breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)

....
ollamapy/.venv/lib/python3.12/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
                                  ~~~^^^^^^^^^
IndexError: index -1 is out of bounds for axis 0 with size 0

I changed the script:

if len(distances):
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
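
For context (my reading of the traceback, not stated above): distances holds the distances between consecutive sentence embeddings, so a text that yields at most one sentence produces an empty array and np.percentile has nothing to index. A minimal sketch of the guard with an explicit fallback, reusing the tutorial's variable names:

import numpy as np

if len(distances):
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
else:
    # Hypothetical fallback: no distances means there is nothing to
    # split, so treat the whole text as a single chunk.
    breakpoint_distance_threshold = None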
 

@gkamradt
Contributor

gkamradt commented Jul 5, 2024

Thanks for bringing both of these up.

This repo isn't actively maintained and won't be updated for a bit. Apologies, but there are too many projects going on!

If anyone really wants to help develop it, please contact me.
