Let’s build a word tokenizer from scratch. Don’t write any code. Explain to me, at a high level, how you would do it.
To build a word tokenizer from scratch, follow these high-level steps:
- Input Handling
  - Accept raw text input that needs tokenization.
- Text Normalization
  - Pre-process the text by converting it to lowercase to ensure uniformity.
  - Handle punctuation by defining rules to either remove it or separate it into its own tokens, based on your needs.
- Whitespace Splitting
  - Use spaces or predefined delimiters to split text into segments, as words are typically separated by whitespace.
- Handling Special Cases
  - Define rules for apostrophes (e.g., “don’t” to [“do”, “n’t”]) and hyphenated words.
  - Decide how to treat numbers, mixed alphanumeric strings, and special characters.
- Language-Specific Rules
  - Consider any language-specific requirements or exceptions if designing for specific languages.
- Output Tokens
  - Produce a list of tokens from the processed text, ensuring each token meets the predefined criteria.
This structure will allow for basic word tokenization while keeping room for customization based on different applications and languages.
import { lowerCase, handlePunctuation } from "./normalization";

// Normalize the text, split it into segments, then clean punctuation from each segment.
export function tokenize(text: string): string[] {
  return splitIntoSegments(lowerCase(text)).map(handlePunctuation);
}

// Split on runs of whitespace, hyphens, and underscores; the dash sits at the end of
// the character class so it is treated literally. Empty segments are dropped.
function splitIntoSegments(text: string): string[] {
  return text.split(/[\s_-]+/).filter((segment) => segment.length > 0);
}
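The `./normalization` module isn’t shown here, so the following is only a minimal sketch of what `lowerCase` and `handlePunctuation` might look like, assuming the lowercase and punctuation rules described in the steps above; the real helpers may behave differently.

```ts
// normalization.ts: a hypothetical sketch of the imported helpers.

// Lowercase the whole input so "Word" and "word" normalize to the same token.
export function lowerCase(text: string): string {
  return text.toLowerCase();
}

// Strip leading and trailing punctuation from a segment, e.g. "hello," -> "hello".
// Interior apostrophes (as in "don't") are left for later special-case handling.
export function handlePunctuation(segment: string): string {
  return segment.replace(/^[.,!?;:"'()\[\]{}]+|[.,!?;:"'()\[\]{}]+$/g, "");
}
```

With these assumed helpers, `tokenize("Hello, well-being!")` would return `["hello", "well", "being"]`, since the splitting rule above also breaks on hyphens.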
When would I use “predefined delimiters” to split text, rather than whitespace?
Use predefined delimiters to split text instead of whitespace when:
- Structured Data: The text contains structured data formats (e.g., CSV, TSV) where commas, tabs, or other characters separate entries.
- Custom Rules: Specific applications require splitting text based on custom criteria, such as splitting on hyphens or slashes.
- Non-Whitespace Separation: The text uses characters other than spaces to separate units, or no separators at all (e.g., scriptio continua in classical Latin, Chinese and Japanese), so whitespace splitting alone is not sufficient.
- Preserving Meaning: Delimiters carry semantic weight important for analysis, such as splitting a paragraph based on sentence-end punctuation.
Choosing delimiters depends on the structure and purpose of your text processing.
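As an illustration of the structured-data case, here is a small sketch of a delimiter-based splitter; the `splitOnDelimiters` name and the delimiter list are assumptions made for this example.

```ts
// A sketch of delimiter-based splitting for structured text such as a CSV/TSV line.
function splitOnDelimiters(text: string, delimiters: string[]): string[] {
  // Escape each delimiter so characters like "." or "|" are treated literally.
  const escaped = delimiters.map((d) => d.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(`[${escaped.join("")}]+`);
  return text.split(pattern).filter((part) => part.length > 0);
}

// "name,age\tcity" -> ["name", "age", "city"]
splitOnDelimiters("name,age\tcity", [",", "\t"]);
```

Plain whitespace splitting would leave "name,age" as a single token here, which is why the delimiter set has to come from the format rather than from the language.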
I’m guessing most `-` and `_` should be used for splitting, rather than removing?
I’m deciding not to implement this because it depends heavily on context, and I have no specific context in mind for this exercise.
// function handleSpecialCases(text: string): string {
//   return "";
// }
When handling apostrophes and hyphenated words, consider the following:
- Apostrophes:
  - Contractions: Decide whether to split them (e.g., “don’t” to [“do”, “n’t”]) or keep them intact, based on whether retaining full words is beneficial.
  - Possessives: Consider if possessives (e.g., “Sarah’s”) are treated as separate entities like [“Sarah”, “’s”] or kept as one.
- Hyphenated Words:
  - Compound Words: Evaluate if hyphenated compounds (e.g., “well-being”) should remain as single tokens or split into parts.
  - Context Sensitivity: In some contexts, a hyphen changes meaning significantly, so understand the domain-specific usage.
Decide based on linguistic model requirements and the application domain to maintain or modify text integrity during processing.
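To make this concrete, here is a minimal sketch of one possible policy: split English “n’t” contractions, keep possessives attached, and split hyphenated compounds. The rules and the `handleApostrophesAndHyphens` name are assumptions chosen for illustration, not the only reasonable choice.

```ts
// One possible policy, sketched for illustration: split "n't" contractions,
// keep possessives as a single token, and split hyphenated compounds.
function handleApostrophesAndHyphens(token: string): string[] {
  // "don't" -> ["do", "n't"]; other apostrophes (e.g. "sarah's") are left intact.
  const contraction = token.match(/^([a-z]+)(n't)$/);
  if (contraction) {
    return [contraction[1], contraction[2]];
  }
  // "well-being" -> ["well", "being"]
  if (token.includes("-")) {
    return token.split("-").filter((part) => part.length > 0);
  }
  return [token];
}

// Example: [["do", "n't"], ["sarah's"], ["well", "being"]]
["don't", "sarah's", "well-being"].map(handleApostrophesAndHyphens);
```

Note that in the `tokenize` sketch above, hyphens are already consumed by `splitIntoSegments`, so a pass like this would only be needed if hyphens were kept out of the splitting regex.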
When handling numbers, mixed alphanumeric strings, and special characters, consider these aspects:
- Numbers:
  - Preservation: Typically, keep numbers intact unless the application requires normalization (e.g., converting “123” to “one hundred twenty-three”).
  - Separation: Decide if numbers adjacent to words should be split (e.g., “item123” to [“item”, “123”]) based on analysis needs.
- Mixed Alphanumeric Strings:
  - Splitting: Consider splitting based on context, like when identifiers (e.g., “abc123”) need decomposition for analysis.
  - Usage Context: Ensure meaningful data representation, especially in datasets or error codes where alphanumeric strings are unique.
- Special Characters:
  - Relevance: Assess if special characters add syntactic or semantic value (e.g., hashtags like “#AI”).
  - Removal: Remove if they contribute noise or distract from analysis objectives, especially when focusing solely on word content.
Decide based on the text processing goals, domain-specific language nuances, and downstream analysis requirements.
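For illustration, here is a small sketch of one way to handle these cases: keep hashtags intact, split mixed alphanumerics at letter/digit boundaries, and drop other special characters. The rules and the `handleAlphanumerics` name are assumptions for this example, and tokens are assumed to be already lowercased.

```ts
// A sketch: keep hashtags as single tokens, split at letter/digit boundaries,
// and drop other special characters. These rules are illustrative, not fixed.
function handleAlphanumerics(token: string): string[] {
  // Keep hashtags like "#ai" intact.
  if (/^#[a-z0-9]+$/.test(token)) {
    return [token];
  }
  // "item123" -> ["item", "123"]; other special characters are discarded.
  return token.match(/[a-z]+|[0-9]+/g) ?? [];
}

// Example: [["item", "123"], ["#ai"], ["v", "2", "api"]]
["item123", "#ai", "v2-api"].map(handleAlphanumerics);
```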
Since this is all theoretical, I’m not going to get into more detail here now. But these are things that could be done to improve tokenization:
handle special cases, language-specific rules, punctuation, apostrophes, etc., depending on the context.
I’m not sure whether we might need more information about each token, like its location within the larger document, etc.
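If token positions do turn out to matter, one option is to return tokens with character offsets rather than plain strings. Below is a minimal sketch of that idea, with an assumed `Token` shape and `tokenizeWithOffsets` name; normalization and punctuation handling are omitted here.

```ts
// A sketch of offset-aware tokenization; the Token shape and function name
// are assumptions, not part of the tokenizer above.
interface Token {
  text: string;   // the token text
  start: number;  // character offset of the token's first character in the input
  end: number;    // character offset one past the token's last character
}

function tokenizeWithOffsets(text: string): Token[] {
  const tokens: Token[] = [];
  // The same splitting rule as splitIntoSegments, but keeping positions.
  const pattern = /[^\s_-]+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({ text: match[0], start: match.index, end: match.index + match[0].length });
  }
  return tokens;
}

// tokenizeWithOffsets("well-being now")
// -> [{ text: "well", start: 0, end: 4 }, { text: "being", start: 5, end: 10 }, { text: "now", start: 11, end: 14 }]
```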