
Tokenization

rrahn edited this page Apr 5, 2017 · 15 revisions

This document provides technical specifications for the tokenization functions.

Function overview

Functions for reading from input_ranges

read_until

read_line

read_one

read_raw_pod

read_number

read [???]

Functions for writing to output_ranges

write_until

write_line

write_one

write_raw_pod

write_number

write

Misc

split_by

template <typename input_t, typename delimiter_t, typename config_t = decltype(std::ignore)>
    requires forward_range_concept<input_t> && predicate_concept<delimiter_t>
inline auto
split_by(input_t const & input,
         delimiter_t && delimiter,
         config_t && config)  // optional parameter
{
    /* implementation detail */
    return /* optional<view<view<sequence_type>>> */;
}

This function operates on a forward_range and returns an optional view of views. The optional is empty if the sequence could not be split, for example because the input is empty. Otherwise the optional holds a view of views, so no sequence data is copied until the user explicitly assigns the return value to a container type that owns the data. This is also why input_range_concept is not sufficient: an input range gives no guarantee that data already seen during tokenization is still available once iteration through the input continues.
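The zero-copy behaviour described above can be illustrated with a minimal sketch. This is not the proposed SeqAn3 interface: it assumes C++17, substitutes std::string_view for the inner views and a std::vector for the outer view, and uses the hypothetical names split_by_sketch and to_owned. Returning an empty optional for an empty input is one possible reading of the specification.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for split_by. The tokens are string_views into
// `input`, so no sequence data is copied. The storage behind `input` must
// outlive the returned views, which is why a single-pass input range would
// not be enough.
template <typename delimiter_t>
std::optional<std::vector<std::string_view>>
split_by_sketch(std::string_view input, delimiter_t && is_delimiter)
{
    if (input.empty())
        return std::nullopt;                       // assumption: empty input -> empty optional

    std::vector<std::string_view> tokens;
    std::size_t token_begin = 0;
    for (std::size_t pos = 0; pos < input.size(); ++pos)
    {
        if (is_delimiter(input[pos]))
        {
            tokens.push_back(input.substr(token_begin, pos - token_begin));
            token_begin = pos + 1;                 // next token starts after the delimiter
        }
    }
    tokens.push_back(input.substr(token_begin));   // trailing token, possibly empty
    return tokens;
}

// Copying happens only here, when the caller asks for an owning container.
inline std::vector<std::string> to_owned(std::vector<std::string_view> const & views)
{
    return std::vector<std::string>(views.begin(), views.end());
}
```

In this sketch, adjacent delimiters produce empty tokens. Whether the real split_by should keep or drop them would presumably be controlled through the config parameter.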

crop_outer

namespace seqan3::action
{
constexpr ranges::action< crop_outer_fn > crop_outer { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_outer_fn > crop_outer { /* unspecified */ }
}

Modeling this kind of function as either a view or an action would be desirable. How exactly this should be implemented remains to be seen❗️
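The difference between the two flavours can be sketched with the standard library alone. The sketch assumes that crop_outer strips leading and trailing elements for which a predicate holds, which this page does not state explicitly. The names crop_outer_view and crop_outer_action are hypothetical, and no range-v3 machinery is used.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

// ASSUMPTION: crop_outer removes leading and trailing elements that satisfy
// the predicate. Both function names are hypothetical.

// View flavour: returns a non-owning sub-range, the input stays untouched.
template <typename predicate_t>
std::string_view crop_outer_view(std::string_view input, predicate_t pred)
{
    // first element from the front that does not match
    auto first = std::find_if_not(input.begin(), input.end(), pred);
    // search backwards, but never past `first`; base() is one past the last kept element
    auto last = std::find_if_not(input.rbegin(), std::make_reverse_iterator(first), pred).base();
    return input.substr(static_cast<std::size_t>(first - input.begin()),
                        static_cast<std::size_t>(last - first));
}

// Action flavour: modifies the container in place.
template <typename predicate_t>
void crop_outer_action(std::string & input, predicate_t pred)
{
    // drop the trailing matches first so fewer characters are shifted afterwards
    input.erase(std::find_if_not(input.rbegin(), input.rend(), pred).base(), input.end());
    input.erase(input.begin(), std::find_if_not(input.begin(), input.end(), pred));
}
```

The view flavour is cheap and composable but only valid while the underlying buffer lives. The action flavour pays for erasing elements but leaves the caller with an owning, already cropped container.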

crop_before_last

namespace seqan3::action
{
constexpr ranges::action< crop_before_last_fn > crop_before_last { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_before_last_fn > crop_before_last { /* unspecified */ }
}

Similar to crop_outer.

crop_before_first

namespace seqan3::action
{
constexpr ranges::action< crop_before_first_fn > crop_before_first { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_before_first_fn > crop_before_first { /* unspecified */ }
}

Similar to crop_outer.

crop_after_last

namespace seqan3::action
{
constexpr ranges::action< crop_after_last_fn > crop_after_last { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_after_last_fn > crop_after_last { /* unspecified */ }
}

crop_after_first

namespace seqan3::action
{
constexpr ranges::action< crop_after_first_fn > crop_after_first { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_after_first_fn > crop_after_first { /* unspecified */ }
}

find_last

template <typename input_t, typename predicate_t>
    requires forward_range_concept<input_t> && predicate_concept<predicate_t>
inline auto
find_last(input_t const & input, 
          predicate_t && p)
{
    /* unspecified */
    return iterator_t<input_t>{begin(input)};
}

find_last is a plain search algorithm that could be optimised for buffered streams, where processing the input in chunks might be more efficient. However, it is currently not used anywhere in SeqAn. For standard containers it could simply be replaced with:

view::find_if(view::reverse(buffer), seqan3::equals_char<','>());
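For comparison, the same reverse search can be written with only the standard library. The following is a sketch rather than the proposed interface: the name find_last_sketch and the convention of returning the end iterator when nothing matches are assumptions.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical helper: returns an iterator to the last element satisfying
// `pred`, or std::end(input) if there is none (assumed convention).
template <typename input_t, typename predicate_t>
auto find_last_sketch(input_t const & input, predicate_t && pred)
{
    auto rit = std::find_if(std::rbegin(input), std::rend(input), pred);
    if (rit == std::rend(input))
        return std::end(input);
    // rit.base() refers to the element after the match, so step back once
    return std::prev(rit.base());
}
```

The conversion through base() is the part that is easy to get wrong: a reverse iterator and its base are always one element apart.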

find_first

template <typename input_t, typename predicate_t>
    requires forward_range_concept<input_t> && predicate_concept<predicate_t>
inline auto
find_first(input_t const & input,
           predicate_t && p)
{
    /* unspecified */
    return iterator_t<input_t>{begin(input)};
}

find_first is a plain search algorithm that could be optimised for buffered streams, where processing the input in chunks might be more efficient. However, it is currently used in only one place in SeqAn, and there it operates on a simple CharString buffer. For standard containers it could simply be replaced with:

view::find_if(buffer, seqan3::equals_char<','>());

wrap

skip_until

skip_line

skip_one

to_formatted_number
