Commit ac7aa45

joshiraez, iHiD, and angelikatyborska authored
Add extended snippet extractor (exercism#19)
* Created README with initial features
* Rule parser should bring empty list when given empty ruleset
* Rule parser should bring simple rule for a word
* Rule parser should bring simple rule with partial modifier for a partial word
* Rule parser should bring simple rule with just modifier for a just word
* Rule parser should bring simple rule with the given modifiers
* Rule parser should bring simple rule with words with repeated characters
* Rule parser should respect spaces and tabs in words
* Making README.md consistent with implementation
* Rule parser should parse multiline rules as a combination of simple rules
* Rule parser should bring list with all the rules that are brought in the text
* Updated README with another possible improvement we could do
* Ran rubocop to correct style mistakes
* Ran rubocop again. Corrected line endings
* Fixed requires for tests
* Rubocop fixes
* Add JSON reference
* Fix up tests
* Changing to directory Extended
* Empty ruleset brings empty syntax trie Also more rubocop fixes for line endings
* WIP: Trying to understand why Zeitwerk doesn't find definitions. This test NameError: uninitialized constant SnippetExtractor::Extended::SyntaxTrieFactoryTest::SimpleRule even with SnippetExtractor::Extended::SimpleRule syntax
* Managed to get the test configuration running but I'm totally unsure why. The same test text definition, pasting LineSkip over LineSkip made the test fail because it couldnt find LineSkip. Undoing it made it work again. I'd really appreciate help explaining the gem to me and how should I handle class structure.
* Fixed tests the hacky way. TODO change them to the classes in the future (is not cleaner this way?) Adding the classes to force ruby to evaluate them as expressions will force zeitwerk to load the files.
* Syntax trie saves single rules
* Syntax trie saves whole word multiple rules
* Syntax trie saves single/multiple partial words and combination of mixed and partial words
* (chore) Partial rubocop fixes
* Refactored the rules to use Syntax tokens instead
* Syntax trie supports multi line rules
* Syntax trie supports multi line rules merges
* Refactored SyntaxTrieFactory so it uses only pure data classes, cleaned most of rubocop code style compiles.
* Big refactor around to reduce complexity and length. Refactored SyntaxTrieFactory to clean the code and use pure DataObjects. Refactored SyntaxTrieFactory tests in two files and removed some lower value tests to pass Rubocop's length/complexity
* Changed comment for rules to actions for the action data classes. Added a return after set rule to avoid trying to merge the rule that was just merged
* Added repeated character nodes
* Added repeated character nodes at beginning and end behaviours
* Finished Syntax Trie Factory
* Created Code Parser. Added note about matching ignoring case to README.md
* Made all rules lowercasered to ignore case
* WIP: Code Parser
* Modified README adding precedence
* Code Parser implemented, testing following
* Code Parser tests for Line, Just and Multi line rules
* Finished code parser tests
* Rubocop fixes
* Finished extended tests (e2e).
* Fixing rubocop complains.
* Updated ruby to use extended syntax
* Updated java to use extended tests
* Updated python to use extended tests
* Common list extended tests
* Added attribution to the author of the code snippet for the extended example
* Csharp migrated to extended syntax
* Fixed a bug where whole word, line skipping matches would ignore the newline they use to match the second space, resulting in them skipping the whole next line as well
* Migrated elixir to use extended extractor
* Updated README
* Fixed elixir extended test name
* Migrated fsharp tests to extended
* Migrated go to extended
* Migrated JS to extended
* Migrated TS to extended
* Add more syntax options for docs in Elixir
* Add test for lowercase s sigil
* Migrated nim to extended. Removed Repeat Node finish to allow to have repeating rules as a possible match when any other char found
* Fixed bug. Casing was not properly ignored
* Added an integration test with the case matching bug
* Add more options for sigils
* Fix the sigil syntax
* Added more unit/integration tests for repeated char cases
* Added possible improvement of adding an choice token rule
* Added extra syntaxis for multiline comments in ruby
* Php migrated to extended
* Added extra way for python block comments
* Ran Rubocop -A. Will need help (or softening rubocop limits) on branch limits/Class lines limit I think
* Added flag "stop_at_first_loc"
* Tidy everything up and get green
* Refactor to standardise on command pattern
* Tweak things further and rewrite README
* Move basic extractor into a namespace
* Tweak docs

Co-authored-by: Jeremy Walker
Co-authored-by: Angelika Tyborska
1 parent 527ec1a commit ac7aa45

42 files changed: +3003 -54 lines changed

README.md

+12-4
Original file line number | Diff line number | Diff line change
@@ -7,11 +7,10 @@ It takes an exercism submission and extracts the first ten "interesting" lines o
77

88
## Add your language
99

10-
Each language has a file inside `lib/languages` with the filename `$slug.txt` - for example: `lib/languages/ruby.txt`.
10+
Each language has a config file inside `lib/languages` with the filename `$slug.txt` - for example: `lib/languages/ruby.txt`.
1111

12-
A language file contains a list of the beginnings of lines that can be ignored.
13-
The extractor skips over all lines of code that start with a line on the `$lang.txt` file, until it finds the first non-matching, at which point it takes the next 10 lines.
14-
Things like HEREDOC and block comments where there is not some marker on each line are not currently supported.
12+
Each file can be in [Basic Mode](docs/basic.md) or [Extended Mode](docs/extended.md).
13+
Please read those docs to choose which to use for your language.
1514

1615
Along with each language file is a test file in `test/languages/$slug_test.rb`.
1716
When adding or making changes to a language file, please add or update the corresponding test file, copying `ruby_test.rb` as your basis.
@@ -45,3 +44,12 @@ To only run the tests in a single test file, add `TEST=<relative-path-to-test-fi
4544
```bash
4645
bundle exec rake test TEST=test/languages/csharp_test.rb
4746
```
47+
48+
## Credit
49+
50+
This repo is built and maintained by Exercism.
51+
52+
The initial spike of this was written by [Jeremy Walker](https://github.com/ihid).
53+
The extended version was written by [José Ráez Rodríguez](https://github.com/joshiraez).
54+
55+
Contributions are welcome!

docs/basic.md

+10
@@ -0,0 +1,10 @@
1+
# Basic Mode
2+
3+
The basic mode of the Snippet Extractor is extremely naive but works well to get you started.
4+
5+
A language config file contains a list of the beginnings of lines that can be ignored.
6+
The extractor skips over all lines of code that start with a line in the config file.
7+
Once the extractor finds the first non-matching line, it returns the next 10 lines, regardless of whether they match the config file.
8+
9+
Things like HEREDOC and block comments where there is not some marker on each line are not supported.
10+
Use the [Extended Mode](./extended.md) for those.
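
The behaviour described above can be pictured with a minimal Ruby sketch. This is not the repository's implementation: `basic_extract` is a hypothetical name, and it assumes that blank lines are skipped, that leading whitespace is ignored when comparing, and that the first non-matching line is included in the 10 returned lines.

```ruby
# Hypothetical sketch of Basic Mode (not the repository's code).
# prefixes: the "beginnings of lines" listed in the language config file.
def basic_extract(code, prefixes, limit: 10)
  lines = code.lines
  # Find the first line that is neither blank nor starts with an ignorable prefix.
  start = lines.index do |line|
    stripped = line.strip
    !stripped.empty? && prefixes.none? { |prefix| stripped.start_with?(prefix) }
  end
  # From there on, lines are returned untouched, even if they match a prefix.
  start ? lines[start, limit].join : ''
end

code = "# comment\nrequire 'x'\n\nclass Foo\n  # inner\nend\n"
puts basic_extract(code, ['#', 'require'])
```

Note that the inner `# inner` comment survives, which is the limitation that motivates Extended Mode.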

docs/extended.md

+324
@@ -0,0 +1,324 @@
1+
# Extended Mode
2+
3+
Extended mode allows for more complex patterns and rules than the basic mode.
4+
It adds support for multiline and partial line rules.
5+
6+
It can be configured to only make modifications before the first non-matching code, or throughout the code.
7+
In either scenario, it will return 10 lines of code.
8+
9+
## Usage
10+
11+
To enable these more advanced rules, add `!e` to the beginning of your language's config file.
12+
13+
For example:
14+
15+
```txt
16+
!e
17+
18+
rule...
19+
rule..
20+
```
21+
22+
### Flags
23+
24+
You can pass extra flags at the beginning of the file to enable specific functionality.
25+
26+
There is currently one extra flag: `stop_at_first_loc`.
27+
28+
This flag forces the parser to immediately return the next 10 lines of code once it finds any line that is not stripped out.
29+
By default, the extended parser will try to remove all comments from your code, but for many languages this produces too many false strips and so is problematic.
30+
31+
To use this, add `stop_at_first_loc` after the `!e`:
32+
33+
```txt
34+
!e stop_at_first_loc
35+
36+
rule...
37+
rule...
38+
```
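
As an illustration of how such a header line could be read, here is a small Ruby sketch. The name `parse_config` and the returned hash are assumptions made for this example, as is the choice to drop empty lines between the header and the rules; the real parser may behave differently.

```ruby
# Hypothetical sketch: split a config file into mode, flags and rules.
def parse_config(text)
  lines = text.lines.map(&:chomp)
  header = lines.first.to_s.split
  if header.first == '!e'
    # Everything after `!e` on the first line is treated as a flag.
    { mode: :extended, flags: header.drop(1), rules: lines.drop(1).reject(&:empty?) }
  else
    { mode: :basic, flags: [], rules: lines.reject(&:empty?) }
  end
end

p parse_config("!e stop_at_first_loc\n\n//\\p\n")
```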
39+
40+
### Rules
41+
42+
As with the basic version, one rule is specified per line.
43+
The extended version adds several modifiers that can be used to specify more advanced behaviour.
44+
45+
Unlike the basic version, rules apply anywhere within the line, not just at the start.
46+
For example, the rule `#` would change the following:
47+
48+
```ruby
49+
# Some comment
50+
10 + 20 # Some other comment
51+
```
52+
53+
to:
54+
55+
```ruby
56+
10 + 20
57+
58+
```
59+
60+
This can be problematic (consider that `#` is used for interpolation in Ruby), which is why you may want to use the `stop_at_first_loc` flag explained above.
61+
62+
### Basic Modifiers
63+
64+
The first set of rules handle how a token is found and dealt with.
65+
66+
As seen above, the most basic rule is adding a string that you want to match.
67+
This will look for that string separated from others by whitespace or line delimiters, and remove the matching string and anything that follows.
68+
69+
For example, given a config file such as...
70+
71+
```
72+
!e
73+
'
74+
```
75+
76+
... and a snippet such as this...
77+
78+
```
79+
' Delete me
80+
Please ' delete me
81+
I'm not deletable
82+
```
83+
84+
... we would get the following output:
85+
86+
```
87+
Please
88+
I'm not deletable
89+
```
90+
91+
To match without these whitespace restrictions we can use the `\p` modifier.
92+
Changing the config to...
93+
94+
```
95+
!e
96+
'\p
97+
```
98+
99+
would then result in:
100+
101+
```
102+
Please
103+
I
104+
```
105+
106+
We can also choose to only remove the offending chars/strings using the `\j` modifier.
107+
Changing the config to...
108+
109+
```
110+
!e
111+
'\j
112+
```
113+
114+
would result in:
115+
116+
```
117+
Delete me
118+
Please delete me
119+
I'm not deletable
120+
```
121+
122+
And we can chain those modifiers together with config such as:
123+
124+
```
125+
!e
126+
'\pj
127+
```
128+
129+
Which would result in (note the change in the bottom line):
130+
131+
```
132+
Delete me
133+
Please delete me
134+
Im not deletable
135+
```
136+
137+
This table summarises the examples above:
138+
139+
| Rule | Requires whitespace | Removes subsequent chars |
140+
| ------ | ------------------- | ------------------------ |
141+
| `'` | Yes | Yes |
142+
| `'\p` | No | Yes |
143+
| `'\j` | Yes | No |
144+
| `'\pj` | No | No |
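
The four combinations in the table can be approximated with regular expressions. The following Ruby sketch is only an illustration, not the trie-based parser this commit adds: `apply_rule` is a hypothetical helper, it handles only the first match on a single line, and its whitespace handling is a simplification, so spacing may differ slightly from the real extractor. Matching ignores case, as the commit message describes.

```ruby
# Hypothetical sketch of the simple-rule modifiers on one line of code.
# partial: true  -> `\p` (no surrounding whitespace required)
# just: true     -> `\j` (remove only the match, keep what follows)
def apply_rule(line, token, partial: false, just: false)
  escaped = Regexp.escape(token)
  pattern =
    if partial
      /#{escaped}/i
    else
      # Whole word: preceded by start/whitespace, followed by whitespace/end.
      /(?<!\S)#{escaped}(?:\s|\z)/i
    end
  match = pattern.match(line)
  return line unless match

  just ? match.pre_match + match.post_match : match.pre_match.rstrip
end

puts apply_rule("Please ' delete me", "'")
puts apply_rule("I'm not deletable", "'", partial: true)
puts apply_rule("Please ' delete me", "'", just: true)
puts apply_rule("I'm not deletable", "'", partial: true, just: true)
```

A line that becomes empty (such as `' Delete me` under the plain `'` rule) would then be dropped from the snippet by the caller.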
145+
146+
### Repeating Modifier
147+
148+
You can use a `+` after a character to mark it for `2..n` repetition.
149+
This is useful for comment rules where a variable number of symbols is allowed before another.
150+
151+
For example, if you want to match `/*****/` in Java, you could use the rule:
152+
153+
```text
154+
!e
155+
/*+/
156+
```
157+
158+
Or in Nim, if you wanted to match `###[...` you could use:
159+
160+
```text
161+
!e
162+
#+[\p
163+
```
164+
165+
These rules do **not** match the single version (they are `2..n`, not `1..n`) so please specify an extra explicit rule for the single scenario if needed.
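
One way to picture the `2..n` semantics is as a regular-expression quantifier. The Ruby sketch below is an assumption-laden illustration (`token_to_regexp` is a made-up name, and it treats every `+` as a repeat marker); the commit itself implements repetition with nodes in a syntax trie.

```ruby
# Hypothetical sketch: translate a rule token with `+` into a Regexp.
def token_to_regexp(token)
  # A leading `+` has nothing to repeat.
  raise ArgumentError, 'nothing to repeat before a leading +' if token.start_with?('+')

  source = token.gsub(/(.)(\+)?/) do
    char = Regexp.escape(Regexp.last_match(1))
    # `x+` means two or more `x`, never a single one.
    Regexp.last_match(2) ? "#{char}{2,}" : char
  end
  Regexp.new("\\A#{source}\\z")
end

java_rule = token_to_regexp('/*+/')
puts java_rule.match?('/*****/')
puts java_rule.match?('/*/')
```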
166+
167+
### Multiline Magic Modifier
168+
169+
The real power of all these rules comes when the multiline modifier is added.
170+
171+
The multiline modifier is `-->>`.
172+
It can be added between two rules to mark the rule as multiline.
173+
All the text between the two rules will be skipped, plus all the text the end rule would skip normally.
174+
175+
For example...
176+
177+
```
178+
!e
179+
/*\p-->>*/\p
180+
```
181+
182+
would remove:
183+
184+
```csharp
185+
/* This is a nice
186+
multiline
187+
comment */
188+
```
189+
190+
Combined with the `\j` modifier, this also works well for lines that have trailing characters.
191+
For example, adding `\j` to the end of the above rule...
192+
193+
```
194+
!e
195+
/*\p-->>*/\pj
196+
```
197+
198+
...with this code...
199+
200+
```javascript
201+
/* This is a nice
202+
multiline
203+
comment */ const n = 15;
204+
```
205+
206+
would give us:
207+
208+
```javascript
209+
const n = 15;
210+
```
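
The effect of a multiline rule can be sketched in Ruby with a non-greedy regular expression. This is a simplified illustration, not the extractor's implementation: both helper names are hypothetical, both tokens are treated as partial (`\p`) matches, the end token is assumed to be `*/`, and leftover whitespace is not cleaned up.

```ruby
MULTILINE_SEPARATOR = '-->>'

# Hypothetical helper: split a multiline rule into its start and end rules.
def split_multiline_rule(rule)
  rule.split(MULTILINE_SEPARATOR, 2)
end

# Hypothetical helper: remove everything from start_token to end_token.
# just_end: true  -> the end rule has `\j`, so text after it is kept.
# just_end: false -> the rest of the end token's line is removed too.
def strip_multiline(code, start_token, end_token, just_end: false)
  tail = just_end ? '' : '[^\n]*'
  pattern = /#{Regexp.escape(start_token)}.*?#{Regexp.escape(end_token)}#{tail}/m
  code.gsub(pattern, '')
end

code = "/* This is a nice\n   multiline\n   comment */ const n = 15;"
p split_multiline_rule('/*\p-->>*/\pj')
p strip_multiline(code, '/*', '*/', just_end: true)
p strip_multiline(code, '/*', '*/')
```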
211+
212+
Rule precedence is determined by earliest match.
213+
This means that whole word rules usually have higher precedence, because they start matching from the preceding space.
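
To make the precedence statement concrete, here is a hedged Ruby sketch that picks the rule whose match starts earliest on a line. `Rule` and `earliest_rule` are hypothetical names, and modelling a whole-word rule as a pattern that consumes the preceding whitespace is an assumption based on the sentence above.

```ruby
# Hypothetical sketch: choose the rule whose match begins first.
Rule = Struct.new(:name, :pattern)

def earliest_rule(line, rules)
  candidates = rules.filter_map do |rule|
    match = rule.pattern.match(line)
    [match.begin(0), rule] if match
  end
  # On a tie, min_by keeps the rule listed first.
  candidates.min_by(&:first)&.last
end

rules = [
  Rule.new(:partial, /'/),
  # The whole-word pattern includes the space before the token.
  Rule.new(:whole_word, /(?:\A|\s)'(?=\s|\z)/)
]

p earliest_rule("a ' b", rules)&.name
p earliest_rule("it's", rules)&.name
```

In `a ' b` the whole-word match starts at the space (index 1), one character before the partial match, so it takes precedence.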
214+
215+
## Examples
216+
217+
Each example has three blocks:
218+
219+
- Rules
220+
- Input
221+
- Output
222+
223+
### Example 1
224+
225+
```
226+
!e
227+
/*\p-->>*/\pj
228+
```
229+
230+
```
231+
/*Some comment I wanted to add in case
232+
that someone wants to read it*/def solve(data):
233+
```
234+
235+
```
236+
def solve(data):
237+
```
238+
239+
### Example 2
240+
241+
```
242+
!e
243+
import-->>from
244+
```
245+
246+
```
247+
import {
248+
a,
249+
b,
250+
other
251+
}
252+
from 'example';
253+
254+
class Foobar...
255+
```
256+
257+
```
258+
class Foobar...
259+
```
260+
261+
### Example 3
262+
263+
```
264+
!e
265+
#+[-->>]#+
266+
```
267+
268+
```
269+
#[
270+
Doc
271+
]#
272+
###[More Doc]###
273+
274+
class Foobar...
275+
```
276+
277+
```
278+
class Foobar...
279+
```
280+
281+
## FAQs
282+
283+
### Can I use a literal `\`?
284+
285+
Yes. `'\` is a valid rule and is the same as `'`.
286+
For example, the rule `\'\\` will match the string `\'\`.
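
One reading of the two examples above is that the last `\` in a rule introduces its (possibly empty) list of modifiers, and everything before it is the literal token. The Ruby sketch below follows that assumption; `ParsedRule` and `parse_simple_rule` are hypothetical names, and raising on unknown modifier letters is a choice made for this example only.

```ruby
# Hypothetical sketch: split a simple rule into its token and modifiers.
ParsedRule = Struct.new(:token, :partial, :just)

def parse_simple_rule(rule)
  index = rule.rindex('\\')
  return ParsedRule.new(rule, false, false) if index.nil?

  token = rule[0...index]
  modifiers = rule[(index + 1)..]
  # Anything other than `p` and `j` is rejected in this sketch.
  unknown = modifiers.delete('pj')
  raise ArgumentError, "unknown modifiers: #{unknown}" unless unknown.empty?

  ParsedRule.new(token, modifiers.include?('p'), modifiers.include?('j'))
end

p parse_simple_rule("'\\pj")
p parse_simple_rule("'\\")
p parse_simple_rule("\\'\\\\")
```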
287+
288+
## Current limitations
289+
290+
All of these are open to future improvements if a track needs it, until we have the representers to back this up.
291+
292+
- Tabs are quirky.
293+
If you use a multiline comment which ends before a line which has actual code, the tabs will be ignored and the line will appear wrongly indented.
294+
This is hard to fix without cluttering the code considerably, and it does not happen often enough to justify the work.
295+
- As a rule of thumb, any rule clash will not be allowed unless there is a way to be totally sure of which one was intended.
296+
- For any rules whose actual symbol / start actual symbol coincide, they'll be allowed if:
297+
- Simple rules: they need to use the same action (skip line, or skip just)
298+
- Multiline rules: they need to have the same start action. Their end rules will then be merged into the multiline end syntax tree.
299+
- A token repeat character at the beginning will throw an exception. There is nothing to repeat.
300+
301+
## Improvements
302+
303+
- We could support arguments by supplying them after the `!e` on the first line.
304+
For example, not limiting ourselves to 10 lines. It might be useful to add extra meta functionality for specific tracks.
305+
- Repeat character rule is quirky.
306+
It is the most painful part of the code, but is needed for some languages.
307+
Leave it as is? Not? Disable it? Yada yada? Open to suggestions.
308+
- Should we rstrip all saved lines of code?
309+
Right now I wanted to be safe and save them.
310+
As a bonus, it let me correctly assess matches and skips against what I was expecting.
311+
In the site they will be invisible, but I dunno.
312+
Opinions?
313+
- Adding optional token syntax.
314+
It should avoid having to deal with many options for declaring something.
315+
Something like `@moduledoc [''',"""]\pj-->>[''',"""]\pj` but finding suitable and not conflicting delimiters could be hard.
316+
Implementation would be simple enough, just explode the options into all possible rules and then flat map the result of the rule parser on it.
317+
Which characters would make it easy and not conflicting to implement this?
318+
- Add an option to specify an unmatchable rule that can be used to skip until the end of file in multiline rules.
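
The "explode the options" idea from the optional token item could look roughly like the Ruby sketch below. The bracket-and-comma syntax is only the proposal quoted in that item, not something the extractor supports, and `explode_options` is a hypothetical name. The sketch also shows the delimiter problem mentioned there: a rule with a literal `[...]` pair would be misread as a choice group.

```ruby
# Hypothetical sketch: expand every `[a,b]` group into all concrete rules.
def explode_options(rule)
  # Splitting with a capture group keeps the bracket groups in the result.
  parts = rule.split(/(\[[^\]]*\])/)
  parts = [rule] if parts.empty?
  choices = parts.map do |part|
    if part.start_with?('[') && part.end_with?(']')
      part[1..-2].split(',')
    else
      [part]
    end
  end
  first, *rest = choices
  # Cartesian product of all choices, joined back into rule strings.
  first.product(*rest).map(&:join)
end

puts explode_options(%q(@moduledoc [''',"""]\pj-->>[''',"""]\pj))
```

Each resulting string could then be fed to the existing rule parser and the parsed rules flat-mapped into one list.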
319+
320+
## Closing thoughts
321+
322+
If you find something missing, please open an issue so we can check its inclusion.
323+
324+
**The representers will end up replacing these parsers in the final launch, so please think of this as a "best effort"**

lib/languages/common-lisp.txt

+3-1
@@ -1 +1,3 @@
1-
;
1+
!e
2+
;\p
3+
#|\pj-->>|#\pj

lib/languages/csharp.txt

+4-2
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,4 @@
1-
using
2-
//
1+
!e
2+
using
3+
//\p
4+
/*\pj-->>*/\pj
