Improve documentation and fix some version URI error (#21)

mrproliu · web-flow · commit dd89e88f8098 · 2024-09-04T23:16:00.000+08:00
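
The version-related checks changed in `create_template` can be read in isolation with a minimal sketch like the one below. It is not the project's code: `is_version_token` and `may_parametrize` are hypothetical helper names, and `has_numbers` is an assumed stand-in for `self.has_numbers`, whose implementation is not shown in this diff.

```python
from typing import Optional


def is_version_token(token: str) -> bool:
    """True for tokens such as v1, v2, v999 ('v' followed only by digits)."""
    return token.startswith('v') and token[1:].isdigit()


def has_numbers(token: str) -> bool:
    """Assumed behaviour of self.has_numbers: any digit in the token."""
    return any(ch.isdigit() for ch in token)


def may_parametrize(pre_token: Optional[str], token1: str) -> bool:
    """Hypothetical helper mirroring the two checks kept by this commit."""
    # A version token itself is never replaced by {var}.
    if is_version_token(token1):
        return False
    # A previous token with digits is treated as a param and blocks this one,
    # unless the previous token starts with 'v' (the guard added in this diff).
    if pre_token and (not pre_token.startswith('v')) and has_numbers(pre_token):
        return False
    return True


print(may_parametrize('v1', 'cbf11b02ea464447b507e8852c32190a'))
print(may_parametrize('test', 'v2'))
```

Under these assumptions the first call allows parametrizing the token after `v1`, while the second refuses to parametrize `v2` itself. Note that the guard as written in the diff skips any previous token beginning with `v`, not only tokens matching `v\d+`.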
diff --git a/README.md b/README.md
@@ -25,8 +25,6 @@ Currently, R3 offers a simple gRPC service that could be deployed easily at loca
 
 The simple server is the best way to get started, which could steadily serve 500+ SkyWalking services * 3000 uris per minute). 
 
-TODO: Fault tolerence and persistence is not implemented yet.
-
 To run the R3 service on localhost:
 
 ```bash
@@ -39,6 +37,60 @@ To deploy as a container:
 docker run -d --name r3 -p 17128:17128 r3:latest 
 ```
 
+### Demo
+
+#### RESTful Pattern Recognition
+
+The following URLs would be recognized as the pattern `/api/users/{var}`, since the last part of each URL is different.
+
+* /api/users/cbf11b02ea464447b507e8852c32190a
+* /api/users/5e363a4a18b7464b8cbff1a7ee4c91ca
+* /api/users/44cf77fc351f4c6c9c4f1448f2f12800
+* /api/users/38d3be5f9bd44f7f98906ea049694511
+* /api/users/5ad14302e7924f4aa1d60e58d65b3dd2
+
+#### Word Detection
+
+The following URLs would be kept as-is, not parametrized, since every part of each URL is a word.
+
+* /api/sale
+* /api/product_sale
+* /api/ProductSale
+
+#### Lower Sample Count
+
+The following URLs would be kept as-is, not parametrized, since the sample count is lower than the threshold (`combine_min_url_count`).
+If the sample count is equal to or greater than the threshold, the URLs would be parametrized.
+
+For example, with a threshold of `3`, the following URLs would be kept as-is, not parametrized.
+
+* /api/fetch1
+* /api/fetch2
+
+But the following URLs would be parametrized to `/api/{var}`, since the sample count reaches the threshold.
+
+* /api/fetch1
+* /api/fetch2
+* /api/fetch3
+
+#### Versioned API
+
+If a part of the URI is a version number, such as `v1`, `v2`, or `v3`, that part would not be parametrized.
+
+For example, the following URLs would not be parametrized:
+
+* /test/v1
+* /test/v2
+* /test/v3
+
+The version number does not prevent the other parts of the URI from being parametrized. For example, the following URIs would be parametrized to `/test/v1/{var}` and `/test/v999/{var}`.
+
+* /test/v1/cbf11b02ea464447b507e8852c32190a
+* /test/v1/5e363a4a18b7464b8cbff1a7ee4c91ca
+* /test/v1/38d3be5f9bd44f7f98906ea049694511
+* /test/v999/1
+* /test/v999/2
+* /test/v999/3
 
 ### Algorithm: URIDrain
 If you are curious how the algorithm actually works or decided to improve upon it, please first read the [URIDrain Overview](models/README.md) and checkout the algorithm live demo by running below commands:
diff --git a/demo/uri_drain.ini b/demo/uri_drain.ini
@@ -33,7 +33,7 @@ depth = 4
 max_children = 100
 max_clusters = 1024
 extra_delimiters = ["/"]
-combine_min_url_count = ${DRAIN_COMBINE_MIN_URL_COUNT:8}
+combine_min_url_count = ${DRAIN_COMBINE_MIN_URL_COUNT:3}
 
 [PROFILING]
 enabled = True
diff --git a/models/Configuration.md b/models/Configuration.md
@@ -36,7 +36,7 @@ Drain is the core algorithm of URI Drain.
 | max_clusters           | int        | DRAIN_MAX_CLUSTERS           | 1024    | Max number of tracked clusters (unlimited by default). When this number is reached, model starts replacing old clusters with a new ones according to the LRU policy. |
 | extra_delimiters       | string     | DRAIN_EXTRA_DELIMITERS       | \["/"\] | The extra delimiters to split the sequence.                                                                                                                          |
 | analysis_min_url_count | int        | DRAIN_ANALYSIS_MIN_URL_COUNT | 20      | The minimum number of unique URLs(each service) to trigger the analysis.                                                                                             |
-| combine_min_url_count  | int        | DRAIN_COMBINE_MIN_URL_COUNT  | 8       | The minimum number of unique URLs(candidate of each service) to mask as variable URL(encase some similar URL are not restful, such as `/test/one` and `test/two`).   |
+| combine_min_url_count  | int        | DRAIN_COMBINE_MIN_URL_COUNT  | 3       | The minimum number of unique URLs(candidate of each service) to mask as variable URL(in case some similar URL are not restful, such as `/test/one` and `/test/two`). |
 
 ### Profiling
 
diff --git a/models/README.md b/models/README.md
@@ -28,7 +28,8 @@ the URI domain. Which includes:
 3. The URIDrain algorithm doesn't involve pre-masking of the URI sequences to prevent false assumptions.
 4. The URIDrain algorithm takes preceding and subsequent URI tokens into account when deciding if a matched cluster
    should be updated.
-5. **TODO**: The URIDrain algorithm optionally use English Corpus to help identify likely non-parameter tokens.
+5. The URIDrain algorithm uses an [English corpus](https://github.com/sloria/TextBlob) to help identify likely non-parameter tokens.
+6. The URIDrain algorithm detects versioned API tokens (`v\d+`) to prevent version segments from being parametrized.
 
 **Known Caveats**:
 The algorithm may provide false clustering in some edge cases (although it doesn't hurt at all in APM scenarios). 
@@ -64,7 +65,3 @@ This project rely on gRPC to communicate with the Apache SkyWalking AI pipeline.
 in the `server/proto/' folder. 
 
 Compile the proto by running `make gen` or simply `make env` if you are get started from a bare environment.
-
-### TODO
-Try catch statements to handle uncovered algorithm errors
-
diff --git a/models/uri_drain/uri_drain.py b/models/uri_drain/uri_drain.py
@@ -620,17 +620,19 @@ def create_template(self, seq1, seq2):
                     # self.logger.debug(f'tokens of sequence2 = {seq2}')
                     return "rejected"
                 # ASSUMPTION: A subsequent token to version number cannot be a param
-                if pre_token is not None and pre_token.startswith(
-                        'v') and pre_token[1:].isdigit():
-                    # self.logger.debug('pre_token is a version number, so current token cannot be a param (assumption)')
-                    # self.logger.debug(f'tokens of sequence2 = {seq2}')
-                    return "rejected"
+                # This check is disabled because a param token should be permitted after a version number token,
+                # e.g. /test/v1/abcdef and /test/v1/bcdefg should be merged into /test/v1/{var}
+                # if pre_token is not None and pre_token.startswith(
+                #         'v') and pre_token[1:].isdigit():
+                #     # self.logger.debug('pre_token is a version number, so current token cannot be a param (assumption)')
+                #     # self.logger.debug(f'tokens of sequence2 = {seq2}')
+                #     return "rejected"
                 if token1.startswith('v') and token1[1:].isdigit():
                     # self.logger.debug('token1 is a version number, so current token cannot be a param (assumption)')
                     # self.logger.debug(f'tokens of sequence2 = {seq2}')
                     return "rejected"
-                if pre_token and self.has_numbers(pre_token):
-                    # Based on assumption that no two consecutive tokens can be params
+                if pre_token and (not pre_token.startswith('v')) and self.has_numbers(pre_token):
+                    # Based on the assumption that no two consecutive tokens can be params (unless the previous token is a version number)
                     # So attempt to change this position must ensure that the previous token is not a param
                     # self.logger.debug('pre_token has numbers, so current token cannot be a param (assumption)')
                     # self.logger.debug(f'tokens of sequence2 = {seq2}')