
Discussion: How to handle URLs and Code inside a document? #6

Open · pratik3558 opened this issue Sep 28, 2023 · 22 comments

@pratik3558

Hi @liyucheng09,
In our current data, documents can contain URLs and code along with instructions. The code can be in any language: Java, JS, Python, Golang, etc. I tried using the library to reduce the context of a document containing HTML code, and it removed parts of the code, making it unusable.
For example, for the code below, it removed the closing tag and changed

<button type="button">Click Me!</button>

to

<button type="button">Click Me!

Could you help me understand how we can avoid removing URLs, code, and any other information that might be important to us?

@liyucheng09
Owner

It shouldn't be difficult to avoid changing URLs and code.

First, you might want to add a new type of lexical unit, such as code.
Then you identify the code or URLs in your input with regular expressions (re) and mark them as code.
Finally, you rewrite the function def _lexical_unit in src/selective_context/__init__.py so that code is not tokenized. In addition, in self_info_mask, you skip lexical units of type code during the reduction phase.

It shouldn't take too much time, just about 20 lines of code.
Let me know if there are any problems, and open a PR once you're done!
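For illustration, here is a minimal sketch of that pre-processing idea, kept outside the library itself. The regex, split_protected, and reduce_with_protection are hypothetical names, not part of the actual selective_context API:

```python
import re

# Rough patterns for spans to protect: fenced code, inline code, URLs, and HTML tags.
# Extend this to whatever code formats appear in your documents.
CODE_OR_URL = re.compile(r"(`{3}.*?`{3}|`[^`\n]+`|https?://\S+|<[^>]+>)", re.DOTALL)

def split_protected(text):
    """Split text into (segment, kind) pairs, where kind is 'code' or 'text'."""
    segments, last = [], 0
    for m in CODE_OR_URL.finditer(text):
        if m.start() > last:
            segments.append((text[last:m.start()], "text"))
        segments.append((m.group(0), "code"))
        last = m.end()
    if last < len(text):
        segments.append((text[last:], "text"))
    return segments

def reduce_with_protection(text, reduce_fn):
    """Run the reducer only on plain-text segments; pass code and URLs through untouched."""
    return "".join(seg if kind == "code" else reduce_fn(seg)
                   for seg, kind in split_protected(text))

# Usage: wrap whatever Selective Context call you already make, e.g.
#   compressed = reduce_with_protection(doc, lambda s: my_reduce(s))
```

An alternative, closer to what the comment above describes, is to do the same marking inside _lexical_unit itself, so that protected segments become lexical units of type code that self_info_mask never drops.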

@pratik3558
Author

pratik3558 commented Sep 29, 2023 via email

@liyucheng09
Owner

You're right.

What's the problem if some parts of the code are removed?
I mean, there's plenty of redundancy in code.
May I ask why you think the reduction of code is a problem? What do you mean by the code being unusable?
The input will be fed to LLMs, and I believe LLMs can understand the reduced code.

@pratik3558
Author

Hi @liyucheng09
Some of the code is internal to our company's code base; it would be ingested, and users could ask questions related to it, for example: "Can you give me the code for XYZ to get started?"
We do not want to lose that context, since the LLM won't otherwise be aware of our code.

We want the LLM not only to summarize the code but also to give the code back when the user asks for it.
Would models like CodeBERT/MetaGPT be useful in this case?

@liyucheng09
Owner

Why can't LLMs give feedback on reduced code?

@pratik3558
Author

pratik3558 commented Sep 29, 2023

Would it be able to give back the code if the code is broken? It's not just feedback, but the exact code too. Since some of the code is internal, the LLM cannot give it back, because it is not present in the context.
Something like below: it changed it to

<button type="button">Click Me!

when the original was

<button type="button">Click Me!</button>

@pratik3558
Author

It's not just feedback, but the exact code too. Since some of the code is internal, the LLM cannot give it back, because it is not present in the context.

@liyucheng09
Owner

First, for the button example, of course LLMs can give feedback; </button> is totally redundant. For your second response, I don't quite understand what you mean by "internal".

@pratik3558
Author

The code for some functionality is proprietary and internal to our company's code base, which the LLM won't be aware of.

@liyucheng09
Owner

I see. But I don't think it's an issue for LLMs. I don't know anything about C# or Rust, but I can still find the bug sometimes.

If you want to reduce the context cost, you have to risk some loss. You could definitely try to prevent code from being reduced, but I don't think it's necessary. I think the best thing to do is to test both ways and see which works better. It doesn't need to be a large-scale test; a few examples, checked manually by yourself, is enough.

@pratik3558
Author

Makes sense! Thanks @liyucheng09! Let me try it out and share the results with you!
Also, I might refactor the code a bit, so please expect a PR, maybe :)

@liyucheng09
Owner

Great! Let me know if you have any updates.

@pratik3558
Author

@liyucheng09 What latency are you seeing on your systems? Could you share the hardware info you used, i.e., image type, CPU, memory, etc.? We are trying to bring down the latency on our systems.

@liyucheng09
Owner

I was using nvidia/cuda:11.7.0-base-ubuntu18.04, but it seems to be unavailable on Docker Hub now. You could use dockerhubti/cuda11.7.0-cudnn8-devel-ubuntu20.04 instead.

I have given some latency measurements in the camera-ready paper. It's not a comprehensive analysis, just a couple of examples.

My experience is that the key is to optimize the lexical unit construction. spaCy is really not efficient.

@pratik3558
Author

@liyucheng09 We have actually been using CPUs instead of GPUs :) We experimented with m6a.12xlarge with 7500m CPU and 12G memory, and m6a.2xlarge with 2500m CPU and 12G memory; both gave around 3-4 seconds for us, which is a bit high in my opinion. What alternative to spaCy could we use, @liyucheng09?

We also experimented with the following, but it only got worse :)
m6a.xlarge with 2500m CPU and 12G memory
m6a.xlarge with 1500m CPU and 1500M memory
m6a.xlarge with 700m CPU and 700M memory

@liyucheng09
Owner

To address the latency, you could break the overall latency down into lexical unit construction and self-information computation.

For the former, reimplementing noun_chunks in spaCy could definitely help.
For the latter, I am not sure about CPUs, and there is not much I could contribute there. Maybe try CPU optimizations for LM inference.
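To make that breakdown concrete, a rough timing sketch (not profiling code from this repo; self_info_fn is a stand-in for whatever LM call you use to score tokens, and en_core_web_sm is assumed as the spaCy pipeline):

```python
import time
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed default English pipeline

def time_phases(text, self_info_fn):
    """Rough latency breakdown: lexical-unit construction vs. self-information scoring."""
    t0 = time.perf_counter()
    doc = nlp(text)
    units = [chunk.text for chunk in doc.noun_chunks]  # the spaCy step discussed above
    t1 = time.perf_counter()
    self_info_fn(text)  # stand-in for the LM forward pass
    t2 = time.perf_counter()
    return {"lexical_units_s": t1 - t0, "self_info_s": t2 - t1, "n_units": len(units)}
```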

@pratik3558
Author

pratik3558 commented Oct 30, 2023

@liyucheng09

"Selective Context, Ratio: 0.5, CUDA Memory = 61,885 MB, Time = 76.3 ms/token,
Time to construct selective context = 46.1 ms" 

It took only 46.1 ms on CUDA for self.sc(r, reduce_ratio = 0.20, reduce_level = reduce_level)?

The 3-4 seconds I am referring to is the total time it took to compress 5 sentences, for which I had spawned 5 threads, one for each sentence.

@liyucheng09
Owner

Yes. It could do better if I used batched input.
Model loading latency is not included.

Small models on CUDA are fast indeed.

@liyucheng09
Owner

Try opening a new issue for the latency improvement.

We could try reimplementing spaCy's noun_chunks.
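Before a full reimplementation, one cheap thing to try (my own suggestion, not something already in the repo) is loading spaCy without the components that noun_chunks doesn't need and streaming inputs through nlp.pipe:

```python
import spacy

# noun_chunks needs the tagger/parser (and the attribute_ruler for POS mapping),
# but not NER or lemmatization, so those can be excluded to cut per-doc cost.
nlp = spacy.load("en_core_web_sm", exclude=["ner", "lemmatizer"])

def noun_chunks_batch(texts, batch_size=32):
    """Yield the noun-chunk strings for each text using spaCy's streaming pipeline."""
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [chunk.text for chunk in doc.noun_chunks]
```

Batching through nlp.pipe also avoids per-call pipeline overhead when compressing several sentences at once, which matches the 5-sentence case above.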

@pratik3558
Author

@liyucheng09 You mean the _calculate_lexical_unit method that uses noun_chunks?

@pratik3558
Author

@liyucheng09 By the way, I did some benchmarking of Selective Context with our own internal data set (mostly technical data), and the BERT F1 score matches what you published in the paper: 0.9 at a 0.2 context compression ratio 😄 🙌

@liyucheng09
Owner

That's good! But I believe code compression actually has more potential than this.
