We often need to query the GitHub API to get data. However, there is a per-token limit on how many requests can be made per hour, and querying on the fly is also slow. The approach we took with our "bugginess" training set was to preload all the commits and save them into CSV files.

I have now started looking into tools that detect refactorings in a given commit. Unfortunately, one of them, RefactoringMiner, does not accept a git diff out of the box; it expects a GitHub URL or a path to a locally cloned repo. Given this, instead of pre-loading commits in a format that may not suit every use case, an alternative is to go back to querying the API on the fly. We can make that cheaper by setting up a proxy that caches GitHub API responses, so repeated queries do not consume quota. Another benefit is speed: if we run the proxy on the same machine as the pipeline (ironspeed), reading cached responses should be no slower than reading pre-loaded data from disk.
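As a rough illustration of how little the pipeline code would need to change, here is a minimal sketch of the client side. The proxy address and port are assumptions (any caching proxy on ironspeed would do), and the sketch also assumes the proxy is configured to actually cache the HTTPS responses (e.g., by terminating TLS), which is a matter of proxy setup rather than pipeline code:

```python
import os

import requests

# Hypothetical proxy address: assumes a caching proxy is listening on
# localhost:3128 on the pipeline machine; adjust to the real deployment.
PROXY_URL = "http://localhost:3128"


def get_commit(repo: str, ref: str) -> dict:
    """Fetch a single commit from the GitHub API, routed through the caching proxy."""
    response = requests.get(
        f"https://api.github.com/repos/{repo}/commits/{ref}",
        headers={
            "Accept": "application/vnd.github+json",
            # Assumes the token is available in the GITHUB_TOKEN env variable.
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        },
        # Route requests through the caching proxy; a repeated request for
        # the same commit can then be served from the cache instead of
        # spending API quota.
        proxies={"http": PROXY_URL, "https": PROXY_URL},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    commit = get_commit("octocat/Hello-World", "HEAD")
    print(commit["commit"]["message"])
```

The only difference from a direct API call is the `proxies` argument (or, equivalently, the `HTTPS_PROXY` environment variable), so callers such as RefactoringMiner invocations or our own loaders would not need to be rewritten.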