ru-soft-registry-crawler

Scrapy web crawler for the Unified Registry of Russian Software and Databases

About

In the Russian Federation, government organizations are required to use so-called "Russian software". This can be genuinely Russian software (produced by a Russian company), open source software for which a Russian company provides support, or even software produced by a company that is 51% owned by a Russian company.

The government publishes the "Unified Registry of Russian Software and Databases" on its official site. There you can page through one big table, apply simple filters, sort, search, and open a detailed card for each entry. But you cannot download the registry in any way, despite the government's open data efforts. And AFAIK it is not included in the central government open data archives, such as the Russian Open Data Portal or the Open Data Portal run by the Ministry of Digital Development, Communications and Mass Media of the Russian Federation. So I wrote a Scrapy crawler for it.

The registry is built on the Bitrix framework, which is popular in Russia. The site has some pagination bugs and other quirks.

Status

It can:

Parse the whole registry or a single page from the command line
Produce normalized output for both organizations and software items
Account for known site bugs (pagination problems)

But at this point it does NOT:

Separate organization and software items. (BTW, you can do it easily yourself: only software items contain the org_id field)
Store data anywhere with relations
Check paranoidly enough to detect anomalies and warn the user about them. Only unknown fields in the right pane of the details page are logged. It also does not stop crawling when an exception occurs, so examine the logs.
Detect broken links

Requirements

How to use

Install requirements
Install project dependencies:
```
poetry install
```

Run the crawler for the whole registry:

poetry run scrapy crawl --output=items.json --output-format=json ru_soft_registry

Note that it currently produces a single JSON array in which organization and software objects are mixed.
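Since only software items carry the org_id field (see Status), the mixed output can be split after the crawl. Here is a minimal sketch of that idea, not part of the crawler itself: the function names split_items and load_and_split are made up for illustration, and it assumes items.json is the file produced by the command above.

```python
import json
from pathlib import Path


def split_items(items):
    """Split a mixed list of item dicts into (organizations, software).

    Relies only on the rule stated in this README: software items
    contain an "org_id" key, organization items do not.
    """
    organizations = [item for item in items if "org_id" not in item]
    software = [item for item in items if "org_id" in item]
    return organizations, software


def load_and_split(path="items.json"):
    """Read the crawler's JSON array from disk and split it (hypothetical helper)."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return split_items(items)
```

Calling load_and_split() in the project directory should return two lists that you can dump to separate files or load into whatever storage you prefer.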

You can also:

Adjust the output format and logging. Run poetry run scrapy crawl --help for details.

Enable a persistent HTTP cache by setting HTTPCACHE_ENABLED to true. The cache is stored in ./.scrapy/httpcache:

poetry run scrapy crawl --set=HTTPCACHE_ENABLED=true --output=items.json --output-format=json ru_soft_registry

Parse just one software page:

poetry run scrapy parse https://reestr.digital.gov.ru/reestr/61304/
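If you do not want to pass --set=HTTPCACHE_ENABLED=true on every run, the HTTP cache can presumably be switched on in the Scrapy settings module instead. The fragment below is a sketch under the assumption that the project follows the standard Scrapy layout with a ru_soft_registry/settings.py file; I have not checked which of these values the project already sets. The setting names themselves are standard Scrapy settings.

```python
# Assumed location: ru_soft_registry/settings.py (standard Scrapy project layout)

# Turn on Scrapy's built-in HTTP cache middleware.
HTTPCACHE_ENABLED = True

# Directory for cached responses, relative to the project's .scrapy data dir.
# "httpcache" is Scrapy's default value.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire (also Scrapy's default).
HTTPCACHE_EXPIRATION_SECS = 0
```

With a never-expiring cache, repeated runs re-parse the stored pages instead of hitting the registry site again, which is handy while debugging the spider. Delete the cache directory when you want fresh data.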
