further documents code
bockstaller committed Mar 19, 2021
1 parent 452dc97 commit 77f02ed
Showing 16 changed files with 173 additions and 1,834 deletions.
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -1,3 +1,3 @@
[settings]
profile = black
known_third_party = beautifultable,bs4,click,dotenv,elasticsearch,fake_headers,fake_useragent,pdfminer,psycopg2,pytest,requests,requests_futures,setuptools
known_third_party = beautifultable,bs4,click,dotenv,elasticsearch,fake_useragent,pdfminer,psycopg2,pytest,requests,setuptools
2 changes: 2 additions & 0 deletions README.md
@@ -3,6 +3,8 @@
This crawler crawls the website of the European Parliament and stores the results in Elasticsearch.
It is part of an advanced software practical supervised by Prof. Dr. Michael Gertz.

The complete documentation is hosted on https://europarl-crawler.readthedocs.io/en/latest/

## Introduction
The European Union continuously publishes documents that record the daily business of the Union. One source for these documents is the European Parliament, which publishes all of its documents at https://www.europarl.europa.eu/plenary/en/home.html. The website has a search function but offers no central place to download all documents.

53 changes: 0 additions & 53 deletions docs/code/database.rst

This file was deleted.

56 changes: 0 additions & 56 deletions docs/code/rules.rst

This file was deleted.

45 changes: 0 additions & 45 deletions docs/code/workers.rst

This file was deleted.

26 changes: 22 additions & 4 deletions docs/source/europarl.rst
@@ -19,6 +19,8 @@ Submodules
europarl.configuration module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This module is responsible for loading the configuration files.

.. automodule:: europarl.configuration
   :members:
   :undoc-members:
@@ -27,6 +29,8 @@ europarl.configuration module
europarl.elasticinterface module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This module contains the functions used to interact with Elasticsearch.

.. automodule:: europarl.elasticinterface
   :members:
   :undoc-members:
@@ -35,9 +39,23 @@ europarl.elasticinterface module
europarl.eurocli module
^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: europarl.eurocli
   :members:
   :undoc-members:
   :show-inheritance:
This module uses the click package to implement the command line interface used for managing the crawler.

.. autofunction:: europarl.eurocli.main

.. autofunction:: europarl.eurocli.cli

.. autofunction:: europarl.eurocli.crawler_start

.. autofunction:: europarl.eurocli.rules_function

.. autofunction:: europarl.eurocli.postprocessing_start

.. autofunction:: europarl.eurocli.postprocessing_reset

.. autofunction:: europarl.eurocli.indexing_start

.. autofunction:: europarl.eurocli.indexing_unindex

.. autofunction:: europarl.eurocli.indexing_reindex
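
To illustrate how a click-based command line interface of this kind is usually structured, here is a minimal sketch. It reuses the names `cli` and `crawler_start` from the documented functions, but the group layout, the `--workers` option, and the function bodies are assumptions made for illustration and are not taken from `europarl.eurocli`.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Top-level command group (hypothetical layout)."""


@cli.group()
def crawler():
    """Hypothetical subgroup collecting the crawler commands."""


@crawler.command("start")
@click.option("--workers", default=1, show_default=True,
              help="Illustrative option, not part of the real CLI.")
def crawler_start(workers):
    """Pretend to start the crawler and report the worker count."""
    click.echo(f"starting crawler with {workers} worker(s)")


if __name__ == "__main__":
    # CliRunner invokes the command in-process, which is handy for tests.
    runner = CliRunner()
    result = runner.invoke(cli, ["crawler", "start", "--workers", "2"])
    print(result.output, end="")
```

Documenting each command with its own `autofunction` directive, as the hunk above does, keeps the decorated click commands visible in the generated reference one by one.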

7 changes: 0 additions & 7 deletions docs/source/europarl.workers.rst
@@ -57,10 +57,3 @@ europarl.workers.tokenbucket module
   :undoc-members:
   :show-inheritance:

Module contents
---------------

.. automodule:: europarl.workers
   :members:
   :undoc-members:
   :show-inheritance:
7 changes: 7 additions & 0 deletions europarl/configuration.py
@@ -2,6 +2,13 @@


def read():
    """
    Reads configuration files from the local repository and ``/etc/europarl/settings.ini`` and returns a configparser object.

    Returns:
        configparser.ConfigParser: configuration object created by merging both files
    """

    config = configparser.ConfigParser()

    file_locations = ["settings.ini", "/etc/europarl/settings.ini"]
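
The merging behaviour described in the docstring can be sketched with the standard library alone. The example below is a minimal illustration, not the project's code: the helper name `read_merged`, the section `General`, and the options `IndexName` and `Workers` are hypothetical, and temporary files stand in for `settings.ini` and `/etc/europarl/settings.ini`.

```python
import configparser
import tempfile
from pathlib import Path


def read_merged(file_locations):
    """Merge INI files in order; values in later files override earlier ones."""
    config = configparser.ConfigParser()
    # ConfigParser.read() skips files that do not exist and returns
    # the list of files it actually parsed.
    loaded = config.read(file_locations)
    return config, loaded


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "settings.ini"     # stands in for the repository file
    system = Path(tmp) / "system.ini"      # stands in for the /etc file
    missing = Path(tmp) / "missing.ini"    # never created, silently skipped
    local.write_text("[General]\nIndexName = europarl\nWorkers = 2\n")
    system.write_text("[General]\nWorkers = 8\n")

    config, loaded = read_merged([local, system, missing])
    workers = config["General"]["Workers"]
    index_name = config["General"]["IndexName"]
    loaded_count = len(loaded)
```

Because the files are read in list order, the second location takes precedence for any option that both files define.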
85 changes: 75 additions & 10 deletions europarl/elasticinterface.py
@@ -5,6 +5,17 @@


def get_actions(documents, indexname, op_type):
    """
    Generator yielding action objects that do not require object data, for use with Elasticsearch's bulk methods.

    Args:
        documents (list): list of document data tuples from the database
        indexname (str): index on which the action should be used
        op_type (str): operation name

    Yields:
        dict: Elasticsearch bulk action dictionary
    """
    for row in documents:
        value = {
            "_id": row[0],
@@ -15,6 +26,17 @@ def get_actions(documents, indexname, op_type):


def get_actions_data(documents, indexname, op_type):
    """
    Generator yielding action objects that require object data, for use with Elasticsearch's bulk methods.

    Args:
        documents (list): list of document data tuples from the database
        indexname (str): index on which the action should be used
        op_type (str): operation name

    Yields:
        dict: Elasticsearch bulk action dictionary
    """
    for row in documents:

        value = {
@@ -26,17 +48,23 @@ def get_actions_data(documents, indexname, op_type):
        yield (value)
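
Since the bodies of both generators are only partly visible in this diff, the following is a minimal sketch of what bulk action dictionaries for the Elasticsearch bulk helpers typically look like. The function names and the row layout `(id, body)` are assumptions; only the `"_id": row[0]` entry is confirmed by the hunk above.

```python
def bulk_actions(rows, index, op_type):
    """Yield actions that carry no document body (for example deletes)."""
    for row in rows:
        yield {"_op_type": op_type, "_index": index, "_id": row[0]}


def bulk_actions_with_source(rows, index, op_type):
    """Yield actions that carry the document body in ``_source``."""
    for row in rows:
        # Assumption: the second tuple element holds the document body.
        yield {
            "_op_type": op_type,
            "_index": index,
            "_id": row[0],
            "_source": row[1],
        }


rows = [(1, {"title": "Minutes"}), (2, {"title": "Agenda"})]
deletes = list(bulk_actions(rows, "europarl-00000", "delete"))
indexes = list(bulk_actions_with_source(rows, "europarl-00000", "index"))
```

Generators keep memory use flat because the bulk helpers consume the actions lazily instead of requiring a fully built list.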


def index_documents(es, docs, action, indexname, documents, silent=False):
    return manage_documents(es, docs, action, indexname, documents, get_actions, silent)


def index_documents_data(es, docs, action, indexname, documents, silent=False):
    return manage_documents(
        es, docs, action, indexname, documents, get_actions_data, silent
    )


def manage_documents(es, docs, action, indexname, documents, function, silent=False):
    """
    Interface function to interact with an Elasticsearch index.
    Gets the currently active index, initiates the desired operations in bulk,
    and returns a list of successful operations.

    Args:
        es: Elasticsearch instance
        action (str): desired operation
        indexname (str): configured index name
        documents (list): list of documents
        function (generator): generator function used to generate the actions
        silent (bool, optional): Whether errors should be ignored. Defaults to False.

    Returns:
        list(int): list of successful document ids
    """
    index = get_current_index(es, indexname)

    bulk_result = helpers.streaming_bulk(
@@ -63,7 +91,34 @@ def manage_documents(es, docs, action, indexname, documents, function, silent=Fa
    return successfull_ids
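
The middle of `manage_documents` is collapsed in this diff, so the result handling is not visible. As a hedged sketch of the usual pattern: `helpers.streaming_bulk` yields `(ok, item)` tuples in which `item` maps the operation type to its details, and the successful ids can be collected as shown below. The helper name `collect_successful_ids` and the simulated results are assumptions; no Elasticsearch server is involved.

```python
def collect_successful_ids(results, silent=False):
    """Collect the ids of all operations that the bulk helper reported as ok."""
    ids = []
    for ok, item in results:
        # Each item looks like {"index": {"_id": "7", "status": 201, ...}}.
        op_type, details = next(iter(item.items()))
        if ok:
            ids.append(details["_id"])
        elif not silent:
            print(f"{op_type} failed for {details.get('_id')}: {details.get('error')}")
    return ids


# Simulated output in the shape produced by streaming_bulk.
simulated = [
    (True, {"index": {"_id": 1, "status": 201}}),
    (False, {"index": {"_id": 2, "status": 400, "error": "mapper_parsing_exception"}}),
    (True, {"index": {"_id": 3, "status": 200}}),
]
ok_ids = collect_successful_ids(simulated, silent=True)
```

Failed operations are only yielded rather than raised when the helper is called with `raise_on_error=False`, which is presumably what a `silent` flag would control.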


def index_documents(es, docs, action, indexname, documents, silent=False):
    """
    Wrapper for ``manage_documents(...)``, preselecting ``get_actions(...)``
    """
    return manage_documents(es, docs, action, indexname, documents, get_actions, silent)


def index_documents_data(es, docs, action, indexname, documents, silent=False):
    """
    Wrapper for ``manage_documents(...)``, preselecting ``get_actions_data(...)``
    """
    return manage_documents(
        es, docs, action, indexname, documents, get_actions_data, silent
    )


def create_index(es, indexname, mapping=None):
    """
    Create an index based upon the configured base index name. An appended 5-digit number is incremented to version the different index instances.

    Args:
        es: Elasticsearch instance
        indexname (str): configured base index name
        mapping (dict, optional): Dict with the index mapping. Defaults to the content of europarl/europarl_index.json.

    Returns:
        str: complete index name
    """
    current_index = get_current_index(es, indexname)
    if not current_index:
        index = indexname + "-" + "00000"
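
The versioning scheme from the docstring (base name plus an incrementing 5-digit suffix) can be sketched as a pure function. The name `next_index_name` is hypothetical, and the increment branch is an assumption, since only the `-00000` case is visible in this hunk.

```python
def next_index_name(base, current=None):
    """Return the name of the next index version for ``base``."""
    if not current:
        # No index exists yet: start with version 00000.
        return f"{base}-00000"
    # Assumption: the version is everything after the last hyphen.
    version = int(current.rsplit("-", 1)[1])
    return f"{base}-{version + 1:05d}"


first = next_index_name("europarl")
tenth = next_index_name("europarl", "europarl-00009")
```

Zero-padding the suffix to a fixed width keeps the index names sortable as plain strings.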
@@ -82,6 +137,16 @@ def create_index(es, indexname, mapping=None):


def get_current_index(es, indexname):
    """
    Return the current complete index name based upon the configured base index name.

    Args:
        es: Elasticsearch instance
        indexname (str): base index name

    Returns:
        str: complete index name
    """
    indices = es.indices.resolve_index(indexname + "*")["indices"]
    index_names = [index["name"] for index in indices]

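
The remainder of `get_current_index` is collapsed here, so how the current index is picked from `index_names` is not shown. One plausible approach, given the zero-padded version suffix, is to take the lexicographically largest name. The sketch below works on a dictionary shaped like the `resolve_index` response used above; the function name `latest_index` and the selection rule are assumptions.

```python
def latest_index(resolve_response):
    """Pick the highest-versioned index name from a resolve-index style response."""
    names = [index["name"] for index in resolve_response["indices"]]
    # With a fixed-width numeric suffix, string order equals version order.
    return max(names) if names else None


response = {
    "indices": [
        {"name": "europarl-00001"},
        {"name": "europarl-00010"},
        {"name": "europarl-00002"},
    ]
}
current = latest_index(response)
empty = latest_index({"indices": []})
```

Returning `None` for an empty result matches the `if not current_index:` check in `create_index`, which treats a missing index as the signal to create version `00000`.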