diff --git a/.gitignore b/.gitignore index 7c26b16..0eeddf6 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ +config_centillion.py config_flask.py vp credentials.json diff --git a/Readme.md b/Readme.md index 272c698..a8be9c7 100644 --- a/Readme.md +++ b/Readme.md @@ -6,10 +6,10 @@ one centillion is 3.03 log-times better than a googol. -![Screen shot of centillion](docs/images/ss.png) +![Screen shot: centillion search](docs/images/search.png) -## what is it +## What Is It Centillion (https://github.com/dcppc/centillion) is a search engine that can index three kinds of collections: Google Documents (.docx files), Github issues, and Markdown files in @@ -25,14 +25,43 @@ defined in `centillion.py`. The centillion keeps it simple. -## authentication layer +## Authentication Layer Centillion lives behind a Github authentication layer, implemented with [flask-dance](https://github.com/singingwolfboy/flask-dance). When you first visit the site it will ask you to authenticate with Github so that it can verify you have permission to access the site. -## technologies +![Screen shot: centillion authentication](docs/images/auth.png) + +## Master List + +There is a master list of all content indexed by centillion at the master list page, +. + +A master list for each type of document indexed by the search engine is displayed +in a table: + +![Screen shot: centillion master list](docs/images/master_list.png) + +The metadata shown in these tables can be filtered and sorted: + +![Screen shot: centillion master list with sorting](docs/images/master_list2.png) + +## Control Panel + +There's also a control panel at +that allows you to rebuild the search index from scratch. The search index +stores versions/contents of files locally, so re-indexing involves going out and +asking each API for new versions of a file/document/web page. When you re-index +the main search index, it will ask every API for new versions of every document. 
+You can also update only specific types of documents in the search index. + +![Screen shot: centillion control panel](docs/images/control_panel.png) + + + +## Technologies Centillion is a Python program built using whoosh (search engine library). It indexes the full text of docx files in Google Documents, just the filenames for @@ -41,16 +70,61 @@ results are grouped by issue. Centillion requires Google Drive and Github OAuth apps. Once you provide credentials to Flask you're all set to go. -## control panel +## Configuration -There's also a control panel at -that allows you to rebuild the search index from scratch (the Google Drive indexing -takes a while). +You will need to configure both the centillion search index and the flask app. + +The centillion search index is configured with `config_centillion.py`; this file +sets the names of repositories to crawl when indexing issues and files. + +The flask app is configured with `config_flask.py`. This file contains sensitive +information and is in the `.gitignore` file. This file contains API credentials +for Github and Groups.io. + +Examples are provided in `config_centillion.example.py` and `config_flask.example.py`. + + +## Authentication + +The search engine will need to connect to several APIs when it re-indexes the +search index: + +* Github +* Groups.io +* Google Drive + +### Github + +Github API credentials (both an OAuth token for the centillion app's Github +authentication mechanism, and a personal access token for accessing repositories +during the re-indexing process) are provided in `config_flask.py`. + +### Groups.io + +The Groups.io API token is used to index email threads. This token is provided in +`config_flask.py`. + +### Google Drive + +The Google Drive API credentials are provided in a file, `credentials.json`. This is +the file that is generated when the OAuth process is complete. 
+ +When you enable the Google Drive API in the Google Cloud Console, you will be provided +with a file `client_secrets.json`. To authenticate centillion with Google Drive, you should +download this file, and run the Google Drive utility directly: + +``` +python gdrive_util.py +``` -![Screen shot of centillion control panel](docs/images/cp.png) +This will initiate the authentication procedure. Sign in as a user that has access to +the documents you want to index, and _only_ the documents you want to index (it is useful +to set up a bot account for this purpose). +Once you log in as that user, it will create `credentials.json`, and the Google Drive +re-indexing procedure should not have any problems authenticating using that file. -## quickstart (with Github auth) +## Quickstart (With Github Auth) Start by creating a Github OAuth application. Get the public and private application key @@ -85,7 +159,7 @@ This will start a Flask server, and you can view the minimal search engine interface in your browser at `http://:5000`. 
-## troubleshooting +## Troubleshooting If you are having problems with your callback URL being treated as HTTP by Github, even though there is an HTTPS address, and diff --git a/config_centillion.py b/config_centillion.example.py similarity index 86% rename from config_centillion.py rename to config_centillion.example.py index 8615f78..253cd81 100644 --- a/config_centillion.py +++ b/config_centillion.example.py @@ -23,13 +23,6 @@ "dcppc/design-guidelines", "dcppc/2018-may-workshop", "dcppc/centillion" - ], - "github_ignore_files_re" : [ - '^\.*', - '^_*' - ], - "github_ignore_dirs_re" : [ - '^_*' ] } diff --git a/docs/images/auth.png b/docs/images/auth.png new file mode 100644 index 0000000..d66eac6 Binary files /dev/null and b/docs/images/auth.png differ diff --git a/docs/images/control_panel.png b/docs/images/control_panel.png new file mode 100644 index 0000000..873b5cd Binary files /dev/null and b/docs/images/control_panel.png differ diff --git a/docs/images/cp.png b/docs/images/cp.png deleted file mode 100644 index a2972d9..0000000 Binary files a/docs/images/cp.png and /dev/null differ diff --git a/docs/images/master_list.png b/docs/images/master_list.png new file mode 100644 index 0000000..31971fa Binary files /dev/null and b/docs/images/master_list.png differ diff --git a/docs/images/master_list2.png b/docs/images/master_list2.png new file mode 100644 index 0000000..2043dba Binary files /dev/null and b/docs/images/master_list2.png differ diff --git a/docs/images/search.png b/docs/images/search.png new file mode 100644 index 0000000..5d8a397 Binary files /dev/null and b/docs/images/search.png differ diff --git a/docs/images/ss.png b/docs/images/ss.png deleted file mode 100644 index d87ce8b..0000000 Binary files a/docs/images/ss.png and /dev/null differ diff --git a/static/centillion_master_list.js b/static/centillion_master_list.js index efb11ac..2fcc93f 100644 --- a/static/centillion_master_list.js +++ b/static/centillion_master_list.js @@ -110,12 +110,29 @@ 
function load_gdoc_table(){ r[++j] = ''; } r[++j] = '' - $('#gdocs-master-list').html(r.join('')); - $('#gdocs-master-list').DataTable({ + + // Construct names of id tags + var doctype = 'gdocs'; + var idlabel = '#' + doctype + '-master-list'; + var filtlabel = idlabel + '_filter'; + + // Initialize the DataTable + $(idlabel).html(r.join('')); + $(idlabel).DataTable({ responsive: true, lengthMenu: [50,100,250,500] }); - initGdocTable = true; + + // Get the search filter section and search box + var searchsec = $(filtlabel).find('label'); + var searchbox = searchsec.find('input'); + + // Replace search filter section text, + // then re-add the removed search box + searchsec.text('Search Metadata: '); + searchsec.append(searchbox); + + initGdocTable = true }); console.log('Finished loading Google Drive master list'); } @@ -160,11 +177,28 @@ function load_issue_table(){ r[++j] = ''; } r[++j] = '' - $('#issues-master-list').html(r.join('')); - $('#issues-master-list').DataTable({ + + // Construct names of id tags + var doctype = 'issues'; + var idlabel = '#' + doctype + '-master-list'; + var filtlabel = idlabel + '_filter'; + + // Initialize the DataTable + $(idlabel).html(r.join('')); + $(idlabel).DataTable({ responsive: true, lengthMenu: [50,100,250,500] }); + + // Get the search filter section and search box + var searchsec = $(filtlabel).find('label'); + var searchbox = searchsec.find('input'); + + // Replace search filter section text, + // then re-add the removed search box + searchsec.text('Search Metadata: '); + searchsec.append(searchbox); + initIssuesTable = true; }); console.log('Finished loading Github issues master list'); @@ -206,11 +240,28 @@ function load_ghfile_table(){ r[++j] = ''; } r[++j] = '' - $('#ghfiles-master-list').html(r.join('')); - $('#ghfiles-master-list').DataTable({ + + // Construct names of id tags + var doctype = 'ghfiles'; + var idlabel = '#' + doctype + '-master-list'; + var filtlabel = idlabel + '_filter'; + + // Initialize the 
DataTable + $(idlabel).html(r.join('')); + $(idlabel).DataTable({ responsive: true, lengthMenu: [50,100,250,500] }); + + // Get the search filter section and search box + var searchsec = $(filtlabel).find('label'); + var searchbox = searchsec.find('input'); + + // Replace search filter section text, + // then re-add the removed search box + searchsec.text('Search Metadata: '); + searchsec.append(searchbox); + initGhfilesTable = true; }); console.log('Finished loading Github file list'); @@ -234,7 +285,7 @@ function load_markdown_table(){ r[++j] = '' r[++j] = ''; r[++j] = 'Markdown File Name'; - r[++j] = 'Repo'; + r[++j] = 'Repository'; r[++j] = ''; r[++j] = '' r[++j] = '' @@ -250,11 +301,28 @@ function load_markdown_table(){ r[++j] = ''; } r[++j] = '' - $('#markdown-master-list').html(r.join('')); - $('#markdown-master-list').DataTable({ + + // Construct names of id tags + var doctype = 'markdown'; + var idlabel = '#' + doctype + '-master-list'; + var filtlabel = idlabel + '_filter'; + + // Initialize the DataTable + $(idlabel).html(r.join('')); + $(idlabel).DataTable({ responsive: true, lengthMenu: [50,100,250,500] }); + + // Get the search filter section and search box + var searchsec = $(filtlabel).find('label'); + var searchbox = searchsec.find('input'); + + // Replace search filter section text, + // then re-add the removed search box + searchsec.text('Search Metadata: '); + searchsec.append(searchbox); + initMarkdownTable = true; }); console.log('Finished loading Markdown list'); @@ -293,14 +361,32 @@ function load_emailthreads_table(){ r[++j] = ''; } r[++j] = '' - $('#emailthreads-master-list').html(r.join('')); - $('#emailthreads-master-list').DataTable({ + + // Construct names of id tags + var doctype = 'emailthreads'; + var idlabel = '#' + doctype + '-master-list'; + var filtlabel = idlabel + '_filter'; + + // Initialize the DataTable + $(idlabel).html(r.join('')); + $(idlabel).DataTable({ responsive: true, lengthMenu: [50,100,250,500] }); - 
initEmailthreadsTable = true + + // Get the search filter section and search box + var searchsec = $(filtlabel).find('label'); + var searchbox = searchsec.find('input'); + + // Replace search filter section text, + // then re-add the removed search box + searchsec.text('Search Metadata: '); + searchsec.append(searchbox); + + initEmailthreadsTable = true; }); console.log('Finished loading Groups.io email threads list'); } } } + diff --git a/static/style.css b/static/style.css index 322799e..c5e50e4 100755 --- a/static/style.css +++ b/static/style.css @@ -1,3 +1,7 @@ +.btn-reindex-type, .btn-reindex-all { + width: 350px; +} + #github-button { display:inline-block; font-size: 20px; diff --git a/templates/controlpanel.html b/templates/controlpanel.html index 0fbc972..ca277ff 100755 --- a/templates/controlpanel.html +++ b/templates/controlpanel.html @@ -20,17 +20,9 @@

Re-index every document in the remote collection in the search index. Warning: this operation may take a while. -

-

Update Main Index -

-

Update Github Files Index -

-

Update Github Issues Index -

-

Update Google Drive Index -

-

Update Groups.io Email Threads Index -

+

+

Update Main Index +

@@ -38,30 +30,36 @@

- {# update diff search index #} - {# + + + {# update search index by type #}

- Update Diff Search Index + Update Search Index by Type

-

Diff search index only re-indexes documents created after the last - search index update. Not currently implemented. -

- Update Diff Index -

+

Re-index individual document types in the search index. +

+

Update Google Drive Index +

+

Update Github Files Index +

+

Update Github Issues Index +

+

Update Groups.io Email Threads Index +

- #} + diff --git a/templates/masterlist.html b/templates/masterlist.html index a5a7c81..e26a803 100755 --- a/templates/masterlist.html +++ b/templates/masterlist.html @@ -12,7 +12,7 @@ #}
-
+
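Review note on `static/centillion_master_list.js`: each of the five `load_*_table` functions in this diff builds the same two selectors from a `doctype` string (`'#' + doctype + '-master-list'` and that id with `_filter` appended). The sketch below shows one way this could be factored out. The helper name `masterListSelectors` is hypothetical and is not part of this commit; the `_filter` suffix is taken from the diff, not verified against the DataTables version in use.

```javascript
// Hypothetical helper (not in this commit): build the selectors that each
// load_*_table function in the diff currently constructs by hand.
function masterListSelectors(doctype) {
    // Table id pattern used throughout the diff, e.g. '#gdocs-master-list'
    var idlabel = '#' + doctype + '-master-list';
    return {
        table: idlabel,
        // The diff assumes the search filter wrapper has the id '<table id>_filter'
        filter: idlabel + '_filter'
    };
}

// Example: the five document types that appear in the diff
var doctypes = ['gdocs', 'issues', 'ghfiles', 'markdown', 'emailthreads'];
doctypes.forEach(function (doctype) {
    var sel = masterListSelectors(doctype);
    console.log(sel.table + ' / ' + sel.filter);
});
```

Inside a loader, `$(sel.table)` and `$(sel.filter)` would stand in for `$(idlabel)` and `$(filtlabel)`. Separately, DataTables provides a `language.search` initialization option for setting the search label text, which might avoid replacing the label and re-appending the input by hand; whether it fits this page has not been checked here.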