Python Interface to solr Semantic Knowledge Graph #3

93 changes: 93 additions & 0 deletions python-interface/README.txt
@@ -0,0 +1,93 @@
Explanation for running solr_skg.py on Semantic Knowledge Graph

Cuno Duursma
[email protected]

Tested on Windows 7 Python 2.7 Solr 5.1.0

schema.xml should be copied to:
semantic-knowledge-graph-master\deploy\solr\server\solr\knowledge-graph\conf\
If you change schema.xml, make sure to remove all indexed documents and restart Solr.

Delete all (!) Solr Knowledge Graph data (paste URL in browser):
http://localhost:8983/solr/knowledge-graph/update?stream.body=<delete><query>*:*</query></delete>
Commit delete (paste URL in browser):
http://localhost:8983/solr/knowledge-graph/update?stream.body=<commit/>
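
The delete-and-commit step above can also be scripted instead of pasted into a browser. The following is a minimal sketch, assuming Python 3's standard library and that the update handler accepts an XML delete posted with commit=true. build_delete_all_request and delete_all are hypothetical helper names, not part of the project, and nothing is sent unless delete_all() is called against a running Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same update endpoint as in the URLs above.
UPDATE_URL = "http://localhost:8983/solr/knowledge-graph/update"


def build_delete_all_request(update_url=UPDATE_URL):
    """Build (but do not send) a POST that deletes every document and commits.

    Hypothetical helper, not part of the project.
    """
    body = b"<delete><query>*:*</query></delete>"
    url = update_url + "?" + urlencode({"commit": "true"})
    return Request(url, data=body, headers={"Content-Type": "text/xml"})


def delete_all(update_url=UPDATE_URL):
    """Send the request; needs a running Solr, so it is not called here."""
    with urlopen(build_delete_all_request(update_url)) as response:
        return response.read().decode("utf-8")


request = build_delete_all_request()
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to inspect the URL and body before any data is removed.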

Restarting Solr:
Open a command window and change to the directory:
semantic-knowledge-graph-master\deploy\solr\server
Stop Solr by executing:
java -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -jar start.jar --stop
The stop port and key may vary; see the Solr console in the browser: http://localhost:8983/solr/#/
Then change to the directory:
semantic-knowledge-graph-master\deploy
and execute (e.g. from bash):
source restart-solr.sh
Running restart-solr.sh alone, without stopping Solr first, did not work for me.


Running solr_skg.py should produce:

Knowledge Graph feed result: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">41</int></lst>
</response>

Knowledge Graph query:
{
  "min_popularity": 0.0,
  "compare": [
    {
      "sort": "relatedness",
      "limit": 5,
      "type": "col1",
      "discover_values": "true"
    }
  ],
  "queries": [
    "col1:\"whale\""
  ]
}
Knowledge Graph results:
{
  "data": [
    {
      "values": [
        {
          "foreground_popularity": 400000.0,
          "popularity": 400000.0,
          "name": "whale",
          "background_popularity": 400000.0,
          "relatedness": 0.02618
        },
        {
          "foreground_popularity": 200000.0,
          "popularity": 200000.0,
          "name": "arctic",
          "background_popularity": 200000.0,
          "relatedness": 0.0163
        },
        {
          "foreground_popularity": 200000.0,
          "popularity": 200000.0,
          "name": "dolphin",
          "background_popularity": 200000.0,
          "relatedness": 0.0163
        },
        {
          "foreground_popularity": 100000.0,
          "popularity": 100000.0,
          "name": "sea",
          "background_popularity": 100000.0,
          "relatedness": 0.01097
        }
      ],
      "type": "col1"
    }
  ]
}
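
The result structure above can be flattened into rows for further processing. The following is a minimal sketch, assuming only the "data"/"values" layout shown in the output above. related_terms is a hypothetical helper name, and SAMPLE_RESPONSE is an abridged copy of that output.

```python
import json

# Abridged copy of the response shown above (first two values only).
SAMPLE_RESPONSE = """
{"data": [{"type": "col1", "values": [
  {"name": "whale", "popularity": 400000.0, "relatedness": 0.02618},
  {"name": "arctic", "popularity": 200000.0, "relatedness": 0.0163}
]}]}
"""


def related_terms(response):
    """Return (type, name, relatedness) tuples from a parsed response.

    Hypothetical helper; it assumes the "data"/"values" layout shown above.
    """
    rows = []
    for comparison in response.get("data", []):
        for value in comparison.get("values", []):
            rows.append((comparison["type"], value["name"], value["relatedness"]))
    return rows


rows = related_terms(json.loads(SAMPLE_RESPONSE))
for field, name, relatedness in rows:
    print("{0}\t{1}\t{2}".format(field, name, relatedness))
```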

Issues:

- The popularity values are not converted correctly from JSON (e.g. 400000.0 should be 4.0).
- The relatedness figures are very low for a z-score.
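
The expected popularity values can be cross-checked by counting documents directly in example.csv. The following is a minimal sketch, assuming Python 3 and whitespace tokenization with lowercasing as an approximation of the "text" field type (it is not Solr's analyzer). cooccurrence_counts is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Copy of python-interface/example.csv from this pull request.
EXAMPLE_CSV = """id,col1,col2,col3
1,dog cat,tone,ten
2,lion zebra,tone,ten
3,whale dolphin,tone,ten
4,swan goose,tone,ten
5,dog home,tone,ten
6,cat home,tone,ten
7,lion zebra zoo,tone,ten
8,whale dolphin sea,tone,ten
9,whale arctic,tone,ten
10,whale arctic,tone,ten
"""


def cooccurrence_counts(csv_text, field, term):
    """Count, per token, the documents that also contain `term` in `field`.

    Tokens are split on whitespace and lowercased, which only approximates
    the analyzer chain in schema.xml.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        tokens = set(row[field].lower().split())
        if term in tokens:
            counts.update(tokens)
    return counts


counts = cooccurrence_counts(EXAMPLE_CSV, "col1", "whale")
print(sorted(counts.items()))
# [('arctic', 2), ('dolphin', 2), ('sea', 1), ('whale', 4)]
```

These counts (4, 2, 2, 1) equal the reported popularity values divided by 100000, which is consistent with the first issue above.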

11 changes: 11 additions & 0 deletions python-interface/example.csv
@@ -0,0 +1,11 @@
id,col1,col2,col3
1,dog cat,tone,ten
2,lion zebra,tone,ten
3,whale dolphin,tone,ten
4,swan goose,tone,ten
5,dog home,tone,ten
6,cat home,tone,ten
7,lion zebra zoo,tone,ten
8,whale dolphin sea,tone,ten
9,whale arctic,tone,ten
10,whale arctic,tone,ten
39 changes: 39 additions & 0 deletions python-interface/schema.xml
@@ -0,0 +1,39 @@
<schema name="blank" version="1.5">

  <uniqueKey>id</uniqueKey>
  <!-- Fits to example.csv-->
  <fields>
    <field name="id" type="keyword" indexed="true" stored="true" required="false" multiValued="false" />
    <field name="col1" type="text" indexed="true" stored="false" required="false" multiValued="true" />
    <field name="col2" type="keyword" indexed="true" stored="false" required="false" multiValued="true"/>
    <field name="col3" type="keyword" indexed="true" stored="false" required="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>


  <types>
    <fieldType name="keyword-case-sens" class="solr.StrField" sortMissingLast="true">
    </fieldType>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="keyword" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;[\/]{0,1}[a-zA-Z]+[\s]{0,1}[\/]{0,1}&gt;" replacement=" | "/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\,\;\:\?\!\\\/]\s" replacement=" | "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
  </types>
</schema>
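
Because the schema has to fit the CSV that is fed to Solr, it can help to check the CSV header against the declared fields before indexing. The following is a minimal sketch, assuming Python 3's standard library. undeclared_columns is a hypothetical helper name, and SCHEMA_FIELDS_XML is an abridged copy of the field declarations above (attributes other than name and type are omitted).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Abridged schema: only the <field> declarations from schema.xml above.
SCHEMA_FIELDS_XML = """<schema name="blank" version="1.5">
  <fields>
    <field name="id" type="keyword"/>
    <field name="col1" type="text"/>
    <field name="col2" type="keyword"/>
    <field name="col3" type="keyword"/>
    <field name="_version_" type="long"/>
  </fields>
</schema>"""

# Header line of example.csv.
CSV_HEADER = "id,col1,col2,col3\n"


def undeclared_columns(schema_xml, csv_text):
    """Return CSV columns that have no matching <field name=...> in the schema."""
    declared = {f.get("name") for f in ET.fromstring(schema_xml).iter("field")}
    header = next(csv.reader(io.StringIO(csv_text)))
    return [column for column in header if column not in declared]


missing = undeclared_columns(SCHEMA_FIELDS_XML, CSV_HEADER)
print(missing)  # an empty list means every CSV column is declared
```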

56 changes: 56 additions & 0 deletions python-interface/sorl_skg.py
@@ -0,0 +1,56 @@
# -*- coding: utf-8 -*-
"""
Created on Tue Jun 13 15:08:17 2017

@author: cduursma
[email protected]
Licensed under Apache License 2.0

Python interface to the Semantic Knowledge Graph in Solr
https://github.com/careerbuilder/semantic-knowledge-graph
Tested on Windows 7 Python 2.7 Solr 5.1.0

"""

import json

import requests

# Global Knowledge Graph query settings
url_query = "http://localhost:8983/solr/knowledge-graph/rel"
url_update = "http://localhost:8983/solr/knowledge-graph/update"
headers_query = {"Content-Type": "application/json", "Accept-Charset": "UTF-8"}
headers_update = {"Content-Type": "text/csv", "Accept-Charset": "UTF-8"}
params_update = {"commit": "true"}
# The CSV data is read in the __main__ block below; reading a file at import
# time would fail whenever that file is not present.

# Example: find terms related to "whale" in "col1"
query_content = {"queries": ["col1:\"whale\""],
                 "min_popularity": 0.0,
                 "compare": [{"type": "col1", "limit": 5, "sort": "relatedness", "discover_values": "true"}]}


def feed_skg(data_update):
    """Feeds the knowledge graph with data. Data must be the contents of a CSV file, read in binary mode, matching the Knowledge Graph schema.xml."""
    # POST is the conventional method for sending a request body.
    rf = requests.post(url_update, params=params_update, headers=headers_update, data=data_update)
    return rf


def query_skg(query):
    """Queries the knowledge graph. Query must be a Python dict representing a JSON object."""
    rq = requests.post(url_query, headers=headers_query, json=query)
    return rq


if __name__ == '__main__':
    # Read the example data and close the file afterwards.
    with open("example.csv", "rb") as csv_file:
        data_update = csv_file.read()
    rf = feed_skg(data_update)
    print("Knowledge Graph feed result: {0}".format(rf.text))
    print("Knowledge Graph query:")
    print(json.dumps(query_content, indent=2))
    rq = query_skg(query_content)
    parsed = json.loads(rq.text)
    print("Knowledge Graph results:")
    print(json.dumps(parsed, indent=2, sort_keys=False))
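
The hard-coded query_content can be generalized to other fields and terms. The following is a minimal sketch, assuming the request keys are exactly those used in query_content above. build_query is a hypothetical helper name, and the term is not escaped, so it should not contain double quotes.

```python
import json


def build_query(field, term, limit=5, min_popularity=0.0):
    """Build a request body shaped like query_content for another field/term.

    Hypothetical helper; the keys mirror query_content above, and nothing
    else about the /rel request format is assumed.
    """
    return {
        "queries": ['{0}:"{1}"'.format(field, term)],
        "min_popularity": min_popularity,
        "compare": [{"type": field, "limit": limit, "sort": "relatedness",
                     "discover_values": "true"}],
    }


query = build_query("col1", "lion", limit=3)
print(json.dumps(query, indent=2))
```

The returned dict can be passed to query_skg() in place of query_content.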