-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparison to SortedContainers.SortedList #1
Comments
Thanks! I wasn't aware of Your implementation is in essence a 2-level B-tree with a leaf page size of 1000, correct? And you maintain a separate index for positional access (it wasn't immediately clear to me how that worked). About performance... I was mostly interested in Big-O performance. Having a small constant factor between my and your implementation doesn't bother me that much. That said, I don't want to back down from a nice challenge.. .:) The reason I did a small test where I implemented the inner loop for
As you can see SortedContainers is about 2x slower here. Also on PyPy I would expect PySkipList to be very fast without a C extension, as I expect the JIT to specialize and inline most the inner loops. C implementation is below. Right now it only contains the I may add this to PySkipList, but I'm not sure yet. The inner loops for insertion and removal are a bit more messy. /*
* This file is part of PySkiplist. PySkiplist is Copyright (c) 2012-2015 by
* the PySkiplist authors.
*
* PySkiplist is free software available under the MIT license. See the file
* named LICENSE distributed with this file for the exact licensing terms.
*/
#include <Python.h>
#if PY_MAJOR_VERSION != 3
# error "This module is for Python 3 only."
#endif
static PyObject *
_skiplist_find_lt(PyObject *self, PyObject *args)
{
PyObject *Phead, *Ptail, *Ppath, *Pdistance, *Pkey, *Pnode, *Pnnode;
long i, level, distance;
if (!PyArg_ParseTuple(args, "O!O!O!O!lO:find_lt",
&PyList_Type, &Phead, &PyList_Type, &Ptail,
&PyList_Type, &Ppath, &PyList_Type, &Pdistance,
&level, &Pkey))
return NULL;
Pnode = Phead;
distance = 0;
for (i=level-1; i>=0; i--) {
Pnnode = PyList_GET_ITEM(Pnode, 2+i);
while ((Pnnode != Ptail) &&
PyObject_RichCompareBool(PyList_GET_ITEM(Pnnode, 0), Pkey, Py_LT)) {
Pnode = Pnnode;
Pnnode = PyList_GET_ITEM(Pnnode, 2+i);
distance += i ? PyLong_AS_LONG(PyList_GET_ITEM(Pnode, Py_SIZE(Pnode)-1)) : 1;
}
Py_INCREF(Pnode);
if (PyList_SetItem(Ppath, i, Pnode) < 0)
return NULL;
if (PyList_SetItem(Pdistance, i, PyLong_FromLong(distance)) < 0)
return NULL;
}
Py_INCREF(Py_None);
return Py_None;
}
static PyMethodDef _skiplist_methods[] =
{
{"find_lt", (PyCFunction) _skiplist_find_lt, METH_VARARGS},
{NULL, NULL}
};
PyDoc_STRVAR(_skiplist_doc, "PySkipList accelerator methods");
static struct PyModuleDef _skiplist_module = {
PyModuleDef_HEAD_INIT,
"_skiplist",
_skiplist_doc,
-1,
_skiplist_methods
};
PyMODINIT_FUNC PyInit__skiplist(void)
{
return PyModule_Create(&_skiplist_module);
} To use it, you need this patch: diff --git a/pyskiplist/skiplist.py b/pyskiplist/skiplist.py
index 5285ef8..5d396d5 100644
--- a/pyskiplist/skiplist.py
+++ b/pyskiplist/skiplist.py
@@ -13,6 +13,8 @@ import sys
import math
import random
+from . import _skiplist
+
__all__ = ['SkipList']
@@ -362,7 +364,8 @@ class SkipList(object):
If the key was not found, return *default*. If no default was provided,
return ``None``. This method never raises a ``KeyError``.
"""
- self._find_lt(key)
+ _skiplist.find_lt(self._head, self._tail, self._path, self._distance,
+ self._level, key)
node = self._path[0][2]
if node is self._tail or key < node[0]:
return default
diff --git a/setup.cfg b/setup.cfg
index 8bd5075..70a0349 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,3 +1,6 @@
+[build_ext]
+inplace=1
+
[bdist_wheel]
universal=1
diff --git a/setup.py b/setup.py
index 484130c..510bcdc 100644
--- a/setup.py
+++ b/setup.py
@@ -7,7 +7,7 @@
from __future__ import absolute_import, print_function
-from setuptools import setup
+from setuptools import setup, Extension
version_info = {
@@ -32,7 +32,11 @@ version_info = {
def main():
- setup(packages=['pyskiplist'], **version_info)
+ setup(
+ packages=['pyskiplist'],
+ ext_modules=[Extension('pyskiplist._skiplist', ['pyskiplist/_skiplist.c'])],
+ **version_info
+ )
if __name__ == '__main__': |
Wow. So great to get a positive response and a counter-challenge. That's right. SortedContainers uses a 2-level B-tree (which is really a half-hearted B-tree) with leaf size of 1,000. The indexing is hard to explain. I probably left the best notes in the source at: Can you try benchmarking the SortedContainers version again but with this def search(self, key, default=None):
_list = self._list
_maxes = _list._maxes
if not _maxes:
return default
pos = bisect_left(_maxes, key)
if pos == len(_maxes):
return default
idx = bisect_left(_list._keys[pos], key)
pair = _list._lists[pos][idx]
if key == pair[0]:
return pair
else:
return default This method uses some knowledge of the SortedList internals and skips the numeric indexing. It also inlines the
I think that would put it ahead of your C implementation. Let me know your local results. I also wouldn't put so much emphasis on big-O as it can be deceiving. I'd bet SortedContainers does still better on PyPy mainly because of memory utilization. Your skip-list adds a node for every inserted item which you implement as a list. SortedContainers just adds a pointer to the object. I think that's why SortedContainers has an increasing lead on the |
Latest results. I did some more optimization on the skiplist side. I also ran the benchmark up to 10m nodes to see what would happen.
It's quite interesting to me how well SortedContainers holds up to 10m nodes, even if at that point the load factor of 1,000 is too small. SortedContaniers is a fine implementation indeed! I think your insight is correct: modern processors can do a Essentially, memory has become like rotating storage again :) I'll put a note on the pyskiplist page. I think that at this point users should definitely consider using SortedCollections instead. The only (small) thing for me is that when storing key/value tuples in a When SortedContainers reaches 1.0, I think you should consider submitting it for inclusion in the standard library. Sorted data types are obviously missing. And as this thread shows, it is not immediately obvious what the best solution is. |
The performance of SortedContainers at scale was a pleasant surprise to me too. That it also does so in pure-Python is one of those lessons about "programming in Python" versus "programming the interpreter". The bisect module exposes a surprisingly powerful and fast set of interpreter commands. Thanks for being willing to put a note on the pyskiplist page. I've been trying to build support around SortedContainers and these steps are just the ones toward inclusion in the standard library. You're support joins Mark Summerfield and his sorteddict project. You raise a good point about the design of SortedListWithKey. I had originally wanted to implement your pyskiplist using a SortedDict but colliding keys would break indexing. The SortedListWithKey actually already stores the keys in a parallel list of lists as the values. You wouldn't have to override too many methods to get your version working. For example, def add(self, val):
"""Add the element *val* to the list."""
_maxes, _lists, _keys = self._maxes, self._lists, self._keys
key = self._key(val)
if _maxes:
pos = bisect_right(_maxes, key)
if pos == len(_maxes):
pos -= 1
_maxes[pos] = key
_keys[pos].append(key)
_lists[pos].append(val)
else:
idx = bisect_right(_keys[pos], key)
_keys[pos].insert(idx, key)
_lists[pos].insert(idx, val)
self._expand(pos)
else:
_maxes.append(key)
_keys.append([key])
_lists.append([val])
self._len += 1 But it could instead work like: def add(self, key, val):
"""Add the element *val* to the list with *key*."""
_maxes, _lists, _keys = self._maxes, self._lists, self._keys
if _maxes:
pos = bisect_right(_maxes, key)
if pos == len(_maxes):
pos -= 1
_maxes[pos] = key
_keys[pos].append(key)
_lists[pos].append(val)
else:
idx = bisect_right(_keys[pos], key)
_keys[pos].insert(idx, key)
_lists[pos].insert(idx, val)
self._expand(pos)
else:
_maxes.append(key)
_keys.append([key])
_lists.append([val])
self._len += 1 The only difference there is the addition of the I'm not eager to add a bunch of The diversity of the problem space is one reason Python doesn't have sorted container types. The current api originated from the blist module and adapts the Thanks for your dialogue on this topic. |
Thanks for adding the SortedContainers reference to your README. |
Have you seen the Python SortedContainers module? It's a fast, pure-Python implementation of SortedList, SortedSet, and SortedDict data types. It also has an extensive performance comparison with modules related to yours.
I was interested in your performance claims and I wanted to benchmark SortedContainers against your skip lists. But a SortedList is not quite the same as a SkipList. So I wrote a little wrapper and did a benchmark comparison (using your
perf_skiplist.py
. Here's my results:I'm not sure if you published this because the SkipList data structure is awesome (which it is) or because you need this in a production environment where performance matters. In the latter case, I wanted to provide this wrapper for you and get your thoughts. Here's the wrapper source (Apache2 License):
If you're interested in making this even faster, let me know. There's a few things that could be done.
The text was updated successfully, but these errors were encountered: