There's about a 15% speedup by tweaking the data classes.
CPython's __slots__ is a bit of a hack that makes instance attributes faster to look up and makes instances more compact. It is not supported by the Python 3.7 dataclass decorator (see https://www.python.org/dev/peps/pep-0557/#support-for-automatically-setting-slots ), but it can be added manually, which is the recommended workaround. Doing that improves your Python benchmark over the original:
```python
from dataclasses import dataclass

# 36077.7910000003 μs
@dataclass
class Vertex:
    x: float
    y: float
    z: float
```
to
```python
# 32817.94299999996 μs
@dataclass
class Vertex:
    __slots__ = ("x", "y", "z")
    x: float
    y: float
    z: float
```
Then, for a reason I don't understand, the default __init__ adds small but measurable overhead compared to a manual one.
```python
# 30137.319999997915 μs
@dataclass
class Vertex:
    __slots__ = ("x", "y", "z")
    x: float
    y: float
    z: float

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
```
However, a manual __init__ seems wrong given the goal of the dataclass decorator.
If I add a __slots__ = ("normal", "v1", "v2", "v3") to the Triangle class, the timing drops further, to 28504 μs.
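For reference, here is roughly what the fully slotted pair looks like; the Triangle field names come from the __slots__ tuple above, but treat this as a sketch rather than the benchmark's exact class:

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    __slots__ = ("x", "y", "z")
    x: float
    y: float
    z: float

@dataclass
class Triangle:
    __slots__ = ("normal", "v1", "v2", "v3")
    normal: Vertex
    v1: Vertex
    v2: Vertex
    v3: Vertex
```

With __slots__ on both classes, instances carry no per-object __dict__ at all, which is where both the memory and lookup savings come from.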
There are a couple of microoptimizations which improved things by a couple of percent, but not enough to warrant consideration in this benchmark.
ctypes alternative
One way to get better performance is to use the ctypes module from the standard library. The following takes about 124 μs:
```python
import struct
import timeit
import ctypes

class Vertex(ctypes.Structure):
    _pack_ = 4
    _fields_ = [("x", ctypes.c_float),
                ("y", ctypes.c_float),
                ("z", ctypes.c_float)]

class Triangle(ctypes.Structure):
    _pack_ = 2
    _fields_ = [("normal", Vertex),
                ("v1", Vertex),
                ("v2", Vertex),
                ("v3", Vertex),
                ("_ignore", ctypes.c_short)]

def parse(path: str):
    with open(path, 'rb') as stl:
        stl.seek(80)  # skip header
        trianglecount = struct.unpack('I', stl.read(4))[0]
        buffer_size = 50 * trianglecount
        s = stl.read(buffer_size)
        assert len(s) == buffer_size, (len(s), buffer_size)
        return (Triangle * trianglecount).from_buffer_copy(s)

def benchmark():
    triangles = parse('nist.stl')
    ## print("blah", sum(triangle.normal.x + triangle.v1.y + triangle.v2.z + triangle.v3.x
    ##                   for triangle in triangles))

time = min(timeit.Timer(benchmark).repeat(number=1, repeat=500)) * 1e6
print(str(time) + " μs")
```
It's a bit of a cheat, as there isn't any object instantiation. If I uncomment the test code, the benchmark time goes to 6881 μs. If I compromise and instead create the Triangle instances up front but the Vertex instances on demand, using return list((Triangle*trianglecount).from_buffer_copy(s)), then the parse time goes to 1560 μs and the benchmark+test time only slightly increases, to 7000 μs.
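A minimal sketch of that compromise, using a hand-built two-triangle buffer instead of nist.stl (the Vertex/Triangle layouts are the ones defined above):

```python
import ctypes
import struct

class Vertex(ctypes.Structure):
    _pack_ = 4
    _fields_ = [("x", ctypes.c_float),
                ("y", ctypes.c_float),
                ("z", ctypes.c_float)]

class Triangle(ctypes.Structure):
    _pack_ = 2
    _fields_ = [("normal", Vertex),
                ("v1", Vertex),
                ("v2", Vertex),
                ("v3", Vertex),
                ("_ignore", ctypes.c_short)]

# Two 50-byte records: 12 little-endian floats plus the 2-byte
# "attribute byte count", exactly how binary STL stores a triangle.
raw = struct.pack('<12fh', *range(12), 0) + struct.pack('<12fh', *range(12, 24), 0)

# list() materializes the Triangle wrappers eagerly; the Vertex
# wrappers are still created on demand at attribute-access time.
triangles = list((Triangle * 2).from_buffer_copy(raw))
```

Note that ctypes.sizeof(Triangle) is exactly 50, matching the on-disk record size, which is why no per-field unpacking is needed at all.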
NumPy alternative
If you're willing to give up the attribute access API, another option is to use NumPy, which brings the timing down to 80 μs. With structured types I can reference triangles[10].v1.y as triangles[10]["v1"]["y"]. However, I don't think this is acceptable for what you are looking for.
```python
import numpy as np
import struct
import timeit

point = [("x", np.float32), ("y", np.float32), ("z", np.float32)]
triangle_fields = np.dtype([
    ("normal", point),
    ("v1", point),
    ("v2", point),
    ("v3", point),
    ("ignore", "S2")  # 2-byte "attribute byte count" padding
])

def parse(path: str):
    with open(path, 'rb') as stl:
        stl.seek(80)  # skip header
        trianglecount = struct.unpack('I', stl.read(4))[0]
        s = stl.read(50 * trianglecount)
        assert len(s) == 50 * trianglecount, (len(s), 50 * trianglecount)
        return np.frombuffer(s, triangle_fields, count=trianglecount)

def benchmark():
    triangles = parse('nist.stl')
    ## print("blah", sum(triangle["normal"]["x"] + triangle["v1"]["y"] + triangle["v2"]["z"] + triangle["v3"]["x"]
    ##                   for triangle in triangles))

time = min(timeit.Timer(benchmark).repeat(number=1, repeat=500)) * 1e6
print(str(time) + " μs")
```
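As a quick illustration of the bracketed access (again with a hand-built two-triangle buffer rather than nist.stl), field lookups also vectorize across the whole array, which a per-object loop can't do:

```python
import struct
import numpy as np

point = [("x", np.float32), ("y", np.float32), ("z", np.float32)]
triangle_fields = np.dtype([
    ("normal", point), ("v1", point), ("v2", point), ("v3", point),
    ("ignore", "S2"),
])

# Two 50-byte STL records: 12 little-endian floats + 2 padding bytes each.
raw = struct.pack('<12fh', *range(12), 0) + struct.pack('<12fh', *range(12, 24), 0)
triangles = np.frombuffer(raw, triangle_fields, count=2)

y = triangles[0]["v1"]["y"]    # the equivalent of triangles[0].v1.y
xs = triangles["normal"]["x"]  # one vectorized column: every normal.x at once
```

The itemsize of triangle_fields is 50 bytes, so np.frombuffer maps the raw STL records directly onto the array with no copying per field.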