
Commit 6c0d081

committed May 6, 2020
New site
1 parent 67709ca commit 6c0d081

29 files changed

+4435
-36
lines changed
 

‎README.md

+36-36
Original file line number | Diff line number | Diff line change
@@ -1,37 +1,37 @@
1-
## Welcome to GitHub Pages
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
LumoSQL
9+
=======
10+
11+
![](./images/lumo-logo-temp.svg "LumoSQL logo")
12+
13+
14+
Table of Contents
15+
=================
16+
17+
Welcome to the LumoSQL project, which builds on the excellent
18+
[SQLite](https://sqlite.org/) project without forking it. LumoSQL is an SQL database
19+
which can be used in embedded applications identically to SQLite, but also
20+
optionally with different storage backends and other additional behaviour.
21+
LumoSQL emphasises benchmarking, code reuse and modern database implementation.
22+
23+
* [Quick Start](./lumo-quickstart.md)
24+
* [LumoSQL Project Aims](./lumo-project-aims.md)
25+
* Creating a LumoSQL Ecosystem
26+
+ [The LumoSQL Landscape](./lumo-landscape.md)
27+
+ [Codebases relevant to LumoSQL](./lumo-relevant-codebases.md)
28+
+ [Full Knowledgebase Relevant to LumoSQL](./lumo-relevant-knowledgebase.md)
29+
* LumoSQL in Technical Detail
30+
+ [Architecture](./lumo-architecture.md)
31+
+ [Implementation](./lumo-implementation.md)
32+
* [Not-forking Scheme](./lumo-not-forking.md)
33+
* [Corruption Detection and Magic](./lumo-corruption-detection-and-magic.md)
34+
* [Benchmarking](./lumo-benchmarking.md)
35+
* [Legal Aspects](./lumo-legal-aspects.md)
36+
* [LumoSQL Documentation Standards](./lumo-doc-standards.md)
237

3-
You can use the [editor on GitHub](https://github.com/LumoSQL/lumosql.github.io/edit/master/README.md) to maintain and preview the content for your website in Markdown files.
4-
5-
Whenever you commit to this repository, GitHub Pages will run [Jekyll](https://jekyllrb.com/) to rebuild the pages in your site, from the content in your Markdown files.
6-
7-
### Markdown
8-
9-
Markdown is a lightweight and easy-to-use syntax for styling your writing. It includes conventions for
10-
11-
```markdown
12-
Syntax highlighted code block
13-
14-
# Header 1
15-
## Header 2
16-
### Header 3
17-
18-
- Bulleted
19-
- List
20-
21-
1. Numbered
22-
2. List
23-
24-
**Bold** and _Italic_ and `Code` text
25-
26-
[Link](url) and ![Image](src)
27-
```
28-
29-
For more details see [GitHub Flavored Markdown](https://guides.github.com/features/mastering-markdown/).
30-
31-
### Jekyll Themes
32-
33-
Your Pages site will use the layout and styles from the Jekyll theme you have selected in your [repository settings](https://github.com/LumoSQL/lumosql.github.io/settings). The name of this theme is saved in the Jekyll `_config.yml` configuration file.
34-
35-
### Support or Contact
36-
37-
Having trouble with Pages? Check out our [documentation](https://help.github.com/categories/github-pages-basics/) or [contact support](https://github.com/contact) and we’ll help you sort it out.

‎images/lumo-architecture-intro.jpg

553 KB

‎images/lumo-architecture-lumosql-theoretical-future.svg

+160

‎images/lumo-architecture-online-db-server-scale.svg

+208

‎images/lumo-architecture-online-db-server.svg

+157

‎images/lumo-architecture-sqlite-overview.svg

+158
+111

‎images/lumo-diagram-library.odg

20.7 KB
Binary file not shown.

‎images/lumo-doc-standards-intro.jpg

85 KB

‎images/lumo-ecosystem-intro.png

190 KB

‎images/lumo-implementation-intro.jpg

188 KB

‎images/lumo-logo-temp.svg

+288

‎images/lumo-logo.png

3.66 KB

‎images/lumo-project-aims-intro.jpg

93 KB
4.42 MB

‎images/lumo-signature.svg

+599

‎lumo-architecture.md

+219
@@ -0,0 +1,219 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [LumoSQL Architecture](#lumosql-architecture)
12+
* [Table of Contents](#table-of-contents)
13+
* [Online Database Servers](#online-database-servers)
14+
* [SQLite as an Embedded Database](#sqlite-as-an-embedded-database)
15+
* [LumoSQL Architecture](#lumosql-architecture-1)
16+
* [Database Storage Systems](#database-storage-systems)
17+
* [WALs in SQLite](#wals-in-sqlite)
18+
* [Single-level Store](#single-level-store)
19+
20+
LumoSQL Architecture
21+
====================
22+
23+
![](./images/lumo-architecture-intro.jpg "Shanghai Skyline from Pxfuel, CC0 license, https://www.pxfuel.com/en/free-photo-oyvbv")
24+
25+
26+
# Online Database Servers
27+
28+
All of the most-used databases other than SQLite work over a network, here
29+
called "online databases". This includes Postgresql, MariaDB, MySQL, SQLServer,
30+
Oracle, and so on.
31+
32+
![](./images/lumo-architecture-online-db-server.svg "What an online server database looks like")
33+
34+
An online database server has clients that connect to the server over a
35+
network. Once a network connection is opened, SQL queries are made by the
36+
client and data is returned from the server. Although all databases use one of
37+
the variants of the same SQL language, the means of connection is specific to each
38+
database.
39+
40+
For example, on a typical Debian Linux server there are these well-known ports:
41+
42+
```
43+
foo@zanahoria:/etc$ grep sql /etc/services
44+
45+
ms-sql-s 1433/tcp # Microsoft SQL Server
46+
ms-sql-m 1434/tcp # Microsoft SQL Monitor
47+
mysql 3306/tcp # MySQL
48+
postgresql 5432/tcp # PostgreSQL Database
49+
mysql-proxy 6446/tcp # MySQL Proxy
50+
```
51+
52+
with many other port assignments for other databases.
53+
54+
In the diagram above, each UserApp has a network connection to the SQL Database
55+
Server on a TCP port, for example 5432 if it is Postgresql. The UserApps could be
56+
running from anywhere on the internet, including on mobile devices. There is a
57+
limit to how many users a single database server can serve. The limit is at least
58+
in the many thousands, but internet applications often reach it.
59+
60+
![](./images/lumo-architecture-online-db-server-scale.svg "How an online database server scales")
61+
62+
The most obvious way to scale an online database is to add more RAM, CPU and storage to a single server. This is called "scaling up", and all code still runs in a single address space. The alternative is to add more servers and distribute queries between them. This is called "scaling out".
63+
64+
Nati Shalom describes the difference in the [article Scale-Out vs Scale-Up](http://ht.ly/cAhPe):
65+
66+
> One of the common ways to best utilize multi-core architecture in a context
67+
> of a single application is through concurrent programming. Concurrent
68+
> programming on multi-core machines (scale-up) is often done through
69+
> multi-threading and in-process message passing also known as the Actor
70+
> model. Distributed programming does something similar by distributing jobs
71+
> across machines over the network. There are different patterns associated
72+
> with this model such as Master/Worker, Tuple Spaces, BlackBoard, and
73+
> MapReduce. This type of pattern is often referred to as scale-out
74+
> (distributed).
75+
>
76+
> Conceptually, the two models are almost identical as in both cases we break a
77+
> sequential piece of logic into smaller pieces that can be executed in
78+
> parallel. Practically, however, the two models are fairly different from an
79+
> implementation and performance perspective. The root of the difference is the
80+
> existence (or lack) of a shared address space. In a multi-threaded scenario
81+
> you can assume the existence of a shared address space, and therefore data
82+
> sharing and message passing can be done simply by passing a reference. In
83+
> distributed computing, the lack of a shared address space makes this type of
84+
> operation significantly more complex. Once you cross the boundaries of a
85+
> single process you need to deal with partial failure and consistency. Also,
86+
> the fact that you can’t simply pass an object by reference makes the process
87+
> of sharing, passing or updating data significantly more costly (compared with
88+
> in-process reference passing), as you have to deal with passing of copies of
89+
> the data which involves additional network and serialization and
90+
> de-serialization overhead.
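
The difference the quotation describes between passing a reference and passing a copy can be sketched in a few lines of Python. This is only an illustration: `pickle` stands in for whatever wire format a real distributed system would use, and the `record` data is made up for the example.

```python
import pickle

record = {"id": 1, "tags": ["a", "b"]}

# Scale-up: code in the same address space receives a reference, so no
# bytes are copied and a change is visible to everyone holding the object.
same_process_view = record
same_process_view["tags"].append("c")

# Scale-out: the data must be serialized, sent, and deserialized, so the
# receiver works on an independent copy.
wire_bytes = pickle.dumps(record)        # what would travel over the network
remote_copy = pickle.loads(wire_bytes)
remote_copy["tags"].append("d")

print(record["tags"])        # the local object never sees the remote change
print(remote_copy["tags"])
print(len(wire_bytes), "bytes serialized for one small record")
```

Keeping the two copies consistent after the remote change is exactly the extra work that the scale-out model has to take on.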
91+
92+
# SQLite as an Embedded Database
93+
94+
The user applications are tightly connected to the SQLite library. Whether by
95+
dynamic linking to a copy of the library shared across the whole operating
96+
system, or static linking so that it is part of the same program as the user
97+
application, there is no networking involved. Making an SQL query and getting a
98+
response involves a cascade of function calls from the app to the library to
99+
the operating system and back again, typically taking less than 10 milliseconds,
100+
depending on the hardware used. An online database cannot expect to get
101+
results faster than 100 milliseconds, and often takes much longer depending on network and
102+
hardware. An online database relies on the execution of hundreds of millions
103+
more lines of code on at least two computers, whereas SQLite relies on the
104+
execution of some hundreds of thousands of lines on just one computer.
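
As a rough illustration of the in-process model, the sketch below uses Python's standard `sqlite3` module, which links the SQLite library into the running program. The table name and row contents are invented for the example, and the measured time will vary with hardware, so no particular figure is implied.

```python
import sqlite3
import time

# The database lives inside this process: no socket, no server, no port.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO t (name) VALUES (?)",
    [(f"row{i}",) for i in range(1000)],
)

# A query is a chain of function calls into the library and back.
start = time.perf_counter()
row = conn.execute("SELECT name FROM t WHERE id = ?", (500,)).fetchone()
elapsed_ms = (time.perf_counter() - start) * 1000

print(row[0], f"{elapsed_ms:.3f} ms")
conn.close()
```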
105+
106+
![](./images/lumo-architecture-sqlite-overview.svg "Overview of SQLite as an embedded database")
107+
108+
109+
![](./images/lumo-architecture-sqlite-parts.svg "The simplest view of the three parts to SQLite in typical embedded use")
110+
111+
112+
# How LumoSQL Architecture Differs from SQLite
113+
114+
![](./images/lumo-architecture-lumosql-theoretical-future.svg "Where LumoSQL architecture is headed")
115+
116+
# Database Storage Systems
117+
118+
LumoSQL has several features that are in advance of every other
119+
widely-used database. With the first prototype complete with an LMDB backend,
120+
LumoSQL is already the first major SQL database to move away from batch
121+
processing, since it has a backend that does not use Write-Ahead Logs. LumoSQL
122+
also needs to be able to use both the original SQLite and additional storage
123+
mechanisms, and any or all of these storage backends at once. Not all future
124+
storage will be on local disk, or btree key-values.
125+
126+
[Write-ahead Logging in Transactional Databases](https://en.wikipedia.org/wiki/Write-ahead_logging) has been the only
127+
way since the 1990s that atomicity and durability are provided in
128+
databases. A version of the same technique is used in filesystems, where it is
129+
called [journalling](https://en.wikipedia.org/wiki/Journaling_file_system).
130+
Write-ahead Logging (WAL) is a method of making sure that all modifications to
131+
a file are first written to a separate log, and then they are merged (or
132+
updated) into a master file in a later step. If this update operation is
133+
aborted or interrupted, the log has enough information to undo the updates and
134+
reset the database to the state before the update began. Implementations need
135+
to solve the problem of WAL files growing without bound, which means some kind
136+
of whole-database snapshot or checkpoint is required.
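
The log-then-merge idea can be sketched with a toy key-value store. This is a deliberately simplified, redo-style illustration and not how SQLite or any other database implements its WAL: the `ToyWAL` class, the JSON file format and the file names are all assumptions made for the example.

```python
import json
import os
import tempfile


class ToyWAL:
    """Toy store: changes go to a log first, then are merged into a master file."""

    def __init__(self, directory):
        self.master_path = os.path.join(directory, "master.json")
        self.log_path = os.path.join(directory, "master.json.log")
        if not os.path.exists(self.master_path):
            with open(self.master_path, "w") as f:
                json.dump({}, f)

    def write(self, key, value):
        # Step 1: append the change to the log and force it to disk.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def read_all(self):
        # Readers see the master file plus any logged changes not yet merged.
        with open(self.master_path) as f:
            data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    data[entry["key"]] = entry["value"]
        return data

    def checkpoint(self):
        # Step 2: merge the log into the master file, then discard the log
        # so that it does not grow without bound.
        data = self.read_all()
        tmp_path = self.master_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.master_path)  # atomic rename
        if os.path.exists(self.log_path):
            os.remove(self.log_path)


workdir = tempfile.mkdtemp()
store = ToyWAL(workdir)
store.write("a", 1)
store.write("b", 2)
before = store.read_all()      # served from master + log
store.checkpoint()
after = store.read_all()       # served from master alone
log_exists_after = os.path.exists(store.log_path)
print(before, after, log_exists_after)
```

If the process dies before `checkpoint()` runs, the log still holds every acknowledged write, which is the property the paragraph above relies on.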
137+
138+
WALs seek to address issues with concurrent transactions, and reliability in
139+
the face of crashes or errors. There are decades of theory around how to
140+
implement WAL, and it is a significant part of any University course in
141+
database internals. As well as somewhat-reliable commit and rollback, it is the
142+
WAL that lets all the main databases in use offer online backup features, and
143+
point-in-time recovery. Every WAL feature and benefit comes down to being able
144+
to have a stream of atomic operations that can be replayed forwards or
145+
backwards.
146+
147+
WAL is inherently batch-oriented. The more a WAL-based database tries to be
148+
real time, the more expensive it is to keep all WAL functionality working.
149+
150+
The WAL implementation in the most common networked databases is comprehensive
151+
and usually kept as a rarely-seen technical feature. Postgresql is an exception,
152+
going out of its way to inform administrators how the WAL system works and what
153+
can be done with access to the log files.
154+
155+
All the most common networked databases describe their WAL implementation and
156+
most offer some degree of control over it:
157+
158+
* [Postgresql](https://www.postgresql.org/docs/12/wal-intro.html)
159+
* [SQL Server](https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-log-architecture-and-management-guide?view=sql-server-ver15)
160+
* [Oracle Log Writer Process](https://docs.oracle.com/en/database/oracle/oracle-database/19/cncpt/process-architecture.html#GUID-B6BE2C31-1543-4504-9763-6FFBBF99DC85)
161+
* [MySQL ReDo Log](https://dev.mysql.com/doc/refman/8.0/en/optimizing-innodb-logging.html)
162+
* [MariaDB Undo Log](https://mariadb.com/kb/en/library/innodb-undo-log/)
163+
164+
Companies have invested billions of Euros into these codebases, with stability
165+
and reliability as their first goal. And yet even with all the runtime
166+
advantages of huge resources and stable datacentre environments - even these
167+
companies can't make WALs fully deliver on reliability.
168+
169+
These issues are well-described in the case of Postgresql. Postgresql has an
170+
easier task than SQLite in the sense it is not intended for unpredictable
171+
embedded use cases, and also that Postgresql has a large amount of code
172+
dedicated to safe WAL handling. Even so, Postgresql still requires its users
173+
to make compromises regarding reliability. For example [this WAL mitigation
174+
article](https://dzone.com/articles/postgresql-why-and-how-wal-bloats)
175+
describes a few of the tradeoffs of merge frequency vs reliability in the case
176+
of a crash. This is a very real problem for every traditional database and that
177+
includes SQLite - which does not have a fraction of the WAL-handling code of
178+
the large databases, and which is frequently deployed in embedded use cases
179+
where crashes and resets happen very frequently.
180+
181+
## WALs in SQLite
182+
183+
SQLite WALs are special.
184+
185+
The [SQLite WAL](https://www.sqlite.org/draft/wal.html) requires multiple
186+
files to be maintained in synch, otherwise there will be corruption. Unlike the
187+
other databases listed here, SQLite has no pre-emptive corruption detection and
188+
only fairly basic on-demand detection.
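
The multiple files are easy to observe. The following sketch, using Python's standard `sqlite3` module, switches a scratch database to WAL mode and lists the directory while the connection is open. The file name `demo.db` and the table are invented for the example.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "demo.db")

conn = sqlite3.connect(db_path)
# PRAGMA journal_mode returns the mode that is now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# While the connection is open, SQLite keeps companion files
# (the -wal log and the -shm index) next to the main database file.
files_while_open = sorted(os.listdir(workdir))
print(mode, files_while_open)

# A checkpoint copies WAL content back into the main database file.
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
```

Copying or restoring only `demo.db` while the other files hold uncheckpointed data is one way the files can get out of step.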
189+
190+
## Single-level Store
191+
192+
Single-level store concepts are well-explained in [Howard Chu's 2013 MDB Paper](./lumo-relevant-knowledgebase.md#list-of-sqlite-code-related-knowledge):
193+
194+
> One fundamental concept behind the MDB approach is known as "Single-Level
195+
> Store". The basic idea is to treat all of computer memory as a single address
196+
> space. Pages of storage may reside in primary storage (RAM) or in secondary
197+
> storage (disk) but the actual location is unimportant to the application. If
198+
> a referenced page is currently in primary storage the application can use it
199+
> immediately, if not a page fault occurs and the operating system brings the
200+
> page into primary storage. The concept was introduced in 1964 in the Multics
201+
> operating system but was generally abandoned by the early 1990s as data
202+
> volumes surpassed the capacity of 32 bit address spaces. (We last knew of it
203+
> in the Apollo DOMAIN operating system, though many other Multics-influenced
204+
> designs carried it on.) With the ubiquity of 64 bit processors today this
205+
> concept can again be put to good use. (Given a virtual address space limit of
206+
> 63 bits that puts the upper bound of database size at 8 exabytes. Commonly
207+
> available processors today only implement 48 bit address spaces, limiting us
208+
> to 47 bits or 128 terabytes.) Another operating system requirement for this
209+
> approach to be viable is a Unified BufferCache. While most POSIX-based
210+
> operating systems have supported an mmap() system call for many years, their
211+
> initial implementations kept memory managed by the VM subsystem separate from
212+
> memory managed by the filesystem cache. This was not only wasteful
213+
> (again, keeping data cached in two places at once) but also led to coherency
214+
> problems - data modified through a memory map was not visible using
215+
> filesystem read() calls, or data modified through a filesystem write() was not
216+
> visible in the memory map. Most modern operating systems now have filesystem
217+
> and VM paging unified, so this should not be a concern in most deployments.
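
The coherency property mentioned at the end of the quotation can be checked with a short experiment. This is a sketch that assumes a POSIX system with a unified buffer cache, since `os.pread` and `os.pwrite` are not available everywhere. It maps a scratch file into memory and compares what the map and ordinary file I/O each see.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * mmap.PAGESIZE)   # the file needs a size before mapping

with mmap.mmap(fd, mmap.PAGESIZE) as mapped:
    # Modify the file through the memory map...
    mapped[0:5] = b"hello"
    # ...and read it back with an ordinary filesystem read.
    via_read = os.pread(fd, 5, 0)

    # Now modify through a filesystem write and look at the memory map.
    os.pwrite(fd, b"world", 0)
    via_map = bytes(mapped[0:5])

os.close(fd)
os.remove(path)
print(via_read, via_map)
```

On a system with the older split caches described in the quotation, the two views could disagree until the data was explicitly flushed.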
218+
219+

‎lumo-benchmarking.md

+345
Large diffs are not rendered by default.
+143
@@ -0,0 +1,143 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
Table of Contents
8+
=================
9+
10+
* [Summary of SQL Database Corruption Detection](#summary-of-sql-database-corruption-detection)
11+
* [SQLite and Integrity Checking](#sqlite-and-integrity-checking)
12+
* [LumoSQL Checksums and the SQLite On-disk File Format](#lumosql-checksums-and-the-sqlite-on-disk-file-format)
13+
* [Design for Corruption Detection](#design-for-corruption-detection)
14+
15+
16+
![](./images/lumo-corruption-detection-and-magic-intro.png "XXXXXXXX")
17+
18+
# Summary of SQL Database Corruption Detection
19+
20+
One of the short-term goals stated in the [LumoSQL Project Aims](./lumo-project-aims.md) is:
21+
22+
> LumoSQL will improve SQLite quality and privacy compliance by introducing
23+
> optional on-disk checksums for storage backends including to the original
24+
> SQLite btree format. This will give real-time row-level corruption detection.
25+
26+
It seems quite extraordinary that in 2020 none of the major online databases -
27+
not Postgresql, Oracle, MariaDB, SQLServer or others - have the ability to check
28+
during a SELECT operation that the row being read from disk is exactly the row
29+
that was previously written. There are many reasons why data can get modified,
30+
deleted or overwritten outwith the control of the database, and the ideal way
31+
to respond to this is to notify the database when a corrupt row is accessed.
32+
All that is needed is for a hash of the row to be stored with the row when it
33+
is written.
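
The idea of storing a hash with each row can be illustrated at the application level. The sketch below is not the LumoSQL design, which places the checksum inside the storage backend. It is a SQL-level approximation using Python's standard `sqlite3` and `hashlib` modules, with a made-up `accounts` table and helper names.

```python
import hashlib
import sqlite3


def row_hash(name, balance):
    # Hash a stable text form of the row's user-visible columns.
    payload = repr((name, balance)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, name TEXT, balance INTEGER, row_hash TEXT)"
)


def insert_account(name, balance):
    # The hash is written together with the row.
    conn.execute(
        "INSERT INTO accounts (name, balance, row_hash) VALUES (?, ?, ?)",
        (name, balance, row_hash(name, balance)),
    )


def read_account(account_id):
    # The hash is recomputed and compared every time the row is read.
    name, balance, stored = conn.execute(
        "SELECT name, balance, row_hash FROM accounts WHERE id = ?",
        (account_id,),
    ).fetchone()
    if row_hash(name, balance) != stored:
        raise sqlite3.DatabaseError(f"row {account_id} failed its checksum")
    return name, balance


insert_account("alice", 100)
insert_account("bob", 250)

# Simulate corruption: change a value without updating the stored hash.
conn.execute("UPDATE accounts SET balance = 999 WHERE id = 2")

print(read_account(1))
try:
    read_account(2)
    corruption_detected = False
except sqlite3.DatabaseError:
    corruption_detected = True
print("corruption detected:", corruption_detected)
```

Only the damaged row is reported, so the rest of the table can still be trusted.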
34+
35+
All the major online databases have the capacity for an external process to
36+
check disk files for database corruption, as does SQLite. This is very
37+
different from checking each row as it is read, and cannot be done in real time.
38+
39+
Knowing that a corruption problem is limited to a row or an itemised
40+
list of rows reduces a general "database corruption problem" down to a bounded
41+
reconstruction task. Users can have confidence in the remainder of a database
42+
even if there is corruption found in some rows.
43+
44+
This problem has been recognised and solved inefficiently at the SQL level by various projects. Two of these are
45+
[Periscope Data's Per-table Multi-database Solution](https://www.periscopedata.com/blog/hashing-tables-to-ensure-consistency-in-postgres-redshift-and-mysql) and [Percona's Postgresql Public Key Row Tracking](https://www.percona.com/blog/2018/10/12/track-postgresql-row-changes-using-public-private-key-signing/). By using SQL code rather than modifying the database internals there is a performance hit. Both these companies specialise in performance optimisation but choose not to apply it to this feature, suggesting they are not convinced of high demand from users.
46+
47+
Interestingly, all the big online databases have row-level security, which has many similarities to the problem of corruption detection.
48+
49+
For those databases that offer encryption, this is effectively page-level or
50+
column-based hashes and therefore there is corruption detection by implication.
51+
However this is not row-based checksumming, and it is not on by default in any
52+
of the most common databases.
53+
54+
It is possible to introduce a checksum on database pages more easily than for
55+
every row, and transparently to database users. However, knowing a database
56+
page is corrupt isn't much help to the user, because there could be many rows
57+
in a single page.
58+
59+
# SQLite and Integrity Checking
60+
61+
The SQLite developers go to great lengths to avoid database corruption, within their project goals. Nevertheless, corrupted SQLite databases are an everyday occurrence.
62+
63+
SQLite does have checksums already in some places:
64+
65+
* for the journal transaction log (superseded by the Write Ahead Log system)
66+
* for each database page when using the closed-source SQLite Encryption Extension
67+
* for each page in a WAL file
68+
69+
SQLite also has [PRAGMA integrity_check](https://www.sqlite.org/pragma.html#pragma_integrity_check) and
70+
[PRAGMA quick_check](https://www.sqlite.org/pragma.html#pragma_quick_check)
71+
which do partial checking, and which do not require exclusive access to the
72+
database. These checks have to scan the database file sequentially and verify
73+
the logic of its structure, because there are no checksums available to make it
74+
work more quickly.
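
Both pragmas can be run from any SQLite connection. A minimal sketch using Python's standard `sqlite3` module on a small scratch database follows. The table is invented for the example, and a healthy database reports a single row containing `ok`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO t (v) VALUES ('x')")
conn.commit()

# Each pragma walks the database structure and returns either a single
# 'ok' row or one row per problem found.
integrity = conn.execute("PRAGMA integrity_check").fetchall()
quick = conn.execute("PRAGMA quick_check").fetchall()
print(integrity, quick)
conn.close()
```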
75+
76+
None of these are even close to the accuracy, reliability and speed of row-level corruption detection.
77+
78+
SQLite does have a file change counter in its database header, in
79+
[offset 24 of the official file format](https://www.sqlite.org/fileformat.html), however this
80+
is not itself subject to integrity checks nor does it contain information about the rest of the file,
81+
so it is a hint rather than a guarantee.
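
The counter can be read directly from the file. The sketch below assumes the header layout given in the SQLite file format documentation (a 16-byte magic string, then a 4-byte big-endian change counter at offset 24) and the default rollback-journal mode. The helper name `file_change_counter` is made up for the example.

```python
import os
import sqlite3
import struct
import tempfile


def file_change_counter(path):
    # Read the 100-byte database header and decode bytes 24..27.
    with open(path, "rb") as f:
        header = f.read(100)
    if header[:16] != b"SQLite format 3\x00":
        raise ValueError("not an SQLite 3 database file")
    return struct.unpack(">I", header[24:28])[0]


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "counter.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
first = file_change_counter(db_path)

conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
second = file_change_counter(db_path)
conn.close()

print(first, second)
```

The counter says that something changed, but nothing about what changed or whether the change was legitimate.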
82+
83+
SQLite needs row-level integrity checking even more than the online databases because:
84+
85+
* SQLite embedded and IoT use cases often involve frequent power loss, which is the most likely time for corruption to occur.
86+
* an SQLite database is an ordinary filesystem disk file stored wherever the user decided, which can often be deleted or overwritten by any unprivileged process.
87+
* it is easy to back up an SQLite database partway through a transaction, meaning that the restore will be corrupted.
88+
* SQLite does not have robust locking mechanisms available for access by multiple processes at once, since it relies on lockfiles and POSIX advisory locking.
89+
* SQLite provides the [VFS API Interface](https://www.sqlite.org/vfs.html) which users can easily misuse to ignore locking via the sqlite3_*_v2 APIs.
90+
* the on-disk file format is seemingly often corrupted regardless of use case. Better evidence on this is needed, but the authors of SQLite data file recovery software (see the listing in [SQLite Relevant Knowledgebase](./lumo-relevant-knowledgebase.md)) indicate high demand for their services. Informal shows of hands at conferences indicate that SQLite users expect corruption.
91+
92+
sqlite.org has a much more detailed, but still incomplete, summary of [How to Corrupt an SQLite Database](https://www.sqlite.org/howtocorrupt.html).
93+
94+
# LumoSQL Checksums and the SQLite On-disk File Format
95+
96+
The SQLite database format is widely used as a de facto standard. LumoSQL ships
97+
with the lumo-backend-mdb-traditional which is the unmodified SQLite on-disk
98+
format, the same code generating the same data. There is no corruption
99+
detection included in the file format for this backend. However corruption
100+
detection is available for the traditional backend, and other backends that do
101+
not have scope for checksums in their headers. For all of these backends,
102+
LumoSQL offers a separate metadata file containing integrity information.
103+
104+
The new backend lumo-backend-mdb-updated adds row-level checksums in the header
105+
but is otherwise identical to the traditional SQLite MDB format.
106+
107+
There is an argument that any change at all is the same as having a completely
108+
different format. This is not a strong argument against adding checksums to
109+
the traditional SQLite on-disk format because with encryption increasingly
110+
becoming mandatory, the standard cannot apply. The sqlite.org closed-source SEE
111+
solution is described as "All database content, including the metadata, is
112+
encrypted so that to an outside observer the database appears to be white
113+
noise." Other solutions are possible involving metadata that is not encrypted
114+
(but definitely checksummed), but in any case, there is no on-disk standard for
115+
SQLite databases with encryption.
116+
117+
# Design for Corruption Detection
118+
119+
All LumoSQL backends can have corruption detection enabled, with the metadata
120+
stored either directly in the backend database files, or in a separate file.
121+
When a user switches on checksums for a database, metadata needs to be stored.
122+
123+
This depends on two new functions needed in any case for labelling LumoSQL
124+
databases provided by backend-magic.c: lumosql_set_magic() and lumosql_get_magic(). These functions add and
125+
read a unique metadata signature to a LumoSQL database.
126+
127+
1. if possible magic is inserted into the existing header
128+
129+
2. if not a separate "metadata" b-tree is created which contains a key "magic"
130+
and the appropriate value. get_magic() will look for the special metadata
131+
b-tree and the "magic" key
132+
133+
After LumoSQL has determined how and where metadata will be stored, the high-level design for row-level checksums is:
134+
135+
1. an internally maintained row hash updated with every change to a row
136+
2. If a corruption is detected on read, LumoSQL should make maximum relevant fuss. At minimum, [error code 11 is SQLITE_CORRUPT](https://www.sqlite.org/rescode.html#corrupt)
137+
3. An additional SQL user command is added that exposes this hash in a column so that user-level logic can do not only corruption detection, but also change detection.
138+
139+
At a later stage a column checksum can be added giving change detection on a table, or corruption detection for read-only tables.
140+
141+
In the case where there is a separate metadata file, a function pair in lumo-backend-magic.c reads and writes a whole-of-file checksum for the database. This can't be done where metadata is stored in the main database file.
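
A whole-of-file checksum kept in a separate file might look like the sketch below. Everything here is hypothetical: the `.meta` suffix, the JSON layout and the function names are assumptions for illustration, not the LumoSQL metadata format or API.

```python
import hashlib
import json
import os
import sqlite3
import tempfile


def file_sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_file_checksum(db_path):
    # Hypothetical sidecar file: "<database>.meta" holding one JSON object.
    with open(db_path + ".meta", "w") as f:
        json.dump({"sha256": file_sha256(db_path)}, f)


def verify_file_checksum(db_path):
    with open(db_path + ".meta") as f:
        return json.load(f)["sha256"] == file_sha256(db_path)


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "data.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()

write_file_checksum(db_path)
ok_before = verify_file_checksum(db_path)

# Simulate damage by flipping every bit of one byte in the database file.
with open(db_path, "r+b") as f:
    f.seek(200)
    original = f.read(1)
    f.seek(200)
    f.write(bytes([original[0] ^ 0xFF]))

ok_after = verify_file_checksum(db_path)
print(ok_before, ok_after)
```

The sketch also shows why the checksum cannot live inside the file it covers: writing the checksum into the database file would change the bytes being hashed.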
142+
143+

‎lumo-doc-standards.md

+290
@@ -0,0 +1,290 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [LumoSQL Documentation Standards](#lumosql-documentation-standards)
12+
* [Contributions to LumoSQL Documentation are Welcome](#contributions-to-lumosql-documentation-are-welcome)
13+
* [LumoSQL Respects Documentation for SQLite, LMDB and More](#lumosql-respects-documentation-for-sqlite-lmdb-and-more)
14+
* [Text Standards and Tools](#text-standards-and-tools)
15+
* [Diagram Standards and Tools](#diagram-standards-and-tools)
16+
* [LumoSQL Diagram Signature](#lumosql-diagram-signature)
17+
* [Using the LumoSQL Diagram Library](#using-the-lumosql-diagram-library)
18+
* [Adding Diagrams](#adding-diagrams)
19+
* [Diagram Style Guide](#diagram-style-guide)
20+
* [Image Standards and Tools](#image-standards-and-tools)
21+
* [Previewing Markdown before Pushing](#previewing-markdown-before-pushing)
22+
* [Copyright for LumoSQL Documentation](#copyright-for-lumosql-documentation)
23+
* [Metadata Header for Text Files](#metadata-header-for-text-files)
24+
* [Human Languages - 人类语言](#human-languages---人类语言)
25+
* [Creating and Maintaining Table of Contents](#creating-and-maintaining-table-of-contents)
26+
* [Tidying Markdown (mostly not required)](#tidying-markdown-mostly-not-required)
27+
28+
29+
LumoSQL Documentation Standards
30+
===============================
31+
32+
This chapter covers how LumoSQL documentation should be written and maintained.
33+
34+
![](./images/lumo-doc-standards-intro.jpg "Image from Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Chinese_books_at_a_library.jpg")
35+
36+
# Contributions to LumoSQL Documentation are Welcome
37+
38+
The first rule of LumoSQL documentation is "Yes please, we'd be delighted to
39+
receive patches and pull requests, in any way you want to make them". Anyone
40+
who has gone to the trouble to write down something useful about LumoSQL is our
41+
friend. We know there's a lot to fix.
42+
43+
If you want to make a quick documentation fix, then edit the Markdown and send
44+
it to us by any means you like, especially a GitHub Issue or Pull Request. You
45+
might just want to send us some improved paragraphs on their own. If this
46+
sounds like you, stop reading now and get on with sending us text :-)
47+
48+
If you want to do something more serious with the documentation then you need
49+
to read on, learning about our standards, recommended tools and processes.
50+
51+
* The main website text, under the directory `doc/`.
52+
    * Text, such as this document you are reading, stored in the directory `doc/www`
53+
    * Images, such as PNG or JPEG format, stored in `doc/www/images`
54+
    * Images that are captured from videos and used in the docs as thumbnails, also in `doc/www/images`
55+
56+
The Markdown files are standalone and complete - you can read them online just as they are.
57+
58+
The file `doc/www/Makefile` is an evolving tool to test these Markdown files, and soon will also
59+
be for generating images and probably the tables of contents.
60+
61+
# LumoSQL Respects Documentation for SQLite, LMDB and More
62+
63+
LumoSQL Documentation is standalone in every way, including formats, tools and standards.
64+
65+
However, LumoSQL documentation refers to and should be consulted together with the [SQLite
66+
documentation](https://www.sqlite.org/docs.html), because with the following
67+
exceptions, LumoSQL works (or should work) in exactly the same way as SQLite.
68+
LumoSQL definitely does not want to duplicate SQLite documentation, and regards the
69+
excellent SQLite documentation as definitive except where indicated.
70+
71+
Differences with SQLite arise:
72+
73+
* Where there is an extra/different storage backend to the SQLite Btree storage system
74+
* Where there are extra parameters in the user interface (commandline, API, pragmas) for another backend
75+
* When describing how the LumoSQL source tree works
76+
* When LumoSQL is working as other than an embedded library
77+
* When LumoSQL has an extra/different frontend to the SQLite SQL processor
78+
79+
It isn't only SQLite documentation that LumoSQL embraces. There is also [LMDB
80+
Documentation](http://www.lmdb.tech/doc/), and more to come as LumoSQL integrates more
81+
components. It is very important that LumoSQL not attempt to replicate these
82+
other documentation efforts that are kept up to date along with the corresponding code.
83+
84+
# Text Standards and Tools
85+
86+
LumoSQL documentation will be written in [GitHub-flavoured
87+
Markdown](https://github.github.com/gfm/) as supported by many tools including
88+
the well-known [Pandoc](https://pandoc.org), version 2.0 or higher. LumoSQL documentation will not be
89+
highly specific to any system. The main extension Github-flavoured Markdown
90+
(GFM) adds is tables and code blocks, and a single switch in Pandoc can change
91+
that dependency.
92+
93+
Text encoding will be [UTF-8](https://en.wikipedia.org/wiki/UTF-8) . Here is
94+
one [expert anecdote about why UTF-8 matters](https://yihui.org/en/2018/11/biggest-regret-knitr/).
95+
96+
Versions of Pandoc earlier than 2.0 did not support Markdown well as an output format, and the
97+
Lua extension system was insufficient for LumoSQL's HTML generation needs.
98+
99+
One difference between Pandoc Markdown and GFM is the number of spaces for nested lists. Two
100+
spaces are sufficient for GFM, but Pandoc requires four spaces.
101+
102+
# Diagram Standards and Tools
103+
104+
## LumoSQL Diagram Signature
105+
106+
The LumoSQL Diagram Signature is identical to the LumoSQL image signature. It should be
107+
placed on the bottom right hand corner of all diagrams created for LumoSQL, but not on
108+
diagrams from other sources unless modified for LumoSQL.
109+
110+
## Using the LumoSQL Diagram Library
111+
112+
The file images/lumo-diagram-library.odg is a LibreOffice Draw document containing all
113+
the elements likely needed for LumoSQL technical diagrams. If you find that you need to
114+
add a new element when making a diagram, you should also add it to this document.
115+
116+
The lumo-signature file is to be added to the base of all LumoSQL diagrams and images.
117+
It contains the logo and copyright string.
118+
119+
All other diagrams in images/ are PNG format final diagrams and SVG format process
120+
diagrams kept for ease of editing, as exported by LibreOffice, inkscape and others.
121+
122+
## Adding Diagrams
123+
124+
The current process for making diagrams is as follows.
125+
126+
1. Make in LibreOffice Draw.
127+
1.1 Reset corners of box elements to their proper radii (LibreOffice modifies this when scaling boxes).
128+
1.2. Export as SVG.
129+
2. Convert to png and add signature.
130+
2.1 Trim borders and output: `$ convert -density 200 -trim MyLibreOfficeOutput.svg MyNewDiagram.png`
131+
2.2 Re-border with space for the logo (adjust border as required if the signature doesn't fit): `$ convert MyNewDiagram.png -bordercolor white -border 40x40 -gravity south -splice 0x80 MyNewDiagram.png`
132+
2.3 Add logo and copyright information: `$ composite -density 200 -gravity SouthEast lumo-signature.svg MyNewDiagram.png MyNewDiagram.png`
133+
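The conversion steps above could be scripted. What follows is a minimal sketch only, assuming ImageMagick's `convert` and `composite` are on the PATH. The function names `diagram_commands` and `build_diagram` are hypothetical and are not part of the LumoSQL Makefile.

```python
import subprocess
from typing import List


def diagram_commands(svg_path: str, png_path: str,
                     signature: str = "lumo-signature.svg",
                     border: str = "40x40",
                     splice: str = "0x80") -> List[List[str]]:
    """Return the three ImageMagick invocations from the steps above.

    The border and splice sizes are the ones quoted in the text; adjust
    them if the signature does not fit.
    """
    return [
        # Trim borders and rasterise the SVG.
        ["convert", "-density", "200", "-trim", svg_path, png_path],
        # Re-border, leaving space at the bottom for the signature.
        ["convert", png_path, "-bordercolor", "white", "-border", border,
         "-gravity", "south", "-splice", splice, png_path],
        # Overlay the logo and copyright string in the bottom right corner.
        ["composite", "-density", "200", "-gravity", "SouthEast",
         signature, png_path, png_path],
    ]


def build_diagram(svg_path: str, png_path: str) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in diagram_commands(svg_path, png_path):
        subprocess.run(argv, check=True)
```

Calling `build_diagram("MyLibreOfficeOutput.svg", "MyNewDiagram.png")` would then run the same three commands as the manual process.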
134+
## Diagram Style Guide
135+
136+
Colour palette: LibreOffice 'standard'.
137+
Fonts: *Source (Han) Sans Medium* or *Noto Sans Medium* due to their on-screen clarity and good language support (both are 100% compatible)
138+
Corner radii: OS and large container boxes: 0.4, small box elements: 0.25
139+
140+
# Image Standards and Tools
141+
142+
Images for LumoSQL documentation will be stored in /images/ and the
143+
filenames should start with `lumo-` . PNG should be the default image format,
144+
followed by JPG.
145+
146+
Include attribution in the alt-text tag. All images should have attribution,
147+
even if the LumoSQL project provided them. The caption should be left out if
148+
the image is self-evident and the alt-text also explains what the image is.
149+
This example is approximately from the top of this chapter:
150+
151+
```
152+
![Optional caption, eg "Chart of Badgers vs Profit"](./images/lumo-doc-standards-intro.jpg "Image from Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Chinese_books_at_a_library.jpg")
153+
```
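A small check for these image conventions might look like the following. This is a sketch under assumptions: it follows the example above in treating the quoted string after the path as the attribution, and the function name `image_problems` is hypothetical.

```python
import posixpath
import re
from typing import List

# Matches ![caption](path "attribution"); the quoted attribution is optional
# in the pattern so that its absence can be reported.
_IMAGE = re.compile(
    r'!\[(?P<alt>[^\]]*)\]\((?P<path>[^\s)]+)(?:\s+"(?P<title>[^"]*)")?\)'
)


def image_problems(markdown: str) -> List[str]:
    """List convention violations for every image reference in the text."""
    problems = []
    for match in _IMAGE.finditer(markdown):
        name = posixpath.basename(match.group("path"))
        if not name.startswith("lumo-"):
            problems.append(f"{name}: filename should start with 'lumo-'")
        if not (match.group("title") or "").strip():
            problems.append(f"{name}: missing attribution string")
    return problems
```

An empty list means every image reference found uses a `lumo-` filename and carries an attribution string.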
154+
155+
# Previewing Markdown before Pushing
156+
157+
It's best to check syntax before pushing changes, which means rendering
158+
Markdown into HTML that is hopefully close to what Github produces. Here are three ways of doing that:
159+
160+
* The Makefile and support files in bin/ use Pandoc to render the GFM to HTML in /tmp .
161+
* The excellent [Editor.md](https://github.com/pandao/editor.md) does a great job of rendering,
162+
as can be seen at [The Online Installation](https://pandao.github.io/editor.md/en.html) . You can paste GFM into it and see it rendered, WYSIWYG-style. You can download the HTML for
163+
Editor.md and run it locally. (Editor.md is also an editor, and it adds its own features, but you don't need to use it for that.)
164+
* If it suits your workflow, you can use the Preview button in the Github user interface.
165+
166+
# Copyright for LumoSQL Documentation
167+
168+
LumoSQL documentation is original and copyrighted under the
169+
[Creative Commons By-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode),
170+
except where indicated. Mostly it's better to link to the original, but if you
171+
need to cite paragraphs of someone else's documentation then attribute, and if
172+
more, check the license on the original.
173+
174+
The Creative Commons copyright applies to all LumoSQL documentation media.
175+
176+
Some documentation or media brings conditions of use with it, especially
177+
attribution, and this must be respected.
178+
179+
# Metadata Header for Text Files
180+
181+
The first lines of all LumoSQL documentation files should always be something like this:
182+
183+
```
184+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
185+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
186+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
187+
<!-- SPDX-FileType: Documentation -->
188+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
189+
```
190+
191+
# Human Languages - 人类语言
192+
193+
English is currently the main documentation language. Others are welcome, and
194+
not just as translations. For example, embedded SQL is particularly important
195+
in China and we welcome original content. To make it feel welcoming, we have tried
196+
to make all the illustrative images in LumoSQL inclusive of the Chinese language.
197+
198+
# Creating and Maintaining Table of Contents
199+
200+
LumoSQL had to make a decision about creating navigable ToC indexes. We would rather not
201+
write our own tools or scripts. At the moment the following is what we have.
202+
203+
The problem we have is summarised in a [well-known Github bug report](https://github.com/isaacs/github/issues/215):
204+
205+
> When I see a manually generated table of contents, it makes me sad.
206+
> When I see a huge README that is impossible to navigate without it, it makes me even sadder.
207+
> LaTeX has it. Gollum has it. Pandoc has it. So why not Github Format Markdown?
208+
209+
**LumoSQL Decision as of March 2020**: ToC Markdown must appear in the raw markdown. That means a TOC
210+
needs to be created and then inserted into the original source markdown file
211+
rather than automatically generated as part of an online rendering process or offline pipeline.
212+
213+
**Non-markdown metadata won't work:** With Pandoc, when writing, say, a report
214+
in Markdown, a tiny bit of metadata at the top of the file allows us to say
215+
`\tableofcontents` and `/usr/bin/pandoc` will then produce a beautiful PDF,
216+
and also other formats such as HTML. However, LumoSQL documentation needs to
217+
be processed by renderers that are a lot less sophisticated than Pandoc,
218+
including the Github markup processor. So we can't rely on metadata.
219+
220+
**Pandoc's Markdown output is improving but not yet good enough:** Pandoc can
221+
read Markdown and output Markdown, including a ToC. A command such as
222+
223+
```pandoc --standalone -f gfm -t gfm --toc -o lumo-output.md -i lumo-input.md```
224+
225+
is supposed to work and probably does; we just haven't seen it yet. Pandoc's Markdown
226+
output used to be poor, but since version 2.0 it has improved a lot. Pandoc --toc is
227+
hopefully the eventual answer, although as of 2.9 it doesn't seem to work at all, despite
228+
the documentation claiming it does.
229+
230+
**We are left with ad-hoc processing solutions for now:**
231+
232+
* Use the Github API: The most practical solution we have for now is the
233+
[github-markdown-toc](https://github.com/ekalinin/github-markdown-toc) bash
234+
script:
235+
236+
```
237+
$ wget https://raw.githubusercontent.com/ekalinin/github-markdown-toc/master/gh-md-toc && chmod a+x gh-md-toc
238+
$ ./gh-md-toc some-lumosql-document.md > /tmp/toc.md
239+
```
240+
241+
Then insert the file /tmp/toc.md into the document using your editor. It's not
242+
a pretty operation but given all the other advantages of Markdown it seems a
243+
small price to pay. This script can now be found in ```www/bin/gh-md-toc``` .
244+
It uses the Github API and therefore produces canonical results, so that means
245+
it needs internet access. After more testing, perhaps we can trust the
246+
`--insert` option and then include gh-md-toc in the documentation Makefile.
247+
248+
The way the API works is made clear in the comments:
249+
250+
# Converts local md file into html by GitHub
251+
# $ curl -X POST --data '{"text": "Hello world github/linguist#1 **cool**, and #1!"}' https://api.github.com/markdown
252+
# <p>Hello world github/linguist#1 <strong>cool</strong>, and #1!</p>'"
253+
254+
gh-md-toc will insert a TOC between these markers:
255+
256+
```
257+
<!--ts-->
258+
<!--te-->
259+
```
260+
261+
meaning TOC could be handled in the Makefile, but that requires further thought.
262+
263+
* There are also options for doing Markdown TOC in editors such as vim, for example [vim-markdown-toc](https://github.com/mzlogin/vim-markdown-toc)
264+
265+
* Editor.md, referred to in the "Previewing Markdown Before Pushing" section
266+
above, will generate a table of contents where it sees the token `[TOC]` and a
267+
dropdown index TOC menu where it sees `[TOCM]`. However since the output is HTML
268+
not markdown it is not so useful to LumoSQL (but it is very beautiful.)
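For the `<!--ts-->` and `<!--te-->` markers used by gh-md-toc, the manual step of pasting /tmp/toc.md into the document could be replaced with something like the sketch below. It is an illustration only, not what gh-md-toc's `--insert` option does internally, and the name `insert_toc` is hypothetical.

```python
TOC_START = "<!--ts-->"
TOC_END = "<!--te-->"


def insert_toc(document: str, toc: str) -> str:
    """Replace whatever sits between the TOC markers with the given TOC.

    The markers themselves are kept, so the operation can be repeated
    whenever the headings change.
    """
    start = document.find(TOC_START)
    end = document.find(TOC_END, start + len(TOC_START)) if start != -1 else -1
    if start == -1 or end == -1:
        raise ValueError("TOC markers not found")
    head = document[: start + len(TOC_START)]
    tail = document[end:]
    return f"{head}\n{toc.strip()}\n{tail}"
```

Because the markers survive each run, a Makefile target could call this repeatedly without accumulating duplicate tables of contents.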
269+
270+
# Tidying Markdown (mostly not required)
271+
272+
Tidying is about automatically adjusting the whitespace, pagebreaks and general formatting
273+
to be neat and consistent. But maybe you don't even need to, just write tidy
274+
text in the first place.
275+
276+
If you want to clean up someone else's Markdown, then stop and ask first.
277+
Automated cleanups and prettifiers change hundreds of lines in a file without any
278+
effect on the output, and that makes a diff impossible to review, effectively
279+
rebasing it and destroying the history.
280+
281+
The documentation Makefile is not going to include any Markdown tidying because
282+
of the potential for making things worse. As of version 2.0 Pandoc works better
283+
for cleaning up markdown but isn't perfect. Parameters to experiment with include:
284+
285+
```
286+
-t gfm (triggers a few defaults, including headers in ATX style)
287+
--wrap=preserve (mostly limits changes to making headings ATX style)
288+
--columns=85 (stops most links breaking in editors doing syntax highlighting)
289+
```
290+

‎lumo-help-alien-language.md

+27
@@ -0,0 +1,27 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
![](./images/lumo-alien-language-intro.jpg "Confused Red Panda")
9+
10+
All You Need to Know
11+
====================
12+
13+
At the LumoSQL project we are improving something called SQLite that most people
14+
depend on, but are not aware of.
15+
16+
Choose Your Own Adventure
17+
=========================
18+
19+
If you want to be happier, you can see [More Pictures of Cute Red Pandas]()
20+
21+
> Or, you can choose to [Learn a bit about SQLite]()
22+
23+
If you want to
24+
25+
26+
27+

‎lumo-implementation.md

+145
@@ -0,0 +1,145 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [LumoSQL Implementation](#lumosql-implementation)
12+
* [Table of Contents](#table-of-contents)
13+
* [Changes to SQLite](#changes-to-sqlite)
14+
* [Lockfile/tempfile Pushed to Backend](#lockfiletempfile-pushed-to-backend)
15+
* [SQLite API Interception Points](#sqlite-api-interception-points)
16+
* [SQLite Virtual Machine Layer](#sqlite-virtual-machine-layer)
17+
18+
19+
LumoSQL Implementation
20+
======================
21+
22+
![](./images/lumo-implementation-intro.jpg "Metro Station Construction Futian Shenzhen China, CC license, https://www.flickr.com/photos/dcmaster/36740345496")
23+
24+
25+
# Changes to SQLite
26+
27+
## Lockfile/tempfile Pushed to Backend
28+
29+
SQLite API Interception Points
30+
------------------------------
31+
32+
The process LumoSQL has largely completed as of March 2020 is:
33+
34+
1. Identify the correct API choke points to control, then
35+
2. Find useful chunks of code we want to switch between at these choke
36+
points to demonstrate the design.
37+
38+
The API interception points are:
39+
40+
1. Setup APIs/commandline/Pragmas, where we pass in info about what
41+
front/backends we want to use or initialise. Noting that SQLite is
42+
zero-config and so supplying no information to LumoSQL must always be an option.
43+
Nevertheless, if a user wants to select a particular backend, or have
44+
encryption or networking etc there will be some setup. Sqlite.org provides a
45+
large number of controls in pragmas and the commandline already.
46+
47+
2. SQL processing front ends. Code exists (see [Relevant Codebases](./lumo-relevant-codebases.md))
48+
that implements MySQL-like behaviour in parallel with supporting SQLite semantics.
49+
There is a choice of codebases to do that with, covering different approaches to the problem.
50+
51+
3. Transaction interception and handling, which in the case of the LMDB
52+
backend will be pass-through but in other backends may be for replicated
53+
storage, or backup. This interception point would be in ```wal.c``` if all
54+
backends used a writeahead log and used it in a similar way, but they do not.
55+
Instead this is where the new ```backend.c``` API interception point will be
56+
used - see further down in this document. This is where, for example, we can
57+
choose to add replication features to the standard SQLite btree storage
58+
backend.
59+
60+
4. Storage backends, being a choice of native SQLite btree or LMDB today, and
61+
swiftly after that other K-V stores. This is the choke point where we expect to
62+
introduce [libkv](./lumo-relevant-codebases#libkv), or a modification of libkv.
63+
64+
5. Network layers, which will be at all of the above, depending whether they
65+
are for client access to the parser, or replicating transactions, or being
66+
plain remote storage etc.
67+
68+
In most if not all cases it needs to be possible to have multiple choices
69+
active at once, including the obvious cases of multiple parsers and multiple
70+
storage backends, for example. This is because one of the important new use
71+
cases for LumoSQL will be conversion between formats, dialects and protocols.
72+
73+
Having designed the API architecture we can then produce a single LumoSQL tree
74+
with these choke point APIs in place and proof of two things:
75+
76+
1. ability to have stock-standard identical SQLite APIs and on-disk
77+
btree format, and
78+
79+
2. an example of an alternative chunk of code at each choke point:
80+
MySQL; T-pipe writing out the transaction log in a text file; LMDB .
81+
Not necessarily with the full flexibility of having all code active at
82+
once if that's too hard (ie able to take any input SQL and store in
83+
any backend)
84+
85+
and then, having demonstrated we have a major step forward for the entire world,
86+
87+
3. Identify what chunks of SQLite we really don't want to support any more.
88+
Like maybe the ramdisk pragma given that we can/should/might have an
89+
in-memory storage backend, which initially might just be LMDB with overcommit
90+
switched off. This is where testing and benchmarking really matters.
91+
92+
SQLite Virtual Machine Layer
93+
----------------------------
94+
95+
In order to support multiple backends, LumoSQL needs to have a more general way
96+
of matching capabilities to what is available, whether a superset or a subset of
97+
what SQLite currently does. This needs to be done in such a way that it remains
98+
easy to track upstream SQLite.
99+
100+
The SQLite architecture has the SQL virtual machine in the middle of everything:
101+
102+
`vdbeapi.c` has all the functions called by the parser
103+
`vdbe.c` is the implementation of the virtual machine, and it is
104+
from here that calls are made into btree.c
105+
106+
All changes to SQLite storage code will be in vdbe.c , to insert an
107+
API shim layer for arbitrary backends. All BtreeXX function calls will
108+
be replaced with backendXX calls.
109+
110+
`lumo-backend.c` will contain:
111+
112+
* a switch between different backends
113+
* a virtual method table of function calls that can be stacked, for
114+
layering some generic functionality on any backends that need it as
115+
follows
116+
117+
`lumo-index-handler.c` is for backends that need help with index
118+
and/or key handling. For example some cannot have arbitrary-length
119+
keys, like LMDB. RocksDB and others do not suffer from this.
120+
`lumo-transaction-handler.c` is for backends that do not have full
121+
transaction support. RocksDB for example is not MVCC, and this will
122+
add that layer. Similarly this is where we can implement functionality
123+
to upgrade RO transactions to RW with a commit counter.
124+
`lumo-crypto.c` provides encryption services transparently to backends
125+
depending on a decision made in lumo-backend.c, which will cover
126+
everything except backend-specific metadata. Full disk encryption of
127+
everything has to happen at a much lower layer, like SQLite's idea of
128+
a VFS. The VFS concept will not translate entirely, because the very first
129+
alternative backend is based on mmap, which will need special handling. So we are for now expecting to implement a lumo-vfs-mmap.c and a lumo-vfs.c .
130+
`lumo-vfs.c` provides VFS services to backends, and is invoked by
131+
backends. `lumo-vfs.c` may call lumo-crypto for full file encryption
132+
including backend metadata depending on the VFS being implemented.
133+
134+
Backend implementations will be in files such as `backend-lmdb.c`,
135+
`backend-btree.c`, `backend-rocksdb.c` etc.
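As a rough illustration of the backend switch and stackable layers described above, here is a sketch written in Python rather than C for brevity. Every class and function name is hypothetical, the 511-byte key limit is only illustrative, and replacing long keys with a digest discards key ordering, so this shows the layering idea and is not a usable index scheme.

```python
import hashlib
from typing import Dict, Optional, Protocol


class Backend(Protocol):
    """Hypothetical minimal interface standing in for the backendXX calls."""

    def put(self, key: bytes, value: bytes) -> None: ...

    def get(self, key: bytes) -> Optional[bytes]: ...


class MemoryBackend:
    """Toy store with no key-length limit (in the role of backend-btree.c)."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


class ShortKeyBackend(MemoryBackend):
    """Toy store that rejects long keys (in the role of backend-lmdb.c)."""

    MAX_KEY = 511  # illustrative limit only

    def put(self, key: bytes, value: bytes) -> None:
        if len(key) > self.MAX_KEY:
            raise ValueError("key too long for this backend")
        super().put(key, value)


class LongKeyLayer:
    """Stackable layer in the role of lumo-index-handler.c."""

    def __init__(self, inner: Backend, max_key: int) -> None:
        self._inner = inner
        self._max_key = max_key

    def _map(self, key: bytes) -> bytes:
        # A one-byte tag keeps short keys and digests from colliding.
        if len(key) < self._max_key:
            return b"k" + key
        return b"h" + hashlib.sha256(key).digest()

    def put(self, key: bytes, value: bytes) -> None:
        self._inner.put(self._map(key), value)

    def get(self, key: bytes) -> Optional[bytes]:
        return self._inner.get(self._map(key))


def select_backend(name: str) -> Backend:
    """The switch between backends, adding layers only where needed."""
    if name == "btree":
        return MemoryBackend()
    if name == "lmdb":
        return LongKeyLayer(ShortKeyBackend(), ShortKeyBackend.MAX_KEY)
    raise ValueError(f"unknown backend: {name}")
```

The point of the design is that callers only ever see the common interface: a backend that needs help with long keys gets the extra layer stacked on top, and one that does not is returned bare.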
136+
137+
This new architecture means:
138+
139+
1. Features such as WALs or paging or network paging etc are specific to the backend, and invisible to any other LumoSQL or SQLite code.
140+
2. Bug-for-bug compatibility with the original SQLite btree.c can be maintained (except in the case of encryption, which no open source users have access to anyway.)
141+
3. New backends with novel features (and LMDB is novel enough, for a first example!) can be introduced without disturbing other code, and can be benchmarked and tested safely.
142+
143+
144+
145+

‎lumo-landscape.md

+410
Large diffs are not rendered by default.

‎lumo-legal-aspects.md

+177
@@ -0,0 +1,177 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
![](./images/lumo-legal-aspects-intro.png "XXXXXXXX")
8+
9+
Table of Contents
10+
=================
11+
12+
* [Table of Contents](#table-of-contents)
13+
* [LumoSQL Licensing](#lumosql-licensing)
14+
* [Why MIT? Why Not MIT?](#why-mit-why-not-mit)
15+
* [In Detail: Patents, MIT and Apache 2.0](#in-detail-patents-mit-and-apache-20)
16+
* [In Detail: the SQLite Public Domain Licensing Problem](#in-detail-the-sqlite-public-domain-licensing-problem)
17+
* [History and Rationale](#history-and-rationale)
18+
* [Encryption Legal Issues](#encryption-legal-issues)
19+
* [LumoSQL Requirements and Decisions](#lumosql-requirements-and-decisions)
20+
21+
# LumoSQL Licensing
22+
23+
SQLite is released as [Public Domain](https://www.sqlite.org/copyright.html).
24+
In order to both respect and improve on this, the [LumoSQL Project Aims](lumo-project-aims.md) make this promise to SQLite users:
25+
26+
> LumoSQL will not come with legal terms less favourable than SQLite. LumoSQL
27+
> will try to improve the legal standing and safety worldwide as compared to
28+
> SQLite.
29+
30+
To achieve this LumoSQL has made these policy decisions:
31+
32+
* New LumoSQL code is licensed under the [MIT License](https://opensource.org/licenses/MIT), as used by many large corporations worldwide
33+
* LumoSQL documentation is licensed under the [Creative Commons](https://creativecommons.org/licenses/by-sa/4.0/)
34+
* Existing and future SQLite code is relicensed by the act of being distributed under the terms of the MIT license
35+
* Open Source code from elsewhere, such as backend data stores, remain under the terms of the original license except where distribution under MIT effectively relicenses it
36+
* Open Content documentation from elsewhere remains under the terms of the original license. No documentation is used in LumoSQL unless it can be freely mixed with any other documentation.
37+
38+
The effects of these policy decisions are:
39+
40+
* LumoSQL users gain certainty as compared with SQLite users because they have a
41+
license that is recognised in jurisdictions worldwide.
42+
43+
* LumoSQL users do not lose any rights. For example, the MIT license permits use
44+
with fully proprietary software, by anyone. Whatever users do today with
45+
SQLite they can continue to do with LumoSQL.
46+
47+
* While MIT does require users to include a copy of the license and the
48+
copyright notice, the MIT license also permits the user to remove the
49+
sentence requiring this from the license (thus re-licensing LumoSQL.)
50+
51+
# Why MIT? Why Not MIT?
52+
53+
Github's [License Chooser for MIT](https://choosealicense.com/licenses/mit/) describes the MIT as:
54+
55+
> A short and simple permissive license with conditions only requiring
56+
> preservation of copyright and license notices. Licensed works, modifications,
57+
> and larger works may be distributed under different terms and without source
58+
> code.
59+
60+
The MIT license aims to get out of the way of software developers, and despite
61+
some flaws it appears to do so reliably.
62+
63+
In addition, MIT is popular. As documented [on Wikipedia](https://en.wikipedia.org/wiki/MIT_License) MIT appears to be the most-used open source license. Popularity matters, because all licenses are in part a matter of community belief and momentum. Microsoft released
64+
[.NET Core](https://en.wikipedia.org/wiki/.NET_Core) and Facebook released
65+
[React](https://en.wikipedia.org/wiki/React_(web_framework)) under the MIT, and
66+
these companies are very cautious about the validity of the licenses they use.
67+
68+
In a forensic article analysing [the 171 words of the MIT license](https://writing.kemitchell.com/2016/09/21/MIT-License-Line-by-Line.html) as they apply in the US, lawyer Kyle E. Mitchell writes in his conclusion:
69+
70+
> The MIT License is a legal classic. The MIT License works. It is by no means
71+
> a panacea for all software IP ills, in particular the software patent
72+
> scourge, which it predates by decades. But MIT-style licenses have served
73+
> admirably... We’ve seen that despite some crusty verbiage and lawyerly
74+
> affectation, one hundred and seventy one little words can get a hell of a lot
75+
> of legal work done, clearing a path for open-source software through a dense
76+
> underbrush of intellectual property and contract.
77+
78+
Overall, in LumoSQL we have concluded that the MIT license is solid and it is
79+
better than any other mainstream license for existing SQLite users. It is
80+
certainly better than the SQLite Public Domain terms.
81+
82+
# In Detail: Patents, MIT and Apache 2.0
83+
84+
LumoSQL has a narrower range of possible licenses because of its nature as an
85+
embedded library, where it is tightly combined with users' code. This means
86+
that the terms and conditions for using LumoSQL have to be as open as possible
87+
to accommodate all the different legal statuses of software that users combine
88+
with LumoSQL. And the status that worries corporate lawyers the most is
89+
"unknown". What if you aren't completely sure of the patent status of the
90+
software, or the intentions of your company? And where there is uncertainty,
91+
users are wise not to commit.
92+
93+
LumoSQL has tried hard to bring more certainty, not less, and this is tricky when it comes to patents.
94+
95+
Software patents are an issue in many jurisdictions. The MIT license includes a
96+
grant of patents to its users, as [explained on Opensource.com](https://opensource.com/article/18/3/patent-grant-mit-license),
97+
including in the grant "... to deal in the software without restriction." While the
98+
Apache 2.0 license specifically grants patent rights (as do the GPL and MPL), they are not more generous than the MIT license. There is some debate that varies by jurisdiction about exactly how clear the patent grant is, as documented in [the patent section on Wikipedia](https://en.wikipedia.org/wiki/MIT_License#Relation_to_patents).
99+
100+
The difficulty is that the Apache 2.0 (similar to the GPL and MPL) license also
101+
includes a *patent retaliation* clause:
102+
103+
> If You institute patent litigation against any entity (including a
104+
> cross-claim or counterclaim in a lawsuit) alleging that the Work or a
105+
> Contribution incorporated within the Work constitutes direct or contributory
106+
> patent infringement, then any patent licenses granted to You under this
107+
> License for that Work shall terminate as of the date such litigation is
108+
> filed.
109+
110+
The intention is progressive and seemingly a Good Thing - after all, unless you
111+
are a patent troll, who wants more pointless patent litigation? However the
112+
effect is that the Apache 2.0 license brings with it the requirement to check
113+
for patent issues in any code it is connected to. It also is possible that the
114+
company using LumoSQL actually does want the liberty to take software patent
115+
action in court. So whether by the risk or the constraint, Apache 2.0 brings with it
116+
significant change compared to SQLite's license terms in countries that recognise them.
117+
118+
MIT has only a patent grant, not retaliation. That is why LumoSQL does not use the Apache 2.0 license.
119+
120+
121+
# In Detail: the SQLite Public Domain Licensing Problem
122+
123+
There are numerous reasons other than licensing why SQLite is less open source
124+
than it appears, and these are covered in the [LumoSQL Landscape](./lumo-landscape.md). As to licensing, SQLite is distributed as
125+
Public Domain software, and this is mentioned by D Richard Hipp in his [2016 Changelog Podcast Interview](https://changelog.com/podcast/201). Although he is aware of the problems, Hipp has decided not to introduce changes.
126+
127+
The [Open Source Initiative](https://opensource.org/node/878) explains the Public Domain problem like this:
128+
129+
> “Public Domain” means software (or indeed anything else that could be
130+
> copyrighted) that is not restricted by copyright. It may be this way because
131+
> the copyright has expired, or because the person entitled to control the
132+
> copyright has disclaimed that right. Disclaiming copyright is only possible
133+
> in some countries, and copyright expiration happens at different times in
134+
> different jurisdictions (and usually after such a long time as to be
135+
> irrelevant for software). As a consequence, it’s impossible to make a
136+
> globally applicable statement that a certain piece of software is in the
137+
> public domain.
138+
139+
Germany and Australia are examples of countries in which Public Domain is not
140+
normally recognised, which means that legal certainty is not possible for users
141+
in these countries who need it or want it. This is why the Open Source
142+
Initiative does not recommend it, nor does it appear on the [SPDX License List](https://spdx.org/licenses/).
143+
144+
The SPDX License List is a tool used by many organisations to understand where they stand legally with the millions of lines of code they are using. David A Wheeler has produced a helpful [SPDX Tutorial](https://github.com/david-a-wheeler/spdx-tutorial) . All code and documentation developed by the LumoSQL project has an SPDX identifier.
145+
146+
# History and Rationale
147+
148+
SQLite Version 1 used the gdbm key-value store. This was under the GPL and
149+
therefore so was SQLite. gdbm is limited, and is not a binary tree. When
150+
Richard Hipp replaced it for SQLite version 2, he also dropped the GPL. SQLite
151+
has been released as "Public Domain"
152+
153+
154+
# Encryption Legal Issues
155+
156+
SQLite is not available with encryption. There are two common ways of adding encryption to SQLite, both of which have legal implications:
157+
158+
1. Purchasing the [SQLite Encryption Extension](https://www.hwaci.com/sw/sqlite/see.html) (SEE) from Richard Hipp's company Hwaci. The SEE is proprietary software, and cannot be used with open source applications.
159+
2. [SQLcipher](https://www.zetetic.net/sqlcipher/) which has an open core model. The BSD-licensed open source version requires users to publish copyright notices, and the more capable commercial editions are available on similar terms to SEE, and therefore cannot be used with open source applications.
160+
161+
There are many other ways of adding encryption to SQLite, some of which are listed in the [Knowledgebase Relevant to LumoSQL](./lumo-relevant-knowledgebase.md).
162+
163+
The legal issues addressed in LumoSQL encryption include:
164+
165+
* Usability. Encryption should be available with LumoSQL in the core source code without any additional legal considerations.
166+
* Unencumbered. No encryption code is used that may reasonably be subject to action by companies (eg copyright claims) or governments (eg export regulations). Crypto code will be reused from known-safe sources.
167+
* Compliant with minimum requirements in various jurisdictions. With encryption being legally mandated or strongly recommended in many jurisdictions for particular use cases (banking, handling personal data, government data, etc) there are also minimum requirements. LumoSQL will not ship crypto code that fails minimum crypto requirements.
168+
* Conspicuously *non-compliant* with maximum requirements in any jurisdiction. LumoSQL will not limit its encryption mechanisms or strength to comply with any legal restrictions, in common with other critical open source infrastructure. LumoSQL crypto tries to be as hard to break as possible regardless of the use case or jurisdiction.
169+
170+
171+
Local laws
172+
EU laws
173+
Facts of Privacy and security
174+
175+
# LumoSQL Requirements and Decisions
176+
177+

‎lumo-not-forking.md

+301
@@ -0,0 +1,301 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Claudio Calvelli, March 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [Not-Forking Upstream Source Code Tracker](#not-forking-upstream-source-code-tracker)
12+
* [Table of contents](#table-of-contents)
13+
* [Upstream definition file <a name="user-content-upstream"></a>](#upstream-definition-file-)
14+
* [git](#git)
15+
* [download](#download)
16+
* [Modification definition file <a name="user-content-modification"></a>](#modification-definition-file-)
17+
* [Example Configuration directory <a name="user-content-example"></a>](#example-configuration-directory-)
18+
* [Not-forking tool <a name="user-content-tool"></a>](#not-forking-tool-)
19+
20+
Not-Forking Upstream Source Code Tracker
21+
========================================
22+
23+
The LumoSQL project incorporates software from other projects and some of that
24+
software needs some modifications. Rather than fork our own version, we have
25+
developed a mechanism which we call "not-forking" to semi-automatically track
26+
upstream changes.
27+
28+
The mechanism is similar to applying patches; however patches need to be
29+
constantly updated as upstream sources change, and the not-forking mechanism
30+
helps with that. The overall effect is something like git cherry-picking,
31+
except that it also copes with:
32+
* human-style software versioning
33+
* code that is not maintained in the same git repo
34+
* code that is not maintained in git, but is just patches or in some other VCS
35+
* custom processing that needs to be run for a specific patch
36+
* failing with an error asking for human intervention to solve differences with upstream
37+
38+
etc.
39+
40+
Each project tracked by not-forking needs to define what to track, and what
41+
changes to apply. This is done by providing a number of files in a directory;
42+
the minimum requirement is an upstream definition file; other files can also be
43+
present indicating what modifications to apply (if none are provided, the
44+
upstream sources are used unchanged).
45+
46+
# Upstream definition file <a name="upstream"></a>
47+
48+
The file `upstream.conf` has a simple "key = value" format with one such
49+
key, value pair per line: blank lines and lines whose first nonblank
50+
character is a hash (`#`) are ignored; long lines can be split into multiple
51+
lines by ending a line with a backslash meaning continuation into the
52+
next line.
53+
54+
There is a special line format to indicate conditionals; currently, the
55+
only condition which can be tested is whether the version number is in
56+
a specified range, using the syntax:
57+
58+
```
59+
if version [>[=] FIRST_VERSION] [<[=] LAST_VERSION]
60+
...
61+
[else ...]
62+
endif
63+
```
64+
65+
If a key is present more than once, the last value seen wins; therefore,
66+
it is possible to define a key inside a conditional block, and then to
67+
define it again outside the block to provide a default value.
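
The basic format described above can be sketched in a few lines of Python. This is only an illustration, not the parser used by the not-forking tool: the function name `parse_key_value` is hypothetical, conditional blocks (`if version ... endif`) are deliberately left out, and the way whitespace is handled when continuation lines are joined is an assumption.

```python
def parse_key_value(text):
    """Parse "key = value" lines into a dict; the last value seen wins."""
    # Join continuation lines: a trailing backslash continues on the next line.
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            pending += raw[:-1]
            continue
        logical_lines.append(pending + raw)
        pending = ""
    if pending:
        logical_lines.append(pending)

    settings = {}
    for line in logical_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment line
        key, sep, value = stripped.partition("=")
        if not sep:
            raise ValueError(f"not a key = value line: {line!r}")
        settings[key.strip()] = value.strip()  # later definitions override
    return settings
```

Because later definitions simply overwrite earlier ones in the dictionary, a key defined twice ends up with its last value, which is the behaviour the text relies on.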
68+
69+
The only key which must be present is `vcs`, and there is no default.
70+
It indicates what kind of version control system to use to obtain upstream
71+
sources; the value is the name of a version control module defined by the
72+
not-forking mechanism; at the time of writing `git` and `download` are valid
73+
values; in general, the documentation for the corresponding version control
74+
module defines what else is present in the `upstream.conf` file; this document
75+
describes briefly the configuration for the above two modules.
76+
77+
Optionally, two other keys can be present: `compare` and `subtree`.
78+
79+
The `compare` key indicates what method to use to compare two different
80+
version numbers; if omitted, it defaults to `version`, which compares
81+
"normal" software version numbers: sequences of digits compare
82+
numerically, and sequences of letters compare alphabetically, with the
83+
exception that a suffix "-alpha" or "-beta" causes the version to be
84+
considered before the string without such suffix: examples of version
85+
numbers in order are:
86+
87+
- `0.9a` < `0.9z` < `0.10` < `1.0` < `1.1-alpha` < `1.1-beta` < `1.1` < `1.1a`
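
One way to express this ordering is as a sort key, sketched below in Python. The function name `version_key` and the rule that a numeric component sorts before an alphabetic one at the same position are assumptions made for illustration; the text above does not specify that case, and the real `version` comparison may differ.

```python
import re

# Pre-release suffixes sort before the same version without a suffix.
_PRE_RELEASE = {"-alpha": 0, "-beta": 1}


def version_key(version):
    """Return a sort key: digit runs compare numerically, letter runs alphabetically."""
    rank = 2  # no pre-release suffix
    for suffix, suffix_rank in _PRE_RELEASE.items():
        if version.endswith(suffix):
            version = version[: -len(suffix)]
            rank = suffix_rank
            break
    tokens = []
    for digits, letters in re.findall(r"(\d+)|([A-Za-z]+)", version):
        if digits:
            tokens.append((0, int(digits), ""))  # numeric component
        else:
            tokens.append((1, 0, letters))  # alphabetic component
    return (tuple(tokens), rank)


ordered = ["0.9a", "0.9z", "0.10", "1.0", "1.1-alpha", "1.1-beta", "1.1", "1.1a"]
shuffled = list(reversed(ordered))
print(sorted(shuffled, key=version_key) == ordered)
```

The key compares the tokenised version first and the suffix rank second, so `1.1-alpha` and `1.1-beta` land before `1.1`, while `1.1a` (which has an extra component) lands after it.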
88+
89+
This definition will even cope with the numbering scheme used by TeX and
90+
METAFONT, whose version numbers converge on π and e respectively. The definition can be extended to
91+
deal with version numbering schemes used by normal software, however it will
92+
never work correctly with the version numbers used by some software such as the
93+
[CLC-INTERCAL](https://en.wikipedia.org/wiki/INTERCAL#Version_Numbers)
94+
compiler.
95+
96+
The `subtree` key indicates a directory inside the sources to use instead
97+
of the top level.
98+
99+
## git
100+
101+
The upstream sources are available via a public git repository; the following
102+
keys need to be present:
103+
104+
- `repos` (or `repository`) is a valid argument to the `git clone` command.
105+
- optionally, `branch` to select a branch within the repository.
106+
- optionally, `version` to convert a version string to a tag: the value is
107+
either a single string which is prefixed to the version number, or two
108+
strings separated by space, the first one is prefixed and the second appended.
109+
- optionally, `user` and `password` can be specified to obtain access to the
110+
repository (this is currently not implemented, all repositories must be
111+
accessible without authentication).
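
The `version` key described above can be illustrated with a short Python sketch. The function name `tag_for_version` and the example values are hypothetical; only the prefix/suffix rule is taken from the text.

```python
def tag_for_version(version_setting, version):
    """Build a git tag name from the `version` setting and a version number."""
    parts = version_setting.split()
    if len(parts) == 1:
        prefix, suffix = parts[0], ""  # single string: prefix only
    elif len(parts) == 2:
        prefix, suffix = parts  # two strings: prefix and suffix
    else:
        raise ValueError("expected one or two strings separated by space")
    return f"{prefix}{version}{suffix}"


print(tag_for_version("version-", "3.31.1"))
print(tag_for_version("v -final", "1.2"))
```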
112+
113+
A software version can be identified by a generic git commit ID, or by a
114+
version string similar to the one described for the `compare` key, if the
115+
repository offers that as an option.
116+
117+
## download
118+
119+
The upstream sources are released as published versions and downloaded
120+
directly; the following keys need to be present:
121+
122+
- `uri` indicates where to obtain these sources, and can contain the special
123+
symbol `%V` to indicate the version or `%%` to indicate just a percentage
124+
sign (`%`)
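
As a rough illustration of the `%V` and `%%` substitutions, the following Python sketch expands a `uri` value for a given version. The function name `expand_uri` and the example URL are hypothetical, and rejecting any other `%` sequence is an assumption; the text does not say how the tool treats them.

```python
import re


def expand_uri(template, version):
    """Replace %V with the version and %% with a literal percent sign."""

    def substitute(match):
        code = match.group(1)
        if code == "V":
            return version
        if code == "%":
            return "%"
        # Assumption: anything else is treated as an error in this sketch.
        raise ValueError(f"unsupported sequence %{code}")

    return re.sub(r"%(.)", substitute, template)


print(expand_uri("https://example.org/src/pkg-%V.tar.gz", "1.2.3"))
```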
125+
126+
TBC - we also need to say how to unpack the sources etc
127+
128+
# Modification definition file <a name="modification"></a>
129+
130+
There can be zero or more modification definition files in the configuration
131+
directory; each file has a name ending in `.mod` and they are processed
132+
in lexicographic order according to the "C" locale (rather than the current
133+
locale, to guarantee consistent ordering). Note that only files are
134+
considered; if the configuration directory contains subdirectories, these
135+
are ignored, but files in there can be referenced by the `.mod` files.
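
The selection and ordering rule above can be sketched in Python as follows. This is an illustration rather than the tool's own code: the function name `modification_files` is hypothetical, and sorting by the encoded bytes of each file name is used here as a stand-in for "C" locale ordering.

```python
import os
from pathlib import Path


def modification_files(config_dir):
    """Return the .mod files directly inside config_dir, in byte order."""
    entries = [
        path
        for path in Path(config_dir).iterdir()
        if path.is_file() and path.name.endswith(".mod")  # skip subdirectories
    ]
    # Comparing raw bytes does not depend on the current locale.
    return sorted(entries, key=lambda path: os.fsencode(path.name))
```

With byte ordering, a file named `B.mod` is processed before `a.mod`, whereas many locale-aware sorts would put `a.mod` first.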
136+
137+
The contents of each modification definition file are an initial part with
138+
format similar to the Upstream definition file described above ("key = value"
139+
pair, possibly with conditional blocks); this initial part ends with a line
140+
containing just dashes and the rest of the file, referred to as "final
141+
part", is interpreted based on information from the initial part.
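
Splitting a modification definition file into its two parts might look like the Python sketch below. The function name `split_mod_file` is hypothetical, and treating a file without any dash-only line as consisting entirely of the initial part is an assumption.

```python
def split_mod_file(text):
    """Split a .mod file into (initial_part_lines, final_part_lines)."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        # The first line made up only of dashes ends the initial part.
        if stripped and set(stripped) == {"-"}:
            return lines[:index], lines[index + 1:]
    return lines, []  # assumption: no separator means no final part


initial, final = split_mod_file(
    "method = replace\n--\nsrc/btree.c = files/btree.c\n"
)
print(initial, final)
```

Lines such as `--- old/file.c` inside a patch contain other characters besides dashes, so they are not mistaken for the separator.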
142+
143+
The following keys are currently understood:
144+
145+
- `version`: the value has the same format as the condition on the
146+
`if version` specification in the Upstream definition file: one or two
147+
strings separated by whitespace, one of the strings starting with `<`
148+
or `<=` and the other starting with `>` or `>=` to indicate a maximum,
149+
minimum or range of versions. One use of this key is to indicate that
150+
a modification is only necessary up to a particular version, because
151+
for example that modification has been accepted by upstream and is
152+
no longer necessary. Another use of this key is to identify versions
153+
in which substantial upstream changes make it difficult to specify a
154+
modification which works for every possible version. Specifying this
155+
keyword is essentially equivalent to putting the whole `.mod` file in
156+
a conditional.
157+
- `method`: the method used to specify the modification; currently, the
158+
value can be either `patch`, indicating that the final part of the file is
159+
in a format suitable for passing as standard input to the "patch" program;
160+
or `replace` indicating that one or more files in the upstream must be
161+
completely replaced; the final part of the file contains one or more
162+
lines with format "old-file = new-file", where both are relative paths,
163+
the first relative to the root of the extracted upstream sources; the
164+
second path is relative to the configuration directory.
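
For the `replace` method, the final part can be turned into pairs of paths roughly as in this Python sketch. The function name `resolve_replacements` and the directory names in the example are hypothetical; splitting on the first `=` is an assumption.

```python
from pathlib import PurePosixPath


def resolve_replacements(final_part_lines, upstream_root, config_dir):
    """Map "old-file = new-file" lines to (upstream path, replacement path) pairs."""
    pairs = []
    for line in final_part_lines:
        if not line.strip():
            continue  # ignore blank lines
        old, sep, new = line.partition("=")
        if not sep:
            raise ValueError(f"not an old-file = new-file line: {line!r}")
        pairs.append(
            (
                PurePosixPath(upstream_root) / old.strip(),  # relative to sources
                PurePosixPath(config_dir) / new.strip(),  # relative to config dir
            )
        )
    return pairs


for target, source in resolve_replacements(
    ["src/btree.c = files/btree.c"], "sources/sqlite", "not-fork.d/sqlite"
):
    print(f"{source} replaces {target}")
```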
165+
166+
Other keys are interpreted depending on the value of `method`; there are
167+
currently no other keys for the `replace` method, and the following for
168+
the `patch` method:
169+
170+
- `options`: options to pass to the "patch" program (default: "-Nsp1")
171+
- `list`: extra options to the "patch" program to list what it would do
172+
instead of actually doing it (this is used internally to figure out
173+
what changes; the default currently assumes the "patch" program provided
174+
by most Linux distributions)
175+
176+
# Example Configuration directory <a name="example"></a>
177+
178+
Obtaining SQLite sources and replacing btree.c and btreeInt.h with the ones
179+
from sqlightning, and applying a patch to vdbeaux.c:
180+
181+
File `upstream.conf`:
182+
183+
```
184+
vcs = git
185+
repos = https://github.com/sqlite/sqlite.git
186+
```
187+
188+
File `btree.mod`:
189+
190+
```
191+
method = replace
192+
--
193+
src/btree.c = files/btree.c
194+
src/btreeInt.h = files/btreeInt.h
195+
```
196+
197+
File `vdbeaux.mod`:
198+
```
199+
method = patch
200+
--
201+
--- sqlite-git/src/vdbeaux.c 2020-02-17 19:53:07.030886721 +0100
202+
+++ new/src/vdbeaux.c 2020-03-21 13:52:24.861586555 +0100
203+
@@ -2778,7 +2778,7 @@
204+
for(i=0; i<db->nDb; i++){
205+
Btree *pBt = db->aDb[i].pBt;
206+
if( sqlite3BtreeIsInTrans(pBt) ){
207+
- char const *zFile = sqlite3BtreeGetJournalname(pBt);
208+
+ char const *zFile = BackendGetJournal(pBt);
209+
if( zFile==0 ){
210+
continue; /* Ignore TEMP and :memory: databases */
211+
}
212+
```
213+
214+
Files `files/btree.c` and `files/btreeInt.h`: the new contents.
215+
216+
A more complete example can be found in the directory "not-fork.d/sqlite"
217+
which tracks upstream updates from SQLite.
218+
219+
# Not-forking tool <a name="tool"></a>
220+
221+
The `tool` directory contains a script, `not-fork`, which runs the not-forking
222+
mechanism on a directory. Usage is:
223+
224+
not-fork \[OPTIONS\] \[NAME\]...
225+
226+
where the following options are available:
227+
228+
- `-i`INPUT\_DIRECTORY (or `--input=`INPUT\_DIRECTORY)
229+
is a not-forking configuration directory as specified
230+
in this document; default is `not-fork.d` within the current directory
231+
- `-o`OUTPUT\_DIRECTORY (or `--output=`OUTPUT\_DIRECTORY)
232+
is the place where the modified upstream sources will
233+
be stored, and it can be either a directory created by a previous run of
234+
this tool, or a new directory (missing or empty directory); default is
235+
`sources` within the current directory; note that existing sources in
236+
this directory may be overwritten or deleted by the tool
237+
- `-c`CACHE\_DIRECTORY (or `--cache=`CACHE\_DIRECTORY)
238+
is a place used by the program to keep downloads
239+
and working copies; it must be either a new (missing or empty) directory
240+
or a directory created by a previous run of the tool; default is
241+
`.cache/LumoSQL/not-fork` inside the user's home directory
242+
- `-v`VERSION (or `--version=`VERSION) will retrieve the specified VERSION
243+
of the next NAME (this option must be repeated for each NAME, in the
244+
assumption that different projects have different version numbering)
245+
- `-c`COMMIT\_ID (or `--commit=`COMMIT\_ID) is similar to `-v` but
246+
only works for version control modules which support commit identifiers,
247+
and will retrieve the corresponding commit for the next NAME, whether
248+
or not it has an official version number; this is incompatible with `-v`
249+
- `-q` (or `--query`) completes all necessary downloads but does not
250+
extract the sources and apply modifications, instead it shows some
251+
information about what has been downloaded, including a version number
252+
if available.
253+
254+
If neither VERSION nor COMMIT\_ID is specified, the default is the latest
255+
available version, if it can be determined, or else an error message.
256+
If more than one NAME is specified, VERSION and COMMIT\_ID need to
257+
be provided before each NAME: the assumption is that different
258+
software projects use different version numbers.
259+
260+
If one or more NAMEs are specified, the tool will obtain the upstream
261+
sources as described in INPUT\_DIRECTORY/NAME for each of the NAMEs
262+
specified, and attempt to apply all the required modifications; if that
263+
succeeds, OUTPUT\_DIRECTORY/NAME will contain the modified sources ready
264+
to use; if that fails, an error message will explain the problem and if
265+
possible suggest corrective action (for example, if `patch` determines
266+
that a file has changed so much that it cannot figure out how to apply
267+
a patch supplied, the error message will indicate this and suggest
268+
obtaining a new patch for that version of the sources).
269+
270+
If no NAMEs are specified, the tool will process all subdirectories
271+
of INPUT\_DIRECTORY. In this special case, any VERSION or COMMIT\_ID
272+
specified will apply to all rather than just the name immediately
273+
following them.
274+
275+
The tool looks for a configuration file located at
276+
`$HOME/.config/LumoSQL/not-fork.conf` to read defaults; if the file exists
277+
and is readable, any non-comment, non-empty lines are processed before
278+
any command-line options with an implicit `--` prepended and with spaces
279+
around the first `=` removed, if present: so for example a file containing:
280+
281+
```
282+
cache = /var/cache/LumoSQL/not-fork
283+
```
284+
285+
would change the default cache from `.cache/LumoSQL/not-fork` in the user's
286+
home directory to the above directory inside `/var/cache`; it can still
287+
be overridden by specifying `-c`/`--cache` on the command line.
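
The way defaults from this file become command-line options can be sketched in Python as below. This is an illustration under assumptions: the function name `config_to_arguments` is hypothetical, and lines starting with `#` are assumed to be the comment lines mentioned above.

```python
def config_to_arguments(text):
    """Turn configuration lines into long command-line options."""
    arguments = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty line or (assumed) comment
        key, sep, value = stripped.partition("=")
        if sep:
            # Remove the spaces around the first "=" and prepend "--".
            arguments.append(f"--{key.strip()}={value.strip()}")
        else:
            arguments.append(f"--{stripped}")
    return arguments


print(config_to_arguments("cache = /var/cache/LumoSQL/not-fork\n"))
```

Since these options are processed before the real command line, anything given explicitly on the command line is seen later and takes precedence.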
288+
289+
The program will refuse to overwrite the output directory if it cannot
290+
determine that it has been created by a previous run and that files have
291+
not been modified since; in this case, delete the output directory
292+
completely, or rename it to something else, and run the program again.
293+
There is currently no option to override this safety feature.
294+
295+
We plan to add logging to the not-forking tool, in which all messages are
296+
written to a log file (under control of configuration), while the subset
297+
of messages selected by the verbosity setting will go to standard output;
298+
this will allow us to increase the amount of information provided and make
299+
it available if there is a processing error; however in the current version
300+
this is just planned, and not yet implemented.
301+

‎lumo-project-aims.md

+117
@@ -0,0 +1,117 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [Overall Objective of LumoSQL](#overall-objective-of-lumosql)
12+
* [Table of Contents](#table-of-contents)
13+
* [Aims](#aims)
14+
* [Short Term Goals](#short-term-goals)
15+
16+
17+
![](./images/lumo-project-aims-intro.jpg "Mongolian horseback archery, rights request pending from https://www.toursmongolia.com/")
18+
19+
Overall Objective of LumoSQL
20+
============================
21+
22+
To create a Privacy-compliant Open Source Database Platform with Modern Design and Benchmarking,
23+
usable either embedded or online.
24+
25+
This is the guide for every aspect of the project, which will ensure that
26+
LumoSQL offers features that money can't buy and draws together an
27+
SQLite-related ecosystem.
28+
29+
The rest of this document will be updated frequently in 2020, and over time
30+
will become more strategic, with less listing of specific new features.
31+
32+
Aims
33+
====
34+
35+
* SQLite upstream promise: LumoSQL will not fork SQLite, and will offer 100%
36+
compatibility with SQLite by default, and contribute to SQLite where possible.
37+
This especially includes the SQLite user interface mechanisms of pragmas,
38+
library APIs, and commandline parameters.
39+
40+
* Legal promise: LumoSQL will not come with legal terms less favourable than
41+
SQLite. LumoSQL will try to improve the legal standing and safety worldwide
42+
as compared to SQLite.
43+
44+
* Developer contract: LumoSQL will have stable APIs ([Application Programming Interfaces](https://en.wikipedia.org/wiki/Application_programming_interface#Libraries_and_frameworks)) for features found in multiple unrelated SQLite downstream projects:
45+
backends, frontends, encryption, networking and more.
46+
47+
* Devops contract: LumoSQL will reduce risk by making it possible to omit
48+
compilation of many features, and will have stable ABIs ([Application Binary Interfaces](https://en.wikipedia.org/wiki/Application_binary_interface)) so as to not break dynamically-linked applications.
49+
50+
* Ecosystem creation: LumoSQL will offer consolidated contact, code curation, bug tracking,
51+
licensing, and community communications across all these features from
52+
other projects. Bringing together SQLite code contributions under one umbrella reduces
53+
technical risk in many ways, from inconsistent use of threads to tracking updated versions.
54+
55+
56+
Short Term Goals
57+
================
58+
59+
* LumoSQL will have three canonical and initial backends: btree (the existing
60+
SQLite btree, ported to a new backend system); a test backend such as text or
61+
csv; and the LMDB backend. Control over these interfaces will be through the
62+
same user interface mechanisms as the rest of LumoSQL, and SQLite.
63+
64+
* LumoSQL will improve SQLite quality and privacy compliance by introducing
65+
optional on-disk checksums for storage backends including to the original
66+
SQLite btree format. This will give real-time row-level corruption detection.
67+
68+
* LumoSQL will improve SQLite quality and privacy compliance by introducing
69+
optional storage backends that are more crash-resistant, starting with LMDB
70+
followed by others.
71+
72+
* LumoSQL will improve SQLite integrity in persistent storage by introducing
73+
optional row-level checksums.
74+
75+
* LumoSQL will provide the benefits of Open Source and an open project
76+
by continuing to accept and review contributions in an open way, using
77+
github and having diverse contributors, and being careful to use open
78+
source licenses
79+
80+
* LumoSQL will improve SQLite design by intercepting APIs at a very small
81+
number of critical choke-points, and giving the user optional choices at
82+
these choke points. The choices will be for alternative storage backends,
83+
front end parsers, encryption, networking and more, all without removing
84+
the zero-config and embedded advantages of SQLite
85+
86+
* LumoSQL will provide a means of tracking upstream SQLite, by making
87+
sure that anything other than the API chokepoints can be synched at each
88+
release, or more often if need be
89+
90+
* LumoSQL will provide updated, public testing tools, with results published
91+
and instructions for reproducing the test results. This also means
92+
excluding parts of the LumoSQL test suite that don't apply to new backends
93+
94+
* LumoSQL will provide benchmarking tools, otherwise as per the testing
95+
tools
96+
97+
* LumoSQL will ensure that new code remains optional by means of modularity at
98+
compiletime and also runtime. By illustration of modularity, at compiletime
99+
nearly all 30 million lines of the Linux kernel can be excluded, leaving just 200k
100+
lines. Runtime modularity will be controlled through the same user interfaces
101+
as the rest of LumoSQL.
102+
103+
* LumoSQL will ensure that new code can all be active at once, eg
104+
multiple backends or frontends for conversion between/upgrading from one
105+
format or protocol to another. This is crucial to provide continuity and
106+
supported upgrade paths for users, for example, users who want to become
107+
privacy-compliant without disrupting their end users
108+
109+
* Over time, LumoSQL will carefully consider the potential benefits of dropping
110+
some of the most ancient parts of SQLite when merging from upstream, provided
111+
it does not conflict with any of the other goals in this document. Eliminating
112+
SQLite code can be done by a not-forking mechanism similar to the one used to keep in synch
113+
with the SQLite upstream.
114+
115+
116+
117+

‎lumo-quickstart.md

+253
@@ -0,0 +1,253 @@
1+
<!-- SPDX-License-Identifier: AGPL-3.0-only -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors, 2019 Oracle -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2020 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [About LumoSQL](#about-lumosql)
12+
* [About the LumoSQL Project](#about-the-lumosql-project)
13+
* [LumoSQL Interfaces Are Almost the Same as SQLite](#lumosql-interfaces-are-almost-the-same-as-sqlite)
14+
* [Building and Installing LumoSQL](#building-and-installing-lumosql)
15+
* [Directory layout](#directory-layout)
16+
* [Linux/Unix](#linuxunix)
17+
* [Build environment](#build-environment)
18+
* [Using the Makefile tool](#using-the-makefile-tool)
19+
* [Running LumoSQL](#running-lumosql)
20+
* [Windows](#windows)
21+
* [Android](#android)
22+
* [Speed tests / benchmarking](#speed-tests--benchmarking)
23+
* [Which LMDB version?](#which-lmdb-version)
24+
* [References](#references)
25+
26+
27+
About LumoSQL
28+
=============
29+
30+
LumoSQL is a combination of two embedded data storage C language libraries:
31+
[SQLite](https://sqlite.org) and [LMDB](https://github.com/LMDB/lmdb). LumoSQL
32+
is an updated version of Howard Chu's 2013
33+
[proof of concept](https://github.com/LMDB/sqlightning) combining the codebases.
34+
Howard's LMDB library has become a ubiquitous replacement for
35+
[bdb](https://sleepycat.com/) on the basis of performance, reliability, and
36+
license so the 2013 claims of it greatly increasing the performance of SQLite
37+
seemed credible. D Richard Hipp's SQLite is used in thousands of software
38+
projects, and since three of them are Google's Android, Mozilla's Firefox and
39+
Apple's iOS, an improved version of SQLite will benefit billions of people.
40+
41+
About the LumoSQL Project
42+
=========================
43+
44+
LumoSQL was started in December 2019 by Dan Shearer, who did the original source
45+
tree archaeology, patching and test builds. Keith Maxwell joined shortly after
46+
and contributed version management to the Makefile and the benchmarking tools.
47+
48+
A main goal of the LumoSQL Project is to create and maintain an improved version of
49+
SQLite without forking it, although there are other goals as well.
50+
51+
LumoSQL is supported by the [NLNet Foundation](https://nlnet.nl).
52+
53+
If you are interested in contributing to LumoSQL please see [CONTRIBUTING](/CONTRIBUTING.md).
54+
55+
56+
57+
LumoSQL Interfaces Are Almost the Same as SQLite
58+
================================================
59+
60+
Your interaction with the LumoSQL interface (commandline, PRAGMAs and API) is
61+
almost identical to SQLite. You use the same APIs, the same command shell
62+
environment, the same SQL statements, and the same PRAGMAs to work with the
63+
database created by LumoSQL as you would if you were using SQLite.
64+
65+
To learn how to use SQLite, see the [SQLite Documentation](https://sqlite.org/docs.html).
66+
67+
That said, there are a few small differences between the two interfaces.
68+
69+
# Building and Installing LumoSQL
70+
71+
## Directory layout
72+
73+
In order to build LumoSQL and SQLite and to use different versions of the LMDB
74+
library, we use the following directory layout:
75+
76+
```
77+
.
78+
├── bld-LMDB_?.?.? Build artifacts for LumoSQL (src and src-lmdb)
79+
├── bld-SQLite-?.?.? Build artifacts for sqlite (src-sqlite)
80+
├── LICENSES License files, in line with https://reuse.software/spec/
81+
├── lmdb-backend C source code to use SQLite with an LMDB backend
82+
├── src-lmdb Clone of LMDB source code
83+
├── src-sqlite Clone of sqlite.org git mirror
84+
└── tool Cut down version of speedtest.tcl
85+
```
86+
87+
## Linux/Unix
88+
89+
90+
### Build environment
91+
92+
On Ubuntu 18.04 LTS, Debian Stable (buster), and on any reasonably recent
93+
Debian or Ubuntu-derived distribution, you need only:
94+
95+
```sh
96+
sudo apt install git build-essential tcl
97+
sudo apt build-dep sqlite3
98+
```
99+
100+
(`apt build-dep` requires `deb-src` lines uncommented in /etc/apt/sources.list).
101+
102+
On Fedora 30, and on any reasonably recent Fedora-derived distribution:
103+
104+
```sh
105+
sudo dnf install --assumeyes \
106+
git make gcc ncurses-devel readline-devel glibc-devel autoconf tcl-devel
107+
```
108+
109+
The maintainers test building LumoSQL on Debian, Fedora, Gentoo and Ubuntu.
110+
Container images with the dependencies installed are available at
111+
<https://quay.io/repository/keith_maxwell/lumosql-build> and the build steps are
112+
in <https://github.com/maxwell-k/containers>.
113+
114+
### Using the Makefile tool
115+
116+
Start with a clone of this repository as the current directory:
117+
118+
```git clone https://github.com/LumoSQL/LumoSQL.git```
119+
120+
To build either (a) specific versions of SQLite or (b) sqlightning using
121+
different versions of LMDB, use commands like those below changing the version
122+
numbers to suit. A list of tested version numbers is in the table
123+
[below](#which-lmdb-version).
124+
125+
```sh
126+
make bld-SQLite-3.7.17
127+
make bld-LMDB_0.9.9
128+
```
129+
# Running LumoSQL
130+
131+
Libraries and a command line shell are built with the following names:
132+
133+
```lumosql```
134+
135+
This is the command line shell. It operates identically to the SQLite sqlite3 shell.
136+
137+
```liblumosql```
138+
139+
This is the library that provides the LumoSQL SQL interface. It is the equivalent of the SQLite libsqlite3 library.
140+
141+
## Windows
142+
143+
LumoSQL is not supported on Windows as of March 2020. We are aiming for May 2020. Want to help?
144+
145+
## Android
146+
147+
LumoSQL is not supported on Android as of March 2020. We are aiming for July 2020. Want to help?
148+
149+
# Speed tests / benchmarking
150+
151+
Benchmarking a single binary takes approximately 4 minutes, depending
152+
on hardware.
153+
154+
The instructions in this section explain how to benchmark four different
155+
versions:
156+
157+
| V. | SQLite | LMDB | Repository | Report filename |
158+
| --- | ------ | ------ | ---------- | ------------------ |
159+
| A. | 3.7.17 | - | SQLite | SQLite-3.7.17.html |
160+
| B. | 3.30.1 | - | SQLite | SQLite-3.30.1.html |
161+
| C. | 3.7.17 | 0.9.9 | LumoSQL | LMDB_0.9.9.html |
162+
| D. | 3.7.17 | 0.9.16 | LumoSQL | LMDB_0.9.16.html |
163+
164+
To benchmark the four versions above use:
165+
166+
```sh
167+
make benchmark
168+
```
169+
170+
The "Repository" column means:
171+
172+
<dl>
173+
<dt>SQLite</dt>
174+
<dd>
175+
176+
<https://github.com/sqlite/sqlite>
177+
178+
</dd>
179+
<dt>LumoSQL</dt>
180+
<dd>
181+
182+
<https://github.com/LumoSQL/LumoSQL> (this repository)
183+
184+
</dd>
185+
</dl>
186+
187+
# Which LMDB version?
188+
189+
`mc_orig` was removed and `mc_backup` added to `mdb.c` in
190+
<https://github.com/LMDB/lmdb/commit/be47ca766713f55e5b3abd18120514fdad7d90f2>
191+
first released in `LMDB_0.9.7` on 14 August 2013. `LMDB_0.9.8` was 9 September
192+
2013 and `LMDB_0.9.9` was 24 October 2013.
193+
<https://github.com/LMDB/sqlightning/commit/58b473f3d5570fca94b88398e0e4314208a077cd>
194+
adapted `sqlightning` to this change on 12 September 2013. So first try
195+
`LMDB_0.9.8`, but this fails with:
196+
`sqlite3.c:38156:2: error: unknown type name ‘mdb_hash_t’`.
197+
198+
This likely needs
199+
[this commit](https://github.com/LMDB/lmdb/commit/01dfb2083dd690707a062cabb03801bfad1a6859),
200+
found through a
201+
[GitHub comparison](https://github.com/LMDB/lmdb/compare/LMDB_0.9.8...LMDB_0.9.9).
202+
203+
| Tag | Date | Compiles | Speed test | Files | Ins. | De. |
204+
| ----------- | ---------- | -------- | ---------- | ----: | ---: | --: |
205+
| LMDB_0.9.8 | 2013-09-09 || - | - | - | - |
206+
| LMDB_0.9.9 | 2013-10-24 ||| 6 | 577 | 540 |
207+
| LMDB_0.9.10 | 2013-11-12 ||| 5 | 216 | 121 |
208+
| LMDB_0.9.11 | 2014-01-15 ||| 6 | 443 | 273 |
209+
| LMDB_0.9.12 | 2014-06-18 ||| 12 | 516 | 333 |
210+
| LMDB_0.9.13 | 2014-06-18 ||| 3 | 28 | 22 |
211+
| LMDB_0.9.14 | 2014-09-20 ||| 23 | 2331 | 441 |
212+
| LMDB_0.9.15 | 2015-06-19 ||| 24 | 388 | 187 |
213+
| LMDB_0.9.16 | 2015-08-14 ||| 5 | 44 | 19 |
214+
| LMDB_0.9.17 | 2015-11-30 ||| 10 | 1072 | 565 |
215+
| LMDB_0.9.18 | 2016-02-05 ||| 24 | 303 | 57 |
216+
| LMDB_0.9.19 | 2016-12-28 ||| 6 | 684 | 447 |
217+
| LMDB_0.9.21 | 2017-06-01 ||| 23 | 81 | 50 |
218+
| LMDB_0.9.22 | 2018-03-22 ||| 23 | 74 | 58 |
219+
| LMDB_0.9.23 | 2018-12-19 ||| 4 | 52 | 9 |
220+
| LMDB_0.9.24 | 2019-07-19 ||| 6 | 16 | 11 |
221+
222+
The [GitHub LMDB mirror](https://github.com/LMDB/lmdb/releases) does not include
223+
a release `LMDB_0.9.20`; releases before 0.9.8 are not shown.
224+
225+
<dl>
226+
<dt>Compiles</dt>
227+
<dd>✓ means the process documented above completes successfully.</dd>
228+
<dt>Speed test</dt>
229+
<dd>✓ means the cut down version of the speed test in `./tool/speedtest.tcl`
230+
passes.</dd>
231+
<dt>Files</dt>
232+
<dd>The number of files changed between the previous release and this one, as
233+
reported by <code>git diff --shortstat</code>.</dd>
234+
<dt>Ins.</dt>
235+
<dd>The number of insertions as for the "Files" column.</dd>
236+
<dt>De.</dt>
237+
<dd>The number of deletions as for the "Files" column.</dd>
238+
</dl>
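
The "Files", "Ins." and "De." figures can be read from the summary line printed by `git diff --shortstat`, for example when comparing two release tags. The Python sketch below parses such a line; the function name `parse_shortstat` is hypothetical, and the sample line is reconstructed from the `LMDB_0.9.9` row of the table rather than captured from a real run.

```python
import re


def parse_shortstat(line):
    """Extract file, insertion and deletion counts from a --shortstat line."""

    def count(pattern):
        match = re.search(pattern, line)
        return int(match.group(1)) if match else 0  # part omitted means zero

    return {
        "files": count(r"(\d+) files? changed"),
        "insertions": count(r"(\d+) insertions?\(\+\)"),
        "deletions": count(r"(\d+) deletions?\(-\)"),
    }


sample = " 6 files changed, 577 insertions(+), 540 deletions(-)"
print(parse_shortstat(sample))
```

git leaves out the insertions or deletions part when its count is zero, which is why missing parts are reported as 0 here.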
239+
240+
A **?** means that this has not been tested, and a **-** means that it is not
241+
applicable at present.
242+
243+
# References
244+
245+
- The
246+
[Fedora Spec file for "sqlite3"](https://apps.fedoraproject.org/packages/sqlite/sources/)
247+
lists dependencies.
248+
- The [documentation](https://sqlite.org/whynotgit.html#getthecode) linking to
249+
the [official SQLite GitHub mirror](https://github.com/sqlite/sqlite)
250+
- ["sqlightning" repository](https://github.com/LMDB/sqlightning)
251+
- Early benchmarking of SQLite 3.7.17 by Howard Chu: <https://pastebin.com/B5SfEieL>
252+
- Benchmarking
253+
<https://github.com/google/leveldb/blob/master/benchmarks/db_bench_sqlite3.cc>

‎lumo-relevant-codebases.md

+169
Large diffs are not rendered by default.

‎lumo-relevant-knowledgebase.md

+122
@@ -0,0 +1,122 @@
1+
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
2+
<!-- SPDX-FileCopyrightText: 2020 The LumoSQL Authors -->
3+
<!-- SPDX-ArtifactOfProjectName: LumoSQL -->
4+
<!-- SPDX-FileType: Documentation -->
5+
<!-- SPDX-FileComment: Original by Dan Shearer, 2019 -->
6+
7+
8+
Table of Contents
9+
=================
10+
11+
* [Knowledge Relevant to LumoSQL](#knowledge-relevant-to-lumosql)
12+
* [List of SQLite Code-related Knowledge](#list-of-sqlite-code-related-knowledge)
13+
* [List of On-disk File Format-related Knowledge](#list-of-on-disk-file-format-related-knowledge)
14+
* [List of Relevant Benchmarking and Test Knowledge](#list-of-relevant-benchmarking-and-test-knowledge)
15+
* [List of Just a Few SQLite Encryption Projects](#list-of-just-a-few-sqlite-encryption-projects)
16+
* [List of from-scratch MySQL SQL and MySQL Server implementations](#list-of-from-scratch-mysql-sql-and-mysql-server-implementations)
17+
18+
Knowledge Relevant to LumoSQL
19+
=============================
20+
21+
LumoSQL has many antecedents and relevant codebases. This document is intended
22+
to be a terse list of published source code for reference of LumoSQL
23+
developers. Although it is stored with the rest of the LumoSQL documentation
24+
and referred to throughout, it is a standalone document.
25+
26+
Everything listed here is open source, except for software produced by
27+
sqlite.org or the commercial arm hwaci.com. There are many closed-source
28+
products that extend and reuse SQLite in various ways, none of which have been
29+
considered by the LumoSQL project.
30+
31+
# List of SQLite Code-related Knowledge
32+
33+
SQLite code has been incorporated into many other projects, and there are also many other relevant key-value stores and libraries.
34+
35+
| Project | Last modified | Description |
36+
| ------------- | ------------- | --------|
37+
| [sqlightning](https://github.com/LMDB/sqlightning) | 2013 | SQLite ported to the LMDB key-value store |
38+
| [Original MDB Paper](https://www.openldap.org/pub/hyc/mdb-paper.pdf) | 2012 | Paper by Howard Chu describing the motivations, design and constraints of the LMDB key-value store |
39+
| [SQLHeavy](https://github.com/btrask/sqlheavy) | 2016 | sqlightning updated, and ported to LevelDB, LMDB, RocksDB and more, with a key-value store library abstraction |
40+
| [libkvstore](https://github.com/btrask/libkvstore) | 2016 | The k-v store abstraction library used by SQLHeavy |
41+
| [SQLite 4](https://sqlite.org/src4/tree?ci=trunk) | 2014 | Abandoned new version of SQLite with improved backend support and other features |
42+
| [Sleepycat/Oracle BDB](https://fossies.org/linux/misc/db-18.1.32.tar.gz) | current | The original ubiquitous Unix K-V store, disused in open source since Oracle's 2013 license change; the API template for most of the k-v btree stores around. Now includes many additional features including full MVCC transactions, networking and replication. This link is a mirror of code from download.oracle.com, which requires a login |
43+
| [Sleepycat/Oracle BDB-SQL](https://fossies.org/linux/misc/db-18.1.32.tar.gz) | current | Port of SQLite to the Sleepycat/Oracle transactional bdb K-V store. As of 5th March 2020 this mirror is identical to Oracle's login-protected tarball for db 18.1.32 |
44+
| [rqlite](https://github.com/rqlite/rqlite) | current | Distributed database with networking and Raft consensus on top of SQLite nodes |
45+
| [Bedrock](https://github.com/Expensify/Bedrock) | current | WAN-replicated blockchain multimaster database built on SQLite. Has MySQL emulation |
46+
| [sql.js](https://github.com/kripken/sql.js/) | current | SQLite compiled to JavaScript WebAssembly through Emscripten |
47+
| [ActorDB](https://github.com/biokoda/actordb) | current | SQLite with a data sharding/distribution system across clustered nodes. Each node stores data in LMDB, which is connected to SQLite at the SQLite WAL layer |
48+
| [WAL-G](https://github.com/wal-g/wal-g) | current | Backup/replication tool that intercepts the WAL journal log for each of Postgres, MySQL, MongoDB and Redis |
49+
| [sqlite3odbc](https://github.com/gdev2018/sqlite3odbc) | current | ODBC driver for SQLite by [Christian Werner](http://www.ch-werner.de/sqliteodbc/) as used by many projects including LibreOffice |
50+
| [Spatialite](https://www.gaia-gis.it/fossil/libspatialite/index)| current | Geospatial GIS extension to SQLite, similar to PostGIS |
51+
| [Gigimushroom's Database Backend Engine](https://github.com/gigimushroom/DatabaseBackendEngine)|2019| A good example of an alternative BTree storage engine implemented using SQLite's Virtual Table Interface. This approach is not what LumoSQL has chosen for many reasons, but this code demonstrates virtual tables can work, and also that storage engines implemented as virtual tables can be ported to be LumoSQL backends.|
52+
53+
# List of On-disk SQLite Format-related Knowledge
54+
55+
The on-disk file format is important to many SQLite use cases, and introspection tools are both important and rare. Other K-V stores also have third-party on-disk introspection tools. There are advantages to having investigative tools that do not use the original/canonical source code to read and write these databases. The SQLite file format is promoted as being a stable, backwards-compatible transport (recommended by the Library of Congress as an archive format) but it also has significant drawbacks as discussed elsewhere in the LumoSQL documentation.
56+
57+
| Project | Last modified | Description |
58+
| ------- | ------------- | ----------- |
59+
| [A standardized corpus for SQLite database forensics](https://www.sciencedirect.com/science/article/pii/S1742287618300471) | current | Sample SQLite databases and evaluations of 5 tools that do extraction and recovery from SQLite, including Undark and SQLite Deleted Records Parser |
60+
| [FastoNoSQL](https://github.com/fastogt/fastonosql) | current | GUI inspector and management tool for on-disk databases including LMDB and LevelDB |
61+
| [Undark](https://github.com/inflex/undark) | 2016 | SQLite deleted and corrupted data recovery tool |
62+
| [SQLite Deleted Records Parser](https://github.com/mdegrazia/SQLite-Deleted-Records-Parser) | 2015 | Script to recover deleted entries in an SQLite database |
63+
| [lua-mdb](https://github.com/catwell/cw-lua/tree/master/lua-mdb) | 2016 | Parse and investigate LMDB file format |
64+
65+
(The forensics and data recovery industry has many tools that diagnose SQLite
66+
database files. Some are open source but many are not. A list of tools commonly
67+
cited by forensics practitioners, none of which LumoSQL has downloaded or tried,
68+
is: Belkasoft Evidence Center, BlackBag BlackLight, Cellebrite UFED Physical
69+
Analyser, DB Browser for SQLite, Magnet AXIOM and Oxygen Forensic Detective.)
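
As an illustration of introspection that does not rely on the canonical SQLite code, the following minimal Python sketch reads a few fields from the 100-byte database header directly from the file. The field offsets are taken from the published SQLite file format description; the function name `read_sqlite_header` and the choice of fields are assumptions made for this example, not part of any LumoSQL tool.

```python
import os
import sqlite3
import struct
import tempfile

# Every SQLite 3 database file starts with this 16-byte magic string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def read_sqlite_header(path):
    """Parse a few fields of the 100-byte SQLite header (hypothetical helper)."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100 or header[:16] != SQLITE_MAGIC:
        raise ValueError("not an SQLite 3 database file")
    # Multi-byte header fields are stored big-endian.
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:
        page_size = 65536  # the value 1 encodes a 65536-byte page
    change_counter, page_count = struct.unpack(">II", header[24:32])
    return {
        "page_size": page_size,
        "write_version": header[18],  # 1 = rollback journal, 2 = WAL
        "read_version": header[19],
        "change_counter": change_counter,
        "page_count": page_count,
    }


if __name__ == "__main__":
    # Build a throwaway database with the sqlite3 module, then inspect it
    # without going through SQLite at all.
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "sample.db")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE t (x)")
        conn.commit()
        conn.close()
        print(read_sqlite_header(db_path))
```

A reader like this only touches the header, so it cannot recover deleted records the way the tools in the table above do, but it shows how little code is needed to start examining the format independently.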
70+
71+
# List of Relevant SQL Checksumming-related Knowledge
72+
73+
| Project | Last modified | Description |
74+
| ------- | ------------- | ----------- |
75+
| [eXtended Keccak Code Package](https://github.com/XKCP/XKCP) | current | Code from https://keccak.team for very fast peer-reviewed hashing |
76+
| [SQL code for Per-table Multi-database Solution](https://www.periscopedata.com/blog/hashing-tables-to-ensure-consistency-in-postgres-redshift-and-mysql) | 2014 | Periscope's SQL row hashing solution for Postgres, Redshift and MySQL |
77+
| [SQL code for Public Key Row Tracking](https://www.percona.com/blog/2018/10/12/track-postgresql-row-changes-using-public-private-key-signing/) | 2018 | Percona's SQL row integrity solution for PostgreSQL using public key crypto |
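
The general idea behind row hashing can be sketched in a few lines of Python. This is an illustrative example only, not the method used by any project in the table above and not a LumoSQL design: it hashes each row with SHA3-256 from the standard `hashlib` module (SHA-3 is the standardised member of the Keccak family) and folds the row hashes into one digest per table. The function name `table_digest` is hypothetical, and the sketch assumes an ordinary rowid table.

```python
import hashlib
import sqlite3


def table_digest(conn, table):
    """Return a hex digest covering every row of `table`, in rowid order."""
    digest = hashlib.sha3_256()
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    for row in conn.execute(f"SELECT * FROM {quoted} ORDER BY rowid"):
        # repr() keeps types distinct, e.g. the integer 1 versus the text '1'.
        row_hash = hashlib.sha3_256(repr(row).encode("utf-8")).digest()
        digest.update(row_hash)
    return digest.hexdigest()


def make_db(rows):
    """Create an in-memory database holding `rows` (demo helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    a = make_db([(1, "alpha"), (2, "beta")])
    b = make_db([(1, "alpha"), (2, "beta")])
    c = make_db([(1, "alpha"), (2, "BETA")])
    print("identical tables match:", table_digest(a, "t") == table_digest(b, "t"))
    print("modified table differs:", table_digest(a, "t") != table_digest(c, "t"))
```

Comparing one digest per table is cheap, but it only says that two copies differ; locating the differing row requires keeping the per-row hashes as well.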
78+
79+
# List of Relevant Benchmarking and Test Knowledge
80+
81+
Benchmarking is a big part of LumoSQL, to determine if changes are an improvement. The trouble is that SQLite and other top databases are not really benchmarked in a realistic and consistent way, despite SQL server benchmarking using tools like TPC being an obsessive industry in itself, and despite there being a myriad of testing tools released with SQLite, PostgreSQL, MariaDB etc. In practical terms there is no way of comparing the most-used databases with each other, or even of being sure that the tests that do exist are in any way realistic, or even of simply reproducing results that other people have found. LumoSQL covers so many codebases and use cases that better SQL benchmarking is a project requirement. Benchmarking and testing overlap, which is addressed in the code and docs.
82+
83+
The well-described [testing of SQLite](https://sqlite.org/testing.html) involves some open code, some closed code, and many ad hoc processes. Clearly the SQLite team have an internal culture of testing that has benefitted the world. However that is very different to reproducible testing, which is in turn very different to reproducible benchmarking, and that is even without considering whether the benchmarking is a reasonable approximation of actual use cases.
84+
85+
To highlight how poorly SQL benchmarking is done: there are virtually no test harnesses that cover encrypted databases and/or encrypted database connections, despite encryption being frequently required, and despite crypto implementation decisions making a very big difference in performance.
86+
87+
| Project | Last modified | Description |
88+
| ------- | ------------- | ----------- |
89+
| [Dangers and complexity of sqlite3 benchmarking](https://www.cs.utexas.edu/~vijay/papers/apsys17-sqlite.pdf)| n/a | Helpful 2017 paper: "...changing just one parameter in SQLite can change the performance by 11.8X... up to 28X difference in performance" |
90+
| [sqllogictest](https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki)|2017 | [sqlite.org code](https://www.sqlite.org/sqllogictest/artifact/2c354f3d44da6356) to [compare the results](https://gerardnico.com/data/type/relation/sql/test) of many SQL statements between multiple SQL servers, either SQLite or an ODBC-supporting server |
91+
| [TCL SQLite tests](https://github.com/sqlite/sqlite/tree/master/test)|current| These are a mixture of code coverage tests, unit tests and test coverage. Actively maintained. |
92+
| [Yahoo Cloud Serving Benchmark](https://github.com/brianfrankcooper/YCSB/)| current | Benchmarking tool for K-V stores and cloud-accessible databases |
93+
| [Example Android Storage Benchmark](https://github.com/greenrobot/android-database-performance) | 2018 | This code is an example of the very many Android benchmarking/testing tools. This needs further investigation |
94+
| [Sysbench](https://github.com/akopytov/sysbench) | current | A multithreaded generic benchmarking tool, with one well-supported use case being networked SQL servers, and [MySQL in particular](https://www.percona.com/blog/2019/04/25/creating-custom-sysbench-scripts/) |
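
The sensitivity of SQLite results to a single setting, as reported in the 2017 paper listed above, is easy to observe informally. The following Python sketch is not one of the benchmarks in the table and is not the LumoSQL benchmarking code; it simply times the same single-row insert workload under different `PRAGMA synchronous` and `PRAGMA journal_mode` values. The function name `time_inserts`, the row count and the chosen settings are assumptions made for this illustration, and the timings depend heavily on hardware and filesystem.

```python
import os
import sqlite3
import tempfile
import time


def time_inserts(synchronous, journal_mode, rows=200):
    """Time `rows` single-row transactions under the given PRAGMA settings."""
    with tempfile.TemporaryDirectory() as tmp:
        # isolation_level=None puts the connection in autocommit mode, so
        # every INSERT below is committed as its own transaction.
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"), isolation_level=None)
        conn.execute(f"PRAGMA journal_mode={journal_mode}")
        conn.execute(f"PRAGMA synchronous={synchronous}")
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
        start = time.perf_counter()
        for i in range(rows):
            conn.execute("INSERT INTO t (payload) VALUES (?)", (f"row {i}",))
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        conn.close()
    return elapsed, count


if __name__ == "__main__":
    # Same workload, different durability settings.
    for sync, journal in [("FULL", "DELETE"), ("NORMAL", "WAL"), ("OFF", "MEMORY")]:
        elapsed, count = time_inserts(sync, journal)
        print(f"synchronous={sync:<6} journal_mode={journal:<6} "
              f"{count} rows in {elapsed:.4f}s")
```

A single run like this is exactly the kind of unrepeatable measurement the paragraph above warns about; it is useful only to show that the settings must be recorded alongside any published number.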
95+
96+
97+
# List of Just a Few SQLite Encryption Projects
98+
99+
Encryption is a major problem for SQLite users looking for open code. There are no official implementations in open source, although the APIs are documented (seemingly by an SCM mistake years ago (?), see sqlite3-dbx below) and most solutions use the SQLite extension interface. This means that there are many mutually-incompatible implementations, several of them seeming to be very popular. None appear to have received encryption certification (?) and none seem to publish test results to reassure users about compatibility with SQLite upstream or with the file format. Besides the closed source solution from sqlite.org, there are also at least three other closed source options not listed here. This choice between either closed source or fragmented solutions is a poor security approach from the point of view of maintenance as well as peer-reviewed security. This means that SQLite in 2020 does not have a good approach to privacy.
100+
101+
| Project | Last modified | Description |
102+
| ------- | ------------- | ----------- |
103+
| [SQLite Encryption Extension](https://www.sqlite.org/see/doc/release/www/readme.wiki) (SEE) | current | Info about the proprietary, closed source official SQLite crypto solution, illustrating that there is little to be compatible with in the wider SQLite landscape. This is a standalone product. The API is published and used by some open source code. |
104+
| [SQLCipher](https://github.com/sqlcipher/sqlcipher) | current | Adds at-rest encryption to SQLite [at the pager level](https://www.zetetic.net/sqlcipher/design/), using OpenSSL (the default) or optionally other providers. Uses an open core licensing model, and the less-capable open source version is BSD licensed with a requirement that users publish copyright notices. Uses the SEE API. |
105+
| [sqleet](https://github.com/resilar/sqleet) | current | Implements ChaCha20-Poly1305 encryption with PBKDF2-HMAC-SHA256 key derivation, also at the pager level. Public Domain (not Open Source, similar to SQLite) |
106+
| [sqlite3-dbx](https://github.com/newsoft/sqlite3-dbx) | kinda-current | Accidentally-published but unretracted code on sqlite.org fully documents crypto APIs used by SEE |
107+
| [SQLite3-Encryption](https://github.com/darkman66/SQLite3-Encryption) | current | No crypto libraries (DIY crypto!) and based on the similar-sounding SQLite3-with-Encryption project |
108+
| [wxSqlite3](https://github.com/utelle/wxsqlite3/) | current | wxWidgets C++ wrapper, that also implements SEE-equivalent crypto. Licensed under the LGPL |
109+
110+
... there are many more crypto projects for SQLite.
111+
112+
# List of from-scratch MySQL SQL and MySQL Server implementations
113+
114+
If we want to make SQLite able to process MySQL queries there is a lot of existing code in this area to consider. There are at least 80 projects on GitHub which implement some or all of the MySQL network-parse-optimise-execute SQL pathway, and a few of them implement all of it. None so far reviewed used MySQL or MariaDB code to do so. Perhaps that is because the SQL processing code alone in these databases is many times bigger than the whole of SQLite, and it isn't even clear how to add them to this table if we wanted to. Only a few of these projects put a MySQL frontend on SQLite, but two well-maintained projects do, showing us two ways of implementing this.
115+
116+
| Project | Last modified | Description |
117+
| ------- | ------------- | ----------- |
118+
| [Bedrock](https://github.com/Expensify/Bedrock) | current | The MySQL compatibility seems to be popular and is actively supported but it is also small. It speaks the MySQL/MariaDB protocol accurately but doesn't seem to try very hard to match MySQL SQL language semantics and extensions, rather relying on the fact that SQLite substantially overlaps with MySQL. |
119+
| [TiDB](https://github.com/pingcap/tidb/) | current | Distributed database with MySQL emulation as the primary dialect and referred to throughout the code, with frequent detailed bugfixes on deviations from MySQL SQL language behaviour. |
120+
| [phpMyAdmin parser](https://github.com/phpmyadmin/sql-parser) | current | A very complete parser for MySQL code, demonstrating that completeness is not the unrealistic goal some claim it to be |
121+
| [Go MySQL Server](https://github.com/src-d/go-mysql-server) | current | A MySQL server written in Go that executes queries but mostly leaves the backend for the user to implement. Intended to put a compliant MySQL server on top of arbitrary backend sources. |
122+
| [ClickHouse MySQL Frontend](https://github.com/ClickHouse/ClickHouse/tree/146109fe27074229a38cd704d60f23ec7bd2ed67/base/mysqlxx) | current | Yandex's [ClickHouse](https://clickhouse.tech/) has a MySQL frontend.|
