How Dolt Types Work

10 min read

UPDATED FEBRUARY 10, 2021: Updated the final table with the types that have been added to Dolt since the article was first written.

When we started on Dolt, our goal was to apply Git’s idea of versioning to data. Whereas Git versions files, Dolt versions tables. As the creator of a database, you have the choice of how to represent any value. We chose to adopt the idea of having a schema for each of our tables, meaning that any one column has a definition of the type of data that it represents, and all cells under that column adhere to that type.

As an example, let’s take a look at the value 1, and let’s store it in a column we’ll refer to as val.

val
1

If someone were to retrieve the value, what would it represent? Would it represent a count? Would it instead represent a true/false value? Perhaps it represents a character encoded in the ASCII format to denote the start of a message header? It is ambiguous. This could be fixed by using a more descriptive column name.

Number Of Books
1
IV
1101

This now highlights another problem, in that we can add any values that we want, so we inserted the values IV (which is 4 in Roman numerals) and 1101 (which is 13 in binary). Each of these are considered valid values in the column by its name alone. By having a type, the database can verify that the value given conforms to the type, converting if need be, or rejecting if the value is not understood to map to the given type.

Number Of Books int32
1
4
13

This type information is even more handy in a database built for sharing like Dolt. Have you ever tried importing a CSV with N/A in a numerical column to represent NULL? We have. It’s more work than it should be. Moreover, the type information of a column communicates the schema creator’s intent. It’s like documentation that is built-in.

First there was Noms#

We initially built Dolt on top of Noms. Although we have made significant changes along the way, their implementation of types was our starting point. In Noms, each value is stored on disk as an array of bytes, and those bytes are prefixed with an integer.

Kind                    Kind
  ↓  Data→                ↓  Data→
-------------------------------------------------
| 1 |   |   |   |   |   | 3 |   |   |   |   |   |
-------------------------------------------------

This integer represents the kind, which is Noms’ way of stating the type that those bytes represent, and also how to interpret those bytes. Noms has a general-purpose type system, but for our purpose of collaborating on tabled data, we started using the three scalar types:

TypeDescription
boolrepresents either true or false
numberIEEE-754 double precision number
stringan array of characters

Each type in Noms has a few properties, such as defining equality, sortability, how to encode and decode from a byte array, and also how to hash the value. When we began working on Dolt, we added our own types into the mix, while also renaming number to float64.

TypeDescription
boolrepresents either true or false
float64IEEE-754 double precision number
int64integers from -9223372036854775808 to 9223372036854775807
stringan array of characters
uint64integers from 0 to 18446744073709551615
uuid128-bit value as defined by RFC 4122

And then there was SQL#

This was sufficient for a few months, until we began to add SQL support. SQL allows one to use the data in many exciting different ways, and also works as a standard and well-known interface to tabled data. We specifically chose the MySQL variant to target first due to its popularity, and with it came a lot more types that we did not support.

SQL TypeSQL AliasesDescription
BIGINTintegers from -9223372036854775808 to 9223372036854775807
BIGINT UNSIGNEDintegers from 0 to 18446744073709551615
BINARYan array of bytes, up to a maximum of 255
BITdepending on bit depth, holds integers from 0 to 18446744073709551615
BLOBan array of bytes, up to a maximum of 65535
CHARCHARACTERan array of character, up to a maximum of 255
DATEdates from 1000-01-01 to 9999-12-31
DATETIMEdates and times from 1000-01-01 00:00:00 to 9999-12-31 23:59:59
DECIMALFIXED, DEC, NUMERICnumber with a specifiable number of digits, both total and fractional
DOUBLEDOUBLE PRECISION, REALIEEE-754 double precision number
ENUMdefinable collection of allowed strings, with a maximum of 65535 entries
FLOATIEEE-754 single precision number
GEOMETRYa supertype that can hold a point, linestring, or polygon
GEOMETRYCOLLECTIONa collection of any geometry-type object
INTINTEGERintegers from -2147483648 to 2147483647
INT UNSIGNEDINTEGER UNSIGNEDintegers from 0 to 4294967295
JSONrepresents the JavaScript Object Notation format for a string
LINESTRINGa collection of points forming a line
LONGBLOBan array of bytes, up to a maximum of 4294967295
LONGTEXTan array of characters, up to a maximum of 4294967295
MEDIUMBLOBan array of bytes, up to a maximum of 16777215
MEDIUMINTintegers from -8388608 to 8388607
MEDIUMINT UNSIGNEDintegers from 0 to 16777215
MEDIUMTEXTLONG, LONG VARCHARan array of characters, up to a maximum of 4194303 (for utf8 encoding)
MULTILINESTRINGa collection of linestrings
MULTIPOINTa collection of points
MULTIPOLYGONa collection of polygons
NCHARNATIONAL CHARshortcut for appending CHARACTER SET utf8mb3 to CHAR
NVARCHARNATIONAL CHARACTER VARYINGshortcut for appending CHARACTER SET utf8mb3 to VARCHAR
POINTrepresents a point in space
POLYGONa collection of points forming a shape
SETdefinable collection of allowed strings, with a max of 64 entries, and allows combinations
SMALLINTintegers from -32768 to 32767
SMALLINT UNSIGNEDintegers from 0 to 65535
TEXTan array of characters, up to a maximum of 16383 (for utf8 encoding)
TIMEtimes from -838:59:59 to 838:59:59
TIMESTAMPdates and times from 1970-01-01 00:00:01 to 2038-01-19 03:14:07
TINYBLOBan array of bytes, up to a maximum of 255
TINYINTBOOLEAN, BOOLintegers from -128 to 127
TINYINT UNSIGNEDintegers from 0 to 255
TINYTEXTan array of characters, up to a maximum of 63 (for utf8 encoding)
VARBINARYan array of bytes, up to a maximum of 65535
VARCHARCHARACTER VARYINGan array of characters, up to a maximum of 16383 (for utf8 encoding)
YEARintegers from 1901 to 2155, along with 0

Our first solution was to map any types that are more restrictive to their less restrictive variants, such as a TINYINT (int8) to a BIGINT (int64), and store them as our closest Noms-equivalent type. Although the user lost a bit of type information, it would still allow them to export their data from MySQL and import it into Dolt. This solution did not work for other types, such as DATETIME, as there was no clear analog in Dolt, and thus we had to add two more types to our Noms-based backend: timestamp and inlineblob.

TypeDescription
boolrepresents either true or false
float64IEEE-754 double precision number
inlinebloban array of bytes, up to a maximum of 65535 bytes
int64integers from -9223372036854775808 to 9223372036854775807
stringan array of characters
timestampdates and times from -200000000-01-01 00:00:00 UTC to 200000000-12-31 23:59:59 UTC
uint64integers from 0 to 18446744073709551615
uuid128-bit value as defined by RFC 4122

This allowed us to map to many of the common types seen in MySQL databases.

SQL TypeNoms Type
BIGINTint64
BIGINT UNSIGNEDuint64
BINARY
BIT
BLOB
CHARstring
DATEtimestamp
DATETIMEtimestamp
DECIMAL
DOUBLEfloat64
ENUM
FLOATfloat64
GEOMETRY
GEOMETRYCOLLECTION
INTint64
INT UNSIGNEDuint64
JSON
LINESTRING
LONGBLOB
LONGTEXTstring
MEDIUMBLOB
MEDIUMINTint64
MEDIUMINT UNSIGNEDuint64
MEDIUMTEXTstring
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHAR
NVARCHAR
POINT
POLYGON
SET
SMALLINTint64
SMALLINT UNSIGNEDint64
TEXTstring
TIME
TIMESTAMPtimestamp
TINYBLOB
TINYINTint64
TINYINT UNSIGNEDuint64
TINYTEXTstring
VARBINARY
VARCHARstring
YEAR
bool
inlineblob
uuid

Of note, both bool and uuid do not have an equivalent in SQL, and thus are inaccessible as a backing Noms type when creating a table through SQL, since no types map to them. Although inlineblob would satisfy some of the binary/blob types, it would not work for LONGBLOB, so its integration with SQL types was postponed.

Enter go-mysql-server#

As work on the SQL engine continued, we began running into the issue that our Noms types worked a bit differently than MySQL’s types under some circumstances, and thus we worked to achieve complete parity with their types. Our goal was to be a fully drop-in replacement for MySQL, and that included all of the idiosyncrasies that may not be immediately apparent. This way, one could round-trip data between MySQL and Dolt without any surprises or loss of data. It would also allow anyone to gain the versioning capabilities of Dolt while keeping the workflow that has already been established with their MySQL integration.

We adopted a SQL engine built on top of our fork of the go-mysql-server project. In order to achieve full type compliance, most of the type code had to be rewritten. We referenced MySQL’s documentation, however we found a lot of little interactions that were not documented. For example, assigning the string 1 to an integer column (BIGINT, etc.) would parse the string as though it contained an integer. However, assigning the string 1 to a BIT column would parse the string as though it were an array of utf8-encoded characters, which it would then convert to bytes, giving a value of 49 in this example. Considering how similarly BIT behaves relative to the other integer types, this difference was not assumed to have existed until it was manually found.

We have a test framework for testing correctness, however, the focus of the correctness tests is less about types and more on query execution and features. This meant that a lot of manual testing with a MySQL instance was used to better understand how each type worked, writing tests for the observed behavior to guarantee our implementation’s parity.

Putting it all together in Dolt#

Once the SQL engine handled types properly, it was on to the Dolt side. In Dolt, a schema’s type was just a reference to the Noms type, and the decision was made to give Dolt its own concept of types, and for those types to handle how SQL and Noms types relate to Dolt. This way, Dolt did not need to map SQL types directly to similar Noms types, and could freely convert between the two representations. This allowed for many SQL types, such as SET, to map to a completely different Noms type, such as uint64, as the Dolt type could handle the uint64 as a bit-array, and map each bit to a SET entry. This also had another benefit: any Dolt types that do not exist in SQL now have an interface through which SQL can interact with them, retaining compatibility with old Dolt databases. For example, uuid does not have a parallel in SQL, and thus appears as a CHAR(36) when accessed through SQL, however still retains all of the type checking necessary for that type. In the future, if we decide to extend our SQL types to support uuids, then it’s as simple as changing the displayed SQL type, without changing anything in Noms. For the majority of SQL types, the Dolt type is simply a conversion from the SQL value to the closest Noms value that can uniquely represent that value, and vice-versa.

SQL TypeDolt TypeNoms Type
BIGINTnumber[i64]int64
BIGINT UNSIGNEDnumber[u64]uint64
BINARYinlineblob[bin, length]inlineblob
BITbit[depth]uint64
BLOBvarbinary[65535]blob
CHARvarstring[char]string
DATEdatetime[date]timestamp
DATETIMEdatetime[dt]timestamp
DECIMALdecimal[prec, scale]decimal
DOUBLEnumber[f64]float64
ENUMenum[entries]int64
FLOATnumber[f32]float64
GEOMETRY
GEOMETRYCOLLECTION
INTnumber[i32]int64
INT UNSIGNEDnumber[u32]uint64
JSON
LINESTRING
LONGBLOBvarbinary[4294967295]blob
LONGTEXTvarstring[text, 4294967295]string
MEDIUMBLOBvarbinary[16777215]blob
MEDIUMINTnumber[i24]int64
MEDIUMINT UNSIGNEDnumber[u24]uint64
MEDIUMTEXTvarstring[text, enc_length]string
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHARvarstring[char, length]string
NVARCHARvarstring[var, length]string
POINT
POLYGON
SETset[entries]uint64
SMALLINTnumber[i16]int64
SMALLINT UNSIGNEDnumber[u16]int64
TEXTvarstring[text, enc_length]string
TIMEtimeint64
TIMESTAMPdatetime[ts]timestamp
TINYBLOBvarbinary[255]blob
TINYINTnumber[i8]int64
TINYINT UNSIGNEDnumber[u8]uint64
TINYTEXTvarstring[text, enc_length]string
VARBINARYinlineblob[var, length]inlineblob
VARCHARvarstring[var, length]string
YEARyearint64

For a SMALLINT (int16), Dolt has a number type that converts it to an int64 for Noms to handle, and back to an int16 when retrieving the value, using the SQL engine to verify the value before storage. This way, a Dolt type does not need to worry about how to hash a value or perform any of the other operations, as the underlying Noms value already does that.

What’s still missing?#

This has allowed us to add the majority of types from MySQL, however there are still a few missing, such as: JSON, binary types, and GEOMETRY (including all subtypes). We also save character sets and collations on all relevant types, however they are currently ignored. We are working hard to bring you these missing types, so check our release notes on GitHub to stay updated! Better yet, if you need a new type, implement it and send us a pull request.

Conclusion#

Types are an integral part of a database meant for sharing data. They communicate intention, while also enforcing homogeneity to the data. It is imperative that we get types right, and our journey through building Dolt has landed us with a type implementation that we feel is stable, extensible, and ready for the challenges that our users will demand.

If you enjoyed reading this blog, why not checkout Dolt, or browse our repositories here on DoltHub?

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.