Three Ways to Import Data into Dolt


Dolt is the first database that versions data like Git versions code. We focused on a few key areas for improving Dolt this year: 1) sharpening MySQL feature parity, 2) smoothing the first-hour experience, and 3) chasing MySQL’s performance. Today we will talk about importing data into Dolt, which for some users is both a performance bottleneck and a negative first-hour experience. This blog discusses what we learned building a new system for measuring import performance, and shares improved intuitions about which imports are fast, which are slow, and why.

We start with a survey of different import options before digging into the results of our benchmarks.

Bulk Loading Data into Dolt#

An “import” loads data from an external format into a database. We bulk insert data when restoring databases from dumps, converting a MySQL database to a Dolt database, converting a set of CSV files into a database, or simply adding a large set of changes into a preexisting database.

There are three ways to achieve the same result: SQL dumps (i.e. dolt sql < dump.sql), dolt table import, and LOAD DATA INFILE.

SQL Dump#

The first way to import data simply runs a SQL script:

$ dolt sql < dump.sql

dolt sql runs every statement in dump.sql when the file is fed through standard input. The queries in a SQL script do not have to be inserts, but an import dump.sql will look something like this:

create table xy (
    x int primary key,
    y varchar(32)
);
insert into xy values
    (1, '1'),
    (2, '2'),
    (3, '3'),
    (4, '4');
insert into xy values
    (5, '5'),
    (6, '6'),
    (7, '7'),
    (8, '8');

A dump defines tables and then inserts data into them. Statements are separated by semicolons for parsing, and by newlines for readability.

All together, the import looks something like this:

$ dolt init
$ dolt sql < dump.sql
Query OK, 4 rows affected (0.00 sec)
Query OK, 4 rows affected (0.00 sec)
$ dolt sql -q "select * from xy"
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+

If you prefer strings rather than files, the batch (-b) flag uses a slightly different syntax for the same effect:

$ dolt sql -b -q "
insert into xy values (9, '9');
insert into xy values (10, '10');
"

Table import#

The second way to bulk load data uses a Dolt command line tool: dolt table import. I usually divide this path into two steps. First, I create the table schema:

dolt sql -q "create table xy (x int primary key, y varchar(32));"

dolt table import can also infer schemas, but I prefer the SQL-native format for setting a schema’s columns, primary key, and any indexes.
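
If you skip the manual CREATE TABLE and let dolt table import infer the schema, the create flag takes the data file directly. A minimal sketch, assuming the data.csv shown below; --pk names the primary key column, and dolt schema show lets you verify what was inferred:

$ dolt table import -c --pk x xy data.csv
$ dolt schema show xy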

The second step either creates (-c), updates (-u), or replaces (-r) the table. All three have the same result on our empty table, but a replace is repeatable on a non-empty table:

$ cat data.csv
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
$ dolt table import -r xy data.csv
Rows Processed: 8, Additions: 8, Modifications: 0, Had No Effect: 0
Import completed successfully.
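
For later loads into the now non-empty table, an update import merges new rows instead of replacing the table. A sketch, assuming a hypothetical second file data2.csv with the same x,y header:

$ dolt table import -u xy data2.csv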

LOAD DATA INFILE#

The previous methods import data using Dolt’s command line interface. The final method requires a running server and a client connection. The LOAD DATA command works in both Dolt and MySQL provided the server has been configured to allow it, which isn’t the default in either database (see docs for details).

First, we start a server in a new directory that holds our data file:

$ mkdir mydb
$ cd mydb
$ cat <<EOF > data.csv
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
EOF
$ dolt sql-server
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="info"|S="/tmp/mysql.sock"

Next, we connect to this database, whitelisting the local directory /Users/max-hoffman/mydb for access when establishing the client connection:

$ mysql -u root -h localhost --port 3306 --load-data-local-dir /Users/max-hoffman/mydb mydb

Inside the SQL shell, we will reconfigure the server to permit LOAD DATA INFILE commands, which requires root authorization:

mysql> SET GLOBAL local_infile=1;
Query OK, 1 row affected (0.00 sec)
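
To confirm the change took effect before importing, a quick check works in both Dolt and MySQL (it should return 1):

mysql> SELECT @@GLOBAL.local_infile;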

Finally, we can create our table and import data.csv:

mysql> create table xy (x int primary key, y varchar(32));
Query OK, 0 rows affected (0.02 sec)

mysql> LOAD DATA LOCAL INFILE '/Users/max-hoffman/mydb/data.csv'
       INTO TABLE xy
       FIELDS TERMINATED BY ',' ENCLOSED BY ''
       LINES TERMINATED BY '\n'
       IGNORE 1 LINES;
Query OK, 8 rows affected (0.01 sec)

The file is located at /Users/max-hoffman/mydb/data.csv, and the additional configuration indicates that 1) fields are separated by commas, 2) rows are separated by line breaks, and 3) the first (header) row should not be imported.

MySQL has a wealth of options to configure permissions and execution of LOAD DATA. If you find Dolt missing a specific configuration option or pathway, let us know and we will be happy to add it.
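
For example, a tab-separated file without a header row could be loaded with a variant like the following sketch. The data.tsv file is hypothetical and the option names are standard MySQL; anything Dolt rejects is exactly the kind of gap worth reporting:

mysql> LOAD DATA LOCAL INFILE '/Users/max-hoffman/mydb/data.tsv'
       INTO TABLE xy
       FIELDS TERMINATED BY '\t'
       LINES TERMINATED BY '\n';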

Benchmarking Performance#

The Good#

The good news is that Dolt trails MySQL’s import performance by roughly the same margin as the rest of our benchmarks, about 2x. This number will scale naturally as we make all of Dolt faster. But there is room to tighten import independently if the 2x ends up bottlenecking users!

| test_name       | detail       | row_cnt | sorted | mysql_time (s) | sql_mult | cli_mult |
|-----------------|--------------|---------|--------|----------------|----------|----------|
| col type        | datetime     | 200000  | 1      | 0.87           | 2.21     | 2.31     |
| col type        | varchar      | 200000  | 1      | 0.75           | 2.4      | 2.77     |
| config width    | 2 cols       | 200000  | 1      | 0.81           | 1.73     | 1.8      |
| config width    | 32 cols      | 200000  | 1      | 2.07           | 1.69     | 2.43     |
| config width    | 8 cols       | 200000  | 1      | 1.02           | 1.78     | 2.47     |
| pk type         | float        | 200000  | 1      | 1.02           | 1.77     | 1.49     |
| pk type         | int          | 200000  | 1      | 0.83           | 1.76     | 1.75     |
| pk type         | varchar      | 200000  | 1      | 1.72           | 1.92     | 2.03     |
| row count       | 1.6mm        | 1600000 | 1      | 6              | 2        | 2.1      |
| row count       | 400k         | 400000  | 1      | 1.54           | 1.9      | 1.99     |
| row count       | 800k         | 800000  | 1      | 3.02           | 1.95     | 2.06     |
| secondary index | four index   | 200000  | 1      | 4.52           | 1.43     | 1.51     |
| secondary index | no secondary | 200000  | 1      | 0.97           | 1.72     | 1.96     |
| secondary index | one index    | 200000  | 1      | 1.31           | 1.76     | 1.89     |
| secondary index | two index    | 200000  | 1      | 2.44           | 1.52     | 1.59     |
| sorting         | shuffled 1mm | 1000000 | 0      | 6.65           | 1.9      | 1.94     |
| sorting         | sorted 1mm   | 1000000 | 1      | 6.26           | 2.03     | 2.04     |

Each row is a particular import configuration described by the test name, detail, row count, and a sorted flag. The next column is the MySQL time (in seconds) to perform the import, and the last two columns are Dolt time multipliers for LOAD DATA INFILE from SQL and for dolt table import from the command line. For example, the 1.6mm row count import takes 6.0 seconds in MySQL, about 12.0 seconds in Dolt SQL, and about 12.6 seconds on the Dolt CLI.

The Bad#

Both MySQL and Dolt fall short importing a certain kind of dump file. Consider a “batched” import:

insert into xy values
    (1, '1'),
    (2, '2'),
    (3, '3'),
    (4, '4');

vs. an “unbatched” import:

insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');

Batching runs 30x faster in MySQL and 130x faster in Dolt, even though the files themselves are almost the same. Moving a row to disk is fast, but Dolt makes strict isolation and durability guarantees between queries, so each single-row insert pays transaction setup and teardown costs. Improving that lifecycle machinery to streamline transaction throughput is hard.

But don’t fret! We created a helper tool with a workaround. This insert batcher reduces the number of transaction setup/teardowns by combining insert queries.

After installing Go we can create a batched file:

$ go version
go version go1.19 darwin/amd64

$ go install github.com/max-hoffman/insert-batcher@latest

$ cat dump.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');
insert into xy values (5, '5');
insert into xy values (6, '6');
insert into xy values (7, '7');
insert into xy values (8, '8');

$ go run github.com/max-hoffman/insert-batcher@latest \
    -in dump.sql \
    -out batch.sql \
    -b 4

$ cat batch.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1'), (2, '2'), (3, '3'), (4, '4');
insert into xy values (5, '5'), (6, '6'), (7, '7'), (8, '8');

The default batching factor -b of 500 should amortize the transaction overhead for most imports.
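
With the batched file in hand, the import runs the same way as before:

$ dolt sql < batch.sql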

| server | detail      | row_cnt | time (s) |
|--------|-------------|---------|----------|
| dolt   | batch sql   | 10000   | 0.14     |
| dolt   | by line sql | 10000   | 17.96    |
| mysql  | batch sql   | 10000   | 0.1      |
| mysql  | by line sql | 10000   | 3.58     |

The Ugly#

Blob stands for “binary large object”, and in MySQL corresponds to TEXT, JSON, and BINARY column types. We distinguish between blobs and VARCHAR for sizing reasons. A block of memory typically contains a range of rows, but a blob typically uses several blocks’ worth of memory on its own. Instead of disrupting the compactness of the regular tree, we write blobs to separate storage. Data blocks contain rows of mostly inlined values and reference blob files written elsewhere.

Here is an example of a schema that will be written inline. All fields are laid out next to each other in row-oriented storage:

create table xy (
  x int primary key,
  y varchar(30),
  z varchar(30),
  w varchar(30)
);

This schema writes every y field as a separate file:

create table xy (
  x int primary key,
  y text,
  z varchar(30),
  w varchar(30)
);

When we started this work, the blob schema was more than 10x slower on import!

| detail  | row_cnt | time (s) |
|---------|---------|----------|
| no blob | 200000  | 1.45     |
| 1 blob  | 200000  | 18.43    |
| 2 blobs | 200000  | 61.77    |

There are some good reasons why blobs take longer to write than VARCHAR. We write blobs as a tree, not a linear sequence of bytes. The blob tree nodes are pooled with the rest of the table’s tree nodes, so blobs usually expand the total node count by a factor of 10-100x. If your TEXT fields fit in a VARCHAR, you are always better off sticking with VARCHAR!
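
If a dump arrives with a TEXT column whose values you know are short, the conversion is a one-line ALTER. A sketch against the schema above; pick a length that actually fits your data, since the statement errors (or truncates, depending on SQL mode) when an existing value is too long:

ALTER TABLE xy MODIFY COLUMN y varchar(255);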

Even then, many MySQL dumps include TEXT columns that we need to support for compatibility. We discovered some not-so-good reasons why blobs were slow, and managed to streamline this edge case in a recent PR:

| detail  | row_cnt | time (s) |
|---------|---------|----------|
| no blob | 200000  | 1.25     |
| 1 blob  | 200000  | 2.23     |
| 2 blobs | 200000  | 2.64     |

Stay tuned for a follow up blog about how we optimized blob import throughput by 10x.

Summary#

We gave a summary of how to import data into Dolt. We discussed how fast Dolt is compared to MySQL, which paths are fast, which are prone to foot-guns, and which are slow and require more consideration.

Next time, we will follow up with a performance-focused blog detailing how we streamlined a particularly hairy edge case: TEXT column imports.

We are tracking import performance changes on each release. If anyone finds gaps in our testing or has particular edge cases they would like us to address, we encourage you to reach out on Discord, GitHub, or Twitter!
