1 Introduction

We run two types of benchmarks against Codex:

  1. Parallel Uploads. A multithreaded driver program generates \(80\)MB files made of random bytes and uploads them to Codex via the /upload endpoint. We use \(2\) threads for the SQL store (enough to expose its performance issues), and \(2\) and then \(10\) threads for the FS store to see whether it copes with the extra load.

  2. Microbenchmark. A Nim-based driver program generates random \(64\)KB blocks and stores them in the repostore by calling putBlock. We measure the elapsed time for every \(1\,282\) such inserts, which corresponds to \(80\)MB of inserted blocks.

If the blockstore (or the underlying datastore) is the main bottleneck, then the curves for \(1\) and \(2\) should track one another in shape. This should also help us better understand the performance characteristics of the underlying datastores.
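For concreteness, the driver in \(1\) boils down to generating a file of random bytes and timing an HTTP upload to the node from several concurrent workers. A minimal R sketch follows, assuming a local node and an illustrative endpoint URL (the real driver is a separate program, and the exact HTTP verb and port used here are assumptions):

upload_once <- function(i, endpoint = 'http://localhost:8080/upload') {
  payload <- tempfile(fileext = '.bin')
  writeBin(readBin('/dev/urandom', what = 'raw', n = 80 * 2^20), payload)  # 80MB of random bytes
  elapsed <- system.time(
    system2('curl', c('-s', '-o', '/dev/null', '-X', 'POST', '--data-binary', paste0('@', payload), endpoint))
  )[['elapsed']]
  unlink(payload)
  data.frame(upload = i, time = elapsed)
}

# Two concurrent upload streams, as in the SQL store experiment.
uploads <- do.call(rbind, parallel::mclapply(seq_len(200), upload_once, mc.cores = 2))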

1.1 Metrics

For the parallel upload experiments, we track, for the Codex process:

  • time: this is the time, in seconds, that an upload takes to complete;
  • read_bytes: cumulative number of bytes read from disk;
  • write_bytes: cumulative number of bytes written to disk;
  • read_iops: cumulative number of read IO operations;
  • write_iops: cumulative number of write IO operations;
  • cpu_time: cumulative CPU time (user + system) used by the process over the experiment. An increasing slope means higher CPU usage. CPU percent would probably have been a better metric, but CPU time still lets us see the relevant trends;
  • io_wait: the value of delayacct_blkio_ticks, which is documented as “Aggregated block I/O delays, measured in clock ticks (centiseconds)”, but appears to stay constant at zero, so it is probably not measuring much.

For the microbenchmarks, we track time only.
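For reference, the per-process counters above are typically read from Linux's /proc filesystem. The sketch below shows roughly how such a sampler could look in R; the actual collector is a separate tool, and mapping read_iops/write_iops onto the syscr/syscw syscall counters is an assumption:

sample_proc_stats <- function(pid) {
  # /proc/<pid>/io: 'key: value' lines holding cumulative byte and syscall counters.
  io <- readLines(sprintf('/proc/%d/io', pid))
  io <- setNames(as.numeric(sub('.*:\\s*', '', io)), sub(':.*', '', io))
  # /proc/<pid>/stat: drop everything up to the ')' that closes the comm field,
  # so spaces in the process name do not shift the field indices.
  stat <- readLines(sprintf('/proc/%d/stat', pid))
  rest <- strsplit(sub('^.*\\)\\s+', '', stat), '\\s+')[[1]]
  data.frame(
    read_bytes  = io[['read_bytes']],
    write_bytes = io[['write_bytes']],
    read_iops   = io[['syscr']],    # assumption: read IOPS = read syscall count
    write_iops  = io[['syscw']],    # assumption: write IOPS = write syscall count
    cpu_time    = (as.numeric(rest[12]) + as.numeric(rest[13])) / 100,  # utime + stime, assuming 100 ticks/s
    io_wait     = as.numeric(rest[40])  # field 42: delayacct_blkio_ticks
  )
}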

Cooldown. To understand whether performance issues are due to transient state, we may perform a cooldown test. Cooldown means we shut the node down and then resume the experiment using the same on-disk storage state. If the performance degradation is due to things like growing queues or increased GC activity caused by, say, a memory leak, then a cooldown should improve performance for a while.

1.2 Hardware

We run each experiment on a separate VM with a standard configuration. These are \(4\)-core (vCPU) machines with \(80\)GB networked SSD drives.

log_root <- '../../experiment-logs'
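Note that %/% is used throughout this report as a path-joining operator rather than R's integer division. Its actual definition lives in the report's setup code; it is assumed to be roughly equivalent to:

# Assumed definition of the path-join operator used in the chunks below.
`%/%` <- function(lhs, rhs) file.path(lhs, rhs)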

2 SQL Store

sql_log_root <- log_root %/% 'sqlstore'
sql_pu_root <- sql_log_root %/% 'parallel-uploads' %/% 'experiments' # Parallel Upload
sql_mb_root <- sql_log_root %/% 'microbenchmarks' # Microbenchmarks

2.1 Concurrent, \(2\)-thread Upload

experiment_1 <- sql_pu_root %/% '20231025-224709-c2'
ttu_sql <- read_csv(experiment_1 %/% 'experiment.csv', show_col_types = FALSE)
ttu_proc_sql <- read_csv(experiment_1 %/% 'proc-stats.csv', show_col_types = FALSE)

Fig. 2.1 shows that Codex’s performance remains constant until upload \(100\), after which it starts degrading roughly linearly. A similar pattern can be observed in Fig. 2.2, where all indicators start to degrade at around second \(2\,000\). We did not track the actual upload timestamps, but we can roughly estimate the inflection point by summing the upload durations and dividing by two (since we have two threads):

inflection <- sum(ttu_sql |> filter(upload <= 103) |> pull(time)) / 2
inflection
## [1] 1737.584
plotly::ggplotly(time_plot(ttu_sql))

Figure 2.1: Parallel Upload (2 threads), SQL Store

Read behavior is particularly curious: up until around second \(3\,500\), we have a steady flow of read operations (as evidenced by read_iops), yet those incur zero read_bytes. This would be consistent with reads being served entirely from the page cache. At around second \(3\,500\) we start reading from disk, which could mean our reads are stepping out of the cache. This coincides with a drop in write performance, as writes are now sharing disk bandwidth with reads. Overall, the pattern looks like a working set that fit in memory and then suddenly stopped fitting. One speculative explanation would be a memory-mapped WAL file that has grown too large.
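If the underlying datastore is indeed SQLite-backed (an assumption here), one way to probe the WAL hypothesis would be to track the size of the -wal sidecar file alongside the experiment. A rough sketch, where db_path is a hypothetical path to the store's database file:

# Hypothetical check: sample the SQLite WAL file size once per minute.
wal_growth <- function(db_path, samples = 60, interval = 60) {
  do.call(rbind, lapply(seq_len(samples), function(i) {
    size <- file.size(paste0(db_path, '-wal'))  # NA if no WAL file exists
    Sys.sleep(interval)
    data.frame(sample = i, wal_mb = size / 1024^2)
  }))
}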

Another interesting effect is that CPU usage also changes regime and increases over time, meaning the node is having to do extra work.

stats_plot(ttu_proc_sql) + geom_vline(xintercept=inflection, lty = 2, col = 'gray')

Figure 2.2: Parallel Upload (2 threads), SQL Store

To see these effects better, we de-accumulate the metrics in Fig. 2.3.

stats_plot(ttu_proc_sql, cumulative = FALSE) + geom_vline(xintercept=inflection, lty = 2, col = 'gray')
## Warning: Removed 6 rows containing missing values (`geom_line()`).

Figure 2.3: Parallel Upload (2 threads), SQL Store
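For reference, de-accumulation here simply means taking first differences of the cumulative counters between consecutive samples. A minimal sketch of what the cumulative = FALSE option is assumed to do, using the counter names from section 1.1 (dplyr is loaded by the report setup):

# Sketch: replace each cumulative counter by its per-sample increment.
deaccumulate <- function(stats) {
  stats |>
    mutate(across(c(read_bytes, write_bytes, read_iops, write_iops, cpu_time, io_wait),
                  ~ .x - lag(.x)))
}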

2.1.1 Cooldown

We check whether the performance degradation is due to transient state (e.g. chronos’ queue sizes growing too large) by performing a cooldown. Fig. 2.4, which shows performance that is actually worse than at the point where we halted the first experiment, tells us it is not.

experiment_2 <- sql_pu_root %/% '20231026-003628-c2'
ttu_sql_cooldown <- read_csv(experiment_2 %/% 'experiment.csv', show_col_types = FALSE)
ttu_proc_sql_cooldown <- read_csv(experiment_2 %/% 'proc-stats.csv', show_col_types = FALSE)
time_plot(ttu_sql_cooldown) + ylim(c(0,NA))

Figure 2.4: Parallel Upload (2 threads), Post Cooldown, SQL Store

2.2 \(80\)MB Microbenchmark

The microbenchmark (Fig. 2.5) performs similarly to the two-thread upload test, except that insertion times sit at about \(7\) seconds per file, roughly twice as fast as the \(15\) seconds per file seen in the upload test.

mb_sql <- read_csv(sql_mb_root %/% 'sqlstore-perf.csv', show_col_types = FALSE) |> 
  assign_colnames(c('upload', 'time')) |>
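  # raw times appear to be logged in milliseconds; convert to seconds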
  mutate(time = time / 1000)
plotly::ggplotly(time_plot(mb_sql))

Figure 2.5: SQL Store, Microbenchmark

3 File System Store

The filesystem store appears to perform a lot more consistently: no obvious performance degradation was seen in the VM tests, whether running \(2\) (Fig. 3.1) or \(10\) (Fig. 3.3) simultaneous uploads. Read byte counts stay close to zero (Fig. 3.2) while there is a constant stream of read IO operations, which would be consistent with metadata reads against a small database that always land in the page cache.

fs_log_root <- log_root %/% 'fsstore'
fs_pu_root <- fs_log_root %/% 'parallel-uploads' # Parallel Upload
fs_mb_root <- fs_log_root %/% 'microbenchmarks' # Microbenchmarks

3.1 Concurrent, \(2\)-thread Upload

experiment_1_fs <- fs_pu_root %/% '2-threads'
ttu_2_fs <- read_csv(experiment_1_fs %/% 'upload-logs.csv', show_col_types = FALSE)
ttu_2_proc_fsl <- read_csv(experiment_1_fs %/% 'process-stats.csv', show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
time_plot(ttu_2_fs |> filter(time > 10)) + ylim(c(0, NA))

Figure 3.1: Concurrent, \(2\)-thread Upload

stats_plot(ttu_2_proc_fsl)
## Warning: Removed 6 rows containing missing values (`geom_line()`).

Figure 3.2: Concurrent, \(2\)-thread Upload

3.2 Concurrent, \(10\)-thread Upload

experiment_2_fs <- fs_pu_root %/% '10-threads'
ttu_10_fs <- read_csv(experiment_2_fs %/% 'upload-logs.csv', show_col_types = FALSE)
ttu_10_proc_fsl <- read_csv(experiment_2_fs %/% 'process-stats.csv', show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
time_plot(ttu_10_fs |> filter(time > 90)) + ylim(c(0, NA))

Figure 3.3: Concurrent, \(10\)-thread Upload

stats_plot(ttu_10_proc_fsl) +
  ggtitle('Concurrent, 10-thread Upload')
## Warning: Removed 6 rows containing missing values (`geom_line()`).

3.3 \(80\)MB Microbenchmark

mb_fs <- read_csv(fs_mb_root %/% 'fs-microbench.csv', show_col_types = FALSE) |> 
  assign_colnames(c('upload', 'time')) |>
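  # raw times appear to be logged in milliseconds; convert to seconds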
  mutate(time = time / 1000)
time_plot(mb_fs) + ylim(c(0, NA)) + ggtitle('FS Store, Microbenchmark')