blog-static/content/blog/starbound.md
Danila Fedorin 15beddf96b Update btree article with new math delimiters
Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
2024-05-13 19:07:44 -07:00

97 lines
12 KiB
Markdown

---
title: A Look Into Starbound's File Formats
date: 2017-05-17 22:34:04.463912
tags: ["Starbound"]
---
After playing a few hours of Starbound, I decided that I wanted to mess around with its file format, perhaps make a save manager/swapper, or something. There wasn't a really clear idea in my head as to what I was going to do, but I wanted to start writing _something_ and see what happens. Looking up "starbound file formats" led me immediately [here](http://seancode.com/galileo/format/). I tried to follow the outline with my own implementation, but things weren't working. At all. So, I cracked open a hex editor. According to the website, I was supposed to see:
> Byte[4] Signature "BTDB"
However, this was not the case. Instead, a very similar `BTreeDB5` was what greeted me at the beginning of the file. Not good. The next links in my search results were [py-starbound](https://github.com/blixt/py-starbound), and its friend, by the same author, [go-starbound](https://github.com/blixt/go-starbound). Neither of these was in a language that I felt comfortable using for my project, so there was only one solution - to figure out the formats that they used, and try one more time to write my own.
__py-starbound__, nicely enough, actually has a file named `FORMATS.md`. This file proved very helpful, though slightly incomplete. One key piece of information was missing from it - how the `BTreeDB5` was organized beyond its header. The only thing in `FORMATs.md` on this subject was:
> This section will contain information on how to retrieve a value from a BTreeDB5 database.
Not very helpful. Before I go into what I managed to determine from the code, we may first take a look at one thing that we already know about the world format - it is a [B-Tree](https://en.wikipedia.org/wiki/B-tree).
### Binary Search Trees
The B-Tree is a generalization of a Binary Search Tree, or BST for short. Binary Search trees (and B-Trees in general) operate on data that can be ordered consistently, the simplest example being numbers. For instance, as an example, I'll be using a BST that holds integers. A BST is made up of nodes, objects that actually hold the pieces of data that the tree itself organizes.
In a BST, the nodes are organized in a simple way. Each node can have up to two _children_ (sub-nodes), and each of those can have up to two children, etc. The children are generally classified as _right_ and _left_. Conventionally, left children always have a value that is below (or comes before) the value of the node whose child they are (their _parent_), and right children have a bigger value.
In order to find the data in the tree, a program would first examine the very first node of the BST, called the _root_. If the data it's looking for is equal to that of the root, the search ends there. If the data is less, it moves on to the left child, and performs the same operation there. Similarly, if the data is larger, it the search moves on to the right child. This continues until either a node with the same value is found, or until there is no child left to examine - the algorithm reached the end of the tree.
The purpose of the binary search tree is to speed up lookups. Let's compare it to a simple list of values. To check if a value is there, a naive approach would go from the beginning to the end of the list, searching for the data. For instance, if we wanted to check whether the list `(1, 2, 3, 4, 5, 6, 7, 8, 9)` contained the number nine, we would have to look at the first element (one), then the second, all the way until the 9th. This is fine for small lists, but when the size of the lists become big enough, this operation becomes fairly slow. Let's now take a look at how the same operation would be performed in a binary tree.
Here's an example of a valid binary tree with the numbers one through nine.
```
(5)
(2) (7)
(1) (3) (6) (8)
(4) (9)
```
The root node of this tree is 5, its left child is 2, and its right child is 7. In order to check if the nine is in this tree, we only need to look at the five, then the seven, then the eight, and finally, the nine. We have, therefore, looked at only four of the numbers, instead of all nine. But let's look at another valid binary search tree.
```
(1)
(2)
(3)
(4)
(5)
...
```
In this tree, the root node is 1, and the right child is 2. None of the nodes have a left child. Looking through this tree would be as slow as looking through a list - we'd have to look through all the numbers before we find 9. No good.
__Although the average efficiency of a Binary Search Tree is \(O(\log n)\), meaning that for a tree with \(n\) nodes, it will only need to examine up to \(logn\) of these nodes, it can nonetheless be as inefficient as \(O(n)\), meaning that it will have to examine every node in the tree.__
This isn't good enough, and many clever algorithms have been invented to speed up the lookup of the tree by making sure that it remains _balanced_ - that is, it _isn't_ arranged like a simple list. Some of these algorithms include [Red-Black Trees](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree), [AVL Trees](https://en.wikipedia.org/wiki/AVL_tree), and, of course, B-Trees.
### B-Trees
B-Trees are a generalization of Binary Search Trees. That means that every Binary Search Tree is a B-Tree, but not all B-Trees are BSTs. The key difference lies in the fact that B-Trees' nodes aren't limited to having only two child nodes, and can also have more than one value.
Each B-Tree node is a sorted array of values. That is, instead of a single number like the BST that we've looked at, it has multiple, and these numbers _must_ be sorted. Below are some examples of B-Tree nodes:
```
(1)
(1, 2)
(2, 3, 5)
```
This raises the question: How do we find the next node to examine if the node doesn't contain our child? Before, we only had two choices, but now, our data can be bigger than one of the values and smaller than the other...what do we do?
This is solved using another property of B-Trees - the number of children of a node is always one more than the number of values in it. Thus, a node with one value will have two children, and a node with two will have three children. Thus, to find the next node we want to look for, we find the child that falls "between" the two values that our data lies between. To clarify, let's look at a simple B-Tree:
```
(10, 20, 30)
(1st) (2nd) (3rd) (4th)
```
If we were looking for the number 15, we'd look between the 10 and the 20, examining the 2nd node, and if we were looking for 45 we'd look past the 30, at the 4th node.
### Starbound B-Trees and BTreeDB5
The BTreeDB5 data structure uses something other than integers for its keys - it uses sequences of bytes. These bytes are compared in a very similar fashion to integers. The game first looks at the first number in the sequence of bytes (like the largest digit in an integer), and if that's the same, moves on to the next one. Also, Starbound B-Trees not only have the values, or _keys_, that they use to find data, but the data itself.
The "nodes" in the BTreeDB are called "blocks" and are one of three types - "index", "leaf", and "free" nodes. "Index" nodes are like the `(10, 20, 30)` node in the above example - they point to other nodes, but actually store no data themselves. The "leaf" nodes actually contain the data, and, if that data is longer than the maximum block size, "leaf" nodes contain the index of the next leaf node where the user might continue to read the data. The "free" nodes are simply free data, empty and ready for Starbound to fill them with something useful.
To be able to read a Starbound BTree, the first thing that needs to be done is to read the general information about the tree. For this, we read the very beginning of the file, called the _header_. [GitHub user blixt has made an excellent table of values that make up the header](https://github.com/blixt/py-starbound/blob/master/FORMATS.md#btreedb5). The ones that matter to us are `Whether to use root node #2 instead` (hereby referred to as "swap", `Root node #1 block index` and `Root node #2 block index`. We also need to know the key size and block size of the B-Tree to be able to jump around in the file.
Once the header has been parsed (this is an exercise left up to the reader), the next step is to find the root node. This is following exactly the general lookup algorithm for a B-Tree. The index in the file (by bytes) of a block is \(headerSize + blockSize \times n\), where \(n\) is the block number. Depending on the value of `swap` (whether to use the second root node), we use either the index of root node 1 or 2 for \(n\), and move to that position.
Next, we proceed to identify the node that we've found. The first two bytes in that node are either the ASCII values of 'F', 'L', or 'I', representing, you guessed it, "Free", "Leaf", or "Index". If the node is an index node, we need to search it for the next node we need to examine. To do so, we first read two values from the node, two 32-bit integers. The first is the number of keys in the block, and the next is the block number of the first child node.
The key -> child data is stored in the node as follows:
```
(1st child)(1st key)(2nd child)(2nd key) ... (nth child)
```
(The key is of the length given in the header, and the child number is a 32-bit integer)
As you can see, the number of children is one more than the number of keys, and it's easier to treat the first child block number as a separate value, and the rest as key -> child pairs, like:
```
(1st child) (1st key, 2nd child) ... ((n-1)th key, nth child)
```
Simply, if the value we're searching for is bigger than the first key only, we go to the second child, if it's bigger than the second key, we go to the third child, etc. If our value is not bigger than any of the keys, we go to the 1st child. After we move to the index of the new child, we once again examine its type, and if it's still "II", we repeat the process.
Once we reach a "leaf" node, we're very close. After the two ASCII characters describing its type, the leaf node will contain a 32-bit integer describing the number of key-data pairs it has. Each key-data pair is made up of the key, a variable integer describing the length of the data, and the data itself. We examine one pair after another, carefully making sure that we don't exceed the end of the block, located at \(headerSize + blockSize \times (n + 1) - 4\). The reason that we subtract four from this equation is to save space for the address of the next block. As I mentioned above, leaf nodes, if their data is bigger than the block size, contain the block number of the next leaf node to which we can continue if we reach the end of the current leaf. If we do reach the end of the current leaf, we read the 32-bit integer telling us the number of the next block, jump to its index, and, after reading the first two bytes to ensure it's actually a leaf, continue reading our data. Once that's done, voila! We have our bytes.
If we're using this method to read a Starbound world file, we need to also know that the data is zlib-compressed. I won't go far into the details, as it appears like both Python and Java provide libraries for the decompression of such data.
This concludes today's post. I might make another post later about my implementation of the BTreeDB search, written in either Kotlin or Java. In the meantime, if you're interested, check out the following [Python](https://github.com/blixt/py-starbound), [Go](https://github.com/blixt/go-starbound) and [Kotlin (my own)](https://github.com/DanilaFe/kt-starbound) implementations of parsers for BTreeDBs and other Starbound file formats.