Fsion - 1. Size
17 May 2019PLEASE NOTE SOURCE CODE IS NO LONGER AVAILABLE AND THE POST IS HERE JUST FOR INFORMATION
Previously we introduced Fsion - 0. Introduction, now we will explore database size.
The sample data set is around 10 months of daily position files for the 280 iShares funds available online. All fields in the files are loaded apart from any that can be derived.
Funds |
280 |
Days |
206 |
Position files |
71,261 |
Size unzipped |
4.4 GB |
Size .zip normal |
1.1 GB |
Size .7z ultra |
199 MB |
DataSeries Compression
DataSeries is an immutable collection representing an ordered table of Reporting Date, Transaction Id and int64
encoded Values
with the latest values at the top.
This is encoded as a byte array. The first row is stored as varints. Each subsequent row is stored as a difference to the above field values as varints.
With sensible encoding using offsets and the fact that values tend to be close to zero, even single row DataSeries are several times smaller than a more standard representation. Since the table is ordered, and the values in each row are very likely to be close to the ones above, very high compression ratios are possible.
Data Details
Text values are stored in a SetSlim<Text>
collection, numeric values are encoded directly to int64
.
The DataSeries are stored in a MapSlim<EntityAttribute,DataSeries>
.
Below is a table of count and number of bytes by entity type (column) and attribute (row):
Text: Count = 59,099 Max length = 50
Count |
transaction |
entitytype |
attribute |
instrument |
position |
---|---|---|---|---|---|
uri |
0 |
0 |
0 |
0 |
0 |
name |
0 |
5 |
20 |
38,036 |
0 |
time |
71,262 |
0 |
0 |
0 |
0 |
attribute_type |
0 |
0 |
20 |
0 |
0 |
attribute_isset |
0 |
0 |
0 |
0 |
0 |
isin |
0 |
0 |
0 |
36,476 |
0 |
ticker |
0 |
0 |
0 |
38,036 |
0 |
currency |
0 |
0 |
0 |
38,036 |
0 |
assetclass |
0 |
0 |
0 |
38,023 |
0 |
sector |
0 |
0 |
0 |
37,927 |
0 |
exchange |
0 |
0 |
0 |
12,046 |
0 |
country |
0 |
0 |
0 |
38,036 |
0 |
coupon |
0 |
0 |
0 |
25,552 |
0 |
maturity |
0 |
0 |
0 |
25,546 |
0 |
price |
0 |
0 |
0 |
38,036 |
0 |
duration |
0 |
0 |
0 |
25,552 |
0 |
fund |
0 |
0 |
0 |
0 |
147,323 |
instrument |
0 |
0 |
0 |
0 |
147,323 |
nominal |
0 |
0 |
0 |
0 |
147,323 |
Size Estimates
Memory 32-bit: Text = 2.9 MB Data = 61.3 MB Total = 64.2 MB
Memory 64-bit: Text = 3.8 MB Data = 75.1 MB Total = 78.9 MB
Size on disk = 40.3 MB
Extrapolating these curves to 10 years of files would give total memory usage of around 650 MB. The files contain the key changing attributes. A database with a number of additional attributes would be expected to comfortably fit in 1 to 5 GB.
Conclusion
The database file size is small enough to store in github and can be used going forward for testing and performance benchmarking.
Looking at the size on disk compared to 32-bit and 64-bit in memory estimates shows that the objects and pointers contribute a large amount to the size. This is not surprising since each 32-bit object has an 8 byte header and 16 bytes for 64-bit, plus 4 and 8 bytes for each reference respectively. A whole single row DataSeries in the above table is only around 8 bytes.
If the DataSeries were not in a time series compressed format this object and pointer overhead would be a lot higher. This agrees with what is often found in server caches. Holding and tracking fine grained subsets of the database can actually use a lot of memory.
The data modelled is for one of the largest financial asset managers. Fast to calculate derived data such as profit and returns (which also tend to be costly to store) should be excluded. By doing this and storing only transaction data compressed in memory it is possible to use Fsion for many financial databases.
It also shows the estimates in the Data-First Architecture post are too high as they don't take account of the DataSeries compression that is possible.
Next, we will look at the performance characteristics of this database compared to other options.