Analysis
Task description
The file bigf.json.bz2 is a compressed JSON file that describes disks by model and serial number. The task is to count how many distinct models there are and how many times each one occurs.
Implementation
The program uses the yajl C library to process the input JSON file. The library was chosen for its ability to parse JSON data incrementally from a stream.
The file is read in chunks of 65536 bytes. Each JSON dictionary of the form
{"id":0,"model":"XXXX","serial":"XXXX"}
is processed separately: its model value is extracted and stored in a hash table.
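The chunked reading described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: feed_in_chunks, chunk_cb_t, chunk_stats_t and count_chunk are hypothetical names. In a yajl-based program the callback would typically pass each chunk on to yajl_parse; here a simple byte counter stands in for the parser so that the sketch has no external dependency.

```c
#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 65536  /* chunk size quoted in the text above */

/* Hypothetical callback type: receives one chunk, returns 0 to continue. */
typedef int (*chunk_cb_t)(void *ctx, const unsigned char *buf, size_t len);

/* Hypothetical context for the stand-in callback below. */
typedef struct {
    size_t bytes;   /* total bytes seen */
    size_t chunks;  /* number of chunks delivered */
} chunk_stats_t;

/* Stand-in for the parser: only counts what it is given. */
int count_chunk(void *ctx, const unsigned char *buf, size_t len)
{
    chunk_stats_t *st = ctx;
    (void)buf;
    st->bytes += len;
    st->chunks++;
    return 0;
}

/*
 * Read fp in CHUNK_SIZE pieces and hand each piece to cb.
 * Returns 0 on success, -1 on a read error or if cb returns non-zero.
 */
int feed_in_chunks(FILE *fp, chunk_cb_t cb, void *ctx)
{
    static unsigned char buf[CHUNK_SIZE];
    size_t n;

    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        /* A yajl-based program would call yajl_parse(handle, buf, n) here. */
        if (cb(ctx, buf, n) != 0)
            return -1;
    }
    return ferror(fp) ? -1 : 0;
}
```

Because the parser is fed one chunk at a time, memory use stays bounded by the chunk size regardless of how large the input file is.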
The model string is stored in the hash table by the disk_ht_insert function.
disk_ht_ret_t disk_ht_insert(disk_ht_t *self, const char *model)
Store a disk model in the hash table. If the model is already present in the hash table, the model_cnt value of the corresponding entry is incremented.
- Parameters:
self – Reference to disk_ht_t object
model – Model string
- Returns:
DISK_HT_SUCESS for successful insertion, DISK_HT_MEXIST if the model is already present, DISK_HT_EINVAL for an invalid argument, DISK_HT_ENOMEM on memory allocation failure.
The hash table has a constant size of 20 items, set by the DISK_HT_LEN define. Collisions are resolved by external chaining: models whose hash ids map to the same index are stored in a linked list at that index.
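A minimal sketch of insertion with external chaining is given below to make the behaviour concrete. Only the function signature, the return code names, model_cnt and DISK_HT_LEN come from the documentation above. The struct layouts, the numeric values of the return codes, the djb2 hash in disk_ht_hash and the disk_ht_count helper are assumptions made for illustration and may differ from the real implementation.

```c
#include <stdlib.h>
#include <string.h>

#define DISK_HT_LEN 20  /* table size from the text above */

/* Names as documented; the numeric values are assumed. */
typedef enum {
    DISK_HT_SUCESS = 0,
    DISK_HT_MEXIST,
    DISK_HT_EINVAL,
    DISK_HT_ENOMEM
} disk_ht_ret_t;

/* Hypothetical layout: one chain node per distinct model. */
typedef struct disk_ht_node {
    char *model;
    unsigned long model_cnt;
    struct disk_ht_node *next;
} disk_ht_node_t;

typedef struct {
    disk_ht_node_t *slots[DISK_HT_LEN];
} disk_ht_t;

/* djb2 string hash reduced to a table index (illustrative choice). */
unsigned long disk_ht_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % DISK_HT_LEN;
}

disk_ht_ret_t disk_ht_insert(disk_ht_t *self, const char *model)
{
    if (self == NULL || model == NULL)
        return DISK_HT_EINVAL;

    unsigned long idx = disk_ht_hash(model);

    /* Walk the chain at this index; bump the counter if the model exists. */
    for (disk_ht_node_t *n = self->slots[idx]; n != NULL; n = n->next) {
        if (strcmp(n->model, model) == 0) {
            n->model_cnt++;
            return DISK_HT_MEXIST;
        }
    }

    /* Not found: allocate a new node and push it at the head of the chain. */
    disk_ht_node_t *n = malloc(sizeof(*n));
    if (n == NULL)
        return DISK_HT_ENOMEM;
    size_t len = strlen(model) + 1;
    n->model = malloc(len);
    if (n->model == NULL) {
        free(n);
        return DISK_HT_ENOMEM;
    }
    memcpy(n->model, model, len);
    n->model_cnt = 1;
    n->next = self->slots[idx];
    self->slots[idx] = n;
    return DISK_HT_SUCESS;
}

/* Hypothetical helper: occurrences recorded for model, 0 if absent. */
unsigned long disk_ht_count(const disk_ht_t *self, const char *model)
{
    for (const disk_ht_node_t *n = self->slots[disk_ht_hash(model)]; n; n = n->next)
        if (strcmp(n->model, model) == 0)
            return n->model_cnt;
    return 0;
}
```

With only a handful of distinct models, as in the outputs shown later in this document, the chains stay short, so a small fixed-size table is enough even for hundreds of millions of entries.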
Note
The program is constrained by the buffer used for storing the model string, which is set to 1000 bytes by the DISK_MODEL_LEN_MAX define.
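To illustrate the constraint, here is a small sketch of a bounded copy into such a buffer. yajl passes string values to its callbacks as a pointer plus a length, without a terminating NUL byte, so the copy has to add the terminator itself. The copy_model function is a hypothetical name, and rejecting over-long models is an assumption: the documentation does not say how the program handles them.

```c
#include <stddef.h>
#include <string.h>

#define DISK_MODEL_LEN_MAX 1000  /* buffer size from the note above */

/*
 * Copy a length-delimited model string into dst, which must hold
 * DISK_MODEL_LEN_MAX bytes, and NUL-terminate it.
 * Returns 0 on success, -1 if an argument is NULL or the string
 * plus its terminator does not fit (assumed policy).
 */
int copy_model(char *dst, const unsigned char *src, size_t len)
{
    if (dst == NULL || src == NULL || len >= DISK_MODEL_LEN_MAX)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

Under this policy the longest model that fits is DISK_MODEL_LEN_MAX - 1 bytes, since one byte is reserved for the terminator.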
Tests
The test analysis/test/analysis.py processes the data stored in data_sample.json and compares the result with the output of the analysis binary.
Test output
C binary /home/iliya/Work/StorPool/StorPool/build/analysis/analysis output:
Info: Processing: /home/iliya/Work/StorPool/StorPool/analysis/test/data_sample.json ...
Info: Disk data
HGST2048T: 7
SSDLP2: 5
broken: 13
HGST3T: 3
DRV1: 12
RDV2: 7
HGST8T: 13
SSDF1: 10
SCSI3HD: 6
DSD07461: 10
123456789: 8
SSDDC1: 6
Info: Total entries: 100
Ok
Output for bigf.json
$ analysis -i ~/Downloads/bigf.json
Info: Processing: /home/iliya/Downloads/bigf.json ...
Info: Disk data
HGST2048T: 33332531
SSDLP2: 33345174
broken: 33328584
HGST3T: 33337292
DRV1: 33338513
RDV2: 33332954
MODEL: 1
HGST8T: 33337967
SSDF1: 33328579
SCSI3HD: 33329611
DSD07461: 33333959
123456789: 33327094
SSDDC1: 33327742
Info: Total entries: 400000001