Binary file structure for PAMGUARD detector output.
Douglas Gillespie, 2010
The primary storage site for PAMGUARD output data is a relational database (currently either MS Access or MySQL although other types may be added in the future).
Before 2010, the only other storage solution came from the click detector, which writes binary files in the RainbowClick (*.clk) file format, since click data is often additionally processed using RainbowClick offline. Over the next year or so, this offline analysis functionality will be added to PAMGUARD, making RainbowClick redundant. Long-term support for the .clk format is therefore not required.
Current annoyances are:
1. The database is not suitable for storage of variable record length data (e.g. a snip of click waveform as used by RainbowClick or the time/amplitude/frequency contour of a whistle).
2. The RainbowClick binary file format is awful: it works OK from C, but is a nightmare in Java and is not practical to evolve.
3. The click files need to be written with a random access file writer, which is more complicated than a simple data stream output (most Java output types).
4. Databases often have limited size.
Therefore:
1. We need a replacement for the .clk file format.
2. We need something similar for other PAMGUARD output tasks.
Therefore we need some new binary format storage solution which can be used by many different PAMGUARD modules for storing detector (and other) data.
Java serialisation: Fast and easy to write out Java objects but cannot open these files with anything but Java (e.g. could not write a Matlab function to access the data).
Pure binary storage (like .clk files): Need to translate the data of each stored object from its Java form into a byte array prior to storage and convert back the other way afterwards. However, data would be readable in Matlab or any other program so long as you knew the format of each object.
A common file format for all PAMGUARD module output: pgdf, for PAMGUARD Data File.
Output (from PAMGUARD) will assume one-way output and input streams rather than random access, although other programs could of course open the files in any way they wish. This means that the files will know their start time, which will be encoded in both the file name and the header, but the end time will only be accessible as the last object in the file, which may take time to read.
Smaller index files (.pgdx) will be written to accompany each pgdf file which contain just the start header and end footer of each file, so that PAMGUARD can rapidly query a file repository to see what’s in it.
Files will have a common structure, which PAMGUARD will always be able to understand, although specific objects within the file will require information specific to a particular module.
For a specific PAMGUARD configuration, all output files will be stored in the same master directory, although it’s possible that this may contain multiple sub folders, e.g. one sub folder per day.
There will be a 1:1 correspondence between PAMGUARD data blocks and binary data streams (i.e. all objects in a given file should be of the same type).
Output will be a series of binary objects. Every object will start with an int32 (a 32-bit integer, i.e. a Java int) giving the size of that object in bytes. This number includes itself in the size calculation. So it will always be possible to skip through the file using the following pseudocode:
While not eof
    objectSize = ReadInt32()
    SkipForwardBytes(objectSize - 4)
Next
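The pseudocode above could be sketched in Java roughly as follows. The class and method names (ObjectSkipper, countObjects, sampleStream) are invented for this illustration and are not part of PAMGUARD; the sketch assumes big-endian int32 length fields, which is what DataInputStream reads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ObjectSkipper {

    /** Counts objects by hopping from one length field to the next. */
    static int countObjects(InputStream in) {
        DataInputStream dis = new DataInputStream(in);
        int count = 0;
        try {
            while (true) {
                int objectSize;
                try {
                    objectSize = dis.readInt(); // includes its own 4 bytes
                } catch (EOFException endOfFile) {
                    return count; // no more objects
                }
                if (objectSize < 4) {
                    throw new IOException("Corrupt object length: " + objectSize);
                }
                int remaining = objectSize - 4;
                // skipBytes may skip fewer bytes than requested, so loop.
                while (remaining > 0) {
                    int skipped = dis.skipBytes(remaining);
                    if (skipped <= 0) {
                        throw new EOFException("Truncated object");
                    }
                    remaining -= skipped;
                }
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds an in-memory stream of length-prefixed dummy objects (demo only). */
    static byte[] sampleStream(int... payloadSizes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            for (int size : payloadSizes) {
                out.writeInt(size + 4); // length field counts itself
                out.write(new byte[size]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sampleStream(10, 0, 25);
        System.out.println(countObjects(new ByteArrayInputStream(data))); // prints 3
    }
}
```

Because each object carries its own length, a reader can skip objects it does not understand without knowing anything about their internal layout.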
The objects for the header and footer will be rigidly defined. However the data objects can be in any format (it being the responsibility of individual module developers to ensure backwards compatibility should anything change).
Following the standard PAMGUARD header is an optional control structure or module header. For example, a detector may wish to write out its detection parameters at this point, probably as a Java serialised object, although anything is allowed; it is the responsibility of the stream that writes the data to read it back in a sensible way.
All numbers (short, int, long, float, double) are written big-endian, i.e. most significant byte first. This is the standard for the platform-independent Java DataOutputStream class. Matlab can read these files by setting the machine format when opening the file, e.g. f = fopen(fileName, 'r', 'ieee-be.l64');
C programmes running on little-endian processors (e.g. x86, which covers most Windows PCs) would need to re-order bytes.
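As a small illustration of the byte ordering, the following sketch writes one int with DataOutputStream and reads the same four bytes back under both byte orders. The class and method names (ByteOrderDemo, writeIntBigEndian, readAs) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {

    /** Writes one int the way PAMGUARD output would: via DataOutputStream. */
    static byte[] writeIntBigEndian(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Interprets the first four bytes of raw as an int in the given byte order. */
    static int readAs(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = writeIntBigEndian(0x01020304);
        // The most significant byte comes first in the file: 01 02 03 04
        System.out.printf("%02x %02x %02x %02x%n", raw[0], raw[1], raw[2], raw[3]);

        // A reader that assumes little-endian sees the bytes reversed...
        int misread = readAs(raw, ByteOrder.LITTLE_ENDIAN);
        // ...and has to swap them back to recover the value.
        System.out.println(Integer.reverseBytes(misread) == 0x01020304); // prints true
    }
}
```

The byte swap done here by Integer.reverseBytes is the same re-ordering a C programme on a little-endian machine would have to perform for every multi-byte number.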
| Section | Field | Format | Notes |
|---|---|---|---|
| File Header | Length of file header in bytes | Int32 | Every object will start with this number. |
| | Object Identifier | Int32 | -1 |
| | Header / general file format | Int32 | Hopefully this changes very, very rarely |
| | “PAMGUARDDATA” | Char(12) | Just so it’s obvious that this really is a PAMGUARD file |
| | PAMGUARD Version | CharUTF* | e.g. 1.8.00 |
| | PAMGUARD Branch | CharUTF* | e.g. Core, Beta, etc. |
| | Data Date | Long (int64) | Data time at start of file in Java millis |
| | Analysis Date | Long (int64) | Time at which analysis started (same as data time for real time) |
| | File Start Sample | Long (int64) | Current sample number for this data stream |
| | Module Type | CharUTF* | Module type |
| | Module Name | CharUTF* | Module name |
| | Stream Name | CharUTF* | Data stream name |
| | Extra info length | Int32 | Length of additional data |
| | Extra info | byte[] | Additional data |
| Module specific Control Structure | Length in File | Int32 | Length of this object = 16 + object binary length |
| | Object Identifier | Int32 | -3 |
| | Module version Info | Int32 | Version info specific to the PAMGUARD module writing data to this stream. |
| | Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!). Can be zero if there is no additional data |
| | Object Data | Byte[] | Length = Object binary Length |
| Object 1 | Length in File | Int32 | Length of this object |
| | Object Identifier | Int32 | Class identifier, which must be unique to this data stream, not across PAMGUARD |
| | Time milliseconds | Int32 | Timestamp in milliseconds relative to start of file “Data Date” |
| | Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!) |
| | Object Data | Byte[] | Length = Object binary Length |
| Object 2 | Length in File | Int32 | Length of this object |
| | Object Identifier | Int32 | Class identifier, which must be unique to this data stream, not across PAMGUARD |
| | Time milliseconds | Int32 | Timestamp in milliseconds relative to start of file “Data Date” |
| | Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!) |
| | Object Data | Byte[] | Length = Object binary Length. Need not be same as Object 1. |
| Etc ... | | | |
| Module footer | Length in File | Int32 | Length of this object = 12 + object binary length |
| | Object Identifier | Int32 | -4 |
| | Object binary Length | Int32 | = Length in File - 12 (a bit of redundancy!). Can be zero if there is no additional data |
| | Object Data | Byte[] | Length = Object binary Length |
| File Footer | Length of footer in bytes | Int32 | |
| | Object Identifier | Int32 | -2 |
| | Total number of objects in file | Int32 | Not counting header, control struct and footer (i.e. can be = 0) |
| | Data Date | Long (int64) | Data time at end of file in Java millis |
| | Analysis Date | Long (int64) | Time at which analysis ended (same as data time for real time) |
| | File End Sample | Long (int64) | Sample number at end of file |
| | File length | Long (int64) | Total length of the file (will be more use when this is repeated in an index file) |
| | File End Reason | Int32 | Reason file ended |
| EOF | | | |
*Strings are often written with the DataOutputStream.writeUTF() function. For standard ASCII characters, this will simply be two bytes (written as a short) giving the length of the string, followed by one byte per character. Unicode characters are also supported in this format; for details see the Java documentation and Wikipedia.
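To make the string layout concrete, here is a small sketch showing what writeUTF() produces for an ASCII version string and how a non-Java reader could decode it by hand. The class and method names (UtfStringDemo, encode, decodeAscii) are illustrative only, and the manual decoder deliberately handles ASCII content only; Java's format for other characters is a modified UTF-8, and the length field counts encoded bytes rather than characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class UtfStringDemo {

    /** Encodes a string exactly as DataOutputStream.writeUTF() does. */
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes by hand, as a Matlab or C reader might (ASCII content only). */
    static String decodeAscii(byte[] raw) {
        // Two-byte big-endian unsigned length, then that many bytes of text.
        int length = ((raw[0] & 0xFF) << 8) | (raw[1] & 0xFF);
        return new String(raw, 2, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = encode("1.8.00");
        System.out.println(raw.length);       // prints 8 (2 length bytes + 6 characters)
        System.out.println(decodeAscii(raw)); // prints 1.8.00
    }
}
```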
Smaller indexing files (.pgdx) will contain the header, control structures and footer from the pgdf files, but none of the objects.
Object identifiers used by the file management system are negative:
-1 = Header
-2 = Footer
-3 = Module header
-4 = Module footer
Individual modules can use any positive number and these numbers need only be unique within the module reading and writing the data.
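As an illustration of how the file header described above might be written with a one-way stream, the sketch below builds the header body in memory first so that its total length can be written in front of it. Everything named here (PgdfHeaderSketch, buildHeader, the parameter names, and the example values in main, including the header format number 1) is an assumption made for the example, not existing PAMGUARD code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PgdfHeaderSketch {

    static final int ID_FILE_HEADER = -1;       // identifier from the list above
    static final String MAGIC = "PAMGUARDDATA"; // 12 single-byte characters

    /** Returns the complete header object, including its leading length field. */
    static byte[] buildHeader(int headerFormat, String version, String branch,
            long dataDate, long analysisDate, long startSample,
            String moduleType, String moduleName, String streamName,
            byte[] extraInfo) {
        try {
            // Write everything after the length field into a buffer first.
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(body);
            out.writeInt(ID_FILE_HEADER);
            out.writeInt(headerFormat);
            out.writeBytes(MAGIC);       // one byte per character, no length prefix
            out.writeUTF(version);
            out.writeUTF(branch);
            out.writeLong(dataDate);     // Java millis
            out.writeLong(analysisDate);
            out.writeLong(startSample);
            out.writeUTF(moduleType);
            out.writeUTF(moduleName);
            out.writeUTF(streamName);
            out.writeInt(extraInfo.length);
            out.write(extraInfo);
            out.flush();

            // Now the length is known: body plus the 4 bytes of the length itself.
            ByteArrayOutputStream framed = new ByteArrayOutputStream();
            DataOutputStream framedOut = new DataOutputStream(framed);
            framedOut.writeInt(body.size() + 4);
            framedOut.write(body.toByteArray());
            framedOut.flush();
            return framed.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        byte[] header = buildHeader(1, "1.8.00", "Core", now, now, 0L,
                "Click Detector", "Clicks", "Clicks", new byte[0]);
        System.out.println("Header length in bytes: " + header.length);
    }
}
```

Buffering each object before writing it is one simple way to satisfy the "length first" rule without needing a random access file.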
Example data formats for specific modules
Click Detector Version 0
| Field | Format | Notes |
|---|---|---|
| Time millis | Int64 | |
| Start sample | Int64 | |
| Channel map | Int32 | |
| Triggered channels | Int32 | |
| Num delay measurements | Int16 | |
| Delay measurements | Float[] | Usually nChan*(nChan-1)/2 |
| Num angle measurements | Int16 | (0, 1 or 2) |
| Angle measurements | Float[] | |
| Duration (samples) | Int16 | |
| WaveData | Int16[][] | |
Click Detector Version 1
| Field | Format | Notes |
|---|---|---|
| Time millis | Int64 | |
| Start sample | Int64 | |
| Channel map | Int32 | |
| Triggered channels | Int32 | |
| Click Type | Int16 | |
| Num delay measurements | Int16 | Usually nChan*(nChan-1)/2 |
| Delay measurements | Float[] | |
| Num angle measurements | Int16 | (0, 1 or 2) |
| Angle measurements | Float[] | |
| Duration (samples) | Int16 | |
| Wave Max Amplitude | float | Max amplitude of wave data. |
| WaveData | Int8[][] | Wave data scaled by 127/max amplitude so that it uses the full dynamic range of 8 bit data |
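The 8 bit scaling used for WaveData in Version 1 could look something like the sketch below. It assumes the waveform is held as a double[channel][sample] array; that representation, the rounding choice, and the names WaveScaling, maxAmplitude, toInt8 and fromInt8 are assumptions for illustration.

```java
final class WaveScaling {

    /** Largest absolute sample value over all channels. */
    static float maxAmplitude(double[][] wave) {
        double max = 0.0;
        for (double[] channel : wave) {
            for (double sample : channel) {
                max = Math.max(max, Math.abs(sample));
            }
        }
        return (float) max;
    }

    /** Scales by 127/maxAmp so the loudest sample maps to +/-127. */
    static byte[][] toInt8(double[][] wave, float maxAmp) {
        double scale = (maxAmp > 0) ? 127.0 / maxAmp : 0.0; // silent click: all zeros
        byte[][] out = new byte[wave.length][];
        for (int c = 0; c < wave.length; c++) {
            out[c] = new byte[wave[c].length];
            for (int i = 0; i < wave[c].length; i++) {
                long scaled = Math.round(wave[c][i] * scale);
                // Clamp in case float rounding of maxAmp pushes a value past 127.
                out[c][i] = (byte) Math.max(-127, Math.min(127, scaled));
            }
        }
        return out;
    }

    /** Reverses the scaling using the stored Wave Max Amplitude. */
    static double[][] fromInt8(byte[][] data, float maxAmp) {
        double[][] wave = new double[data.length][];
        for (int c = 0; c < data.length; c++) {
            wave[c] = new double[data[c].length];
            for (int i = 0; i < data[c].length; i++) {
                wave[c][i] = data[c][i] * (double) maxAmp / 127.0;
            }
        }
        return wave;
    }

    public static void main(String[] args) {
        double[][] wave = { { 1.0, -1.0, 0.25 } };
        float maxAmp = maxAmplitude(wave);
        byte[][] packed = toInt8(wave, maxAmp);
        System.out.println(packed[0][0] + " " + packed[0][1] + " " + packed[0][2]); // prints 127 -127 32
    }
}
```

Storing the maximum amplitude alongside the scaled samples is what allows the reader to restore the original amplitude scale, at the cost of 8 bit resolution.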
Binary Storage Manager (BSM) is a plugin module in the utilities group (next to the database, since the two perform a similar function). Only one BSM module will be allowed.
BSM will control:
1. The folder data are stored in.
2. Opening new files at program start and closing them at program end. Will set file names based on the data stream name and the time.
3. Reopening files on the hour (or at some other set time interval; e.g. when analysing wav file data offline, it may start a new output file every time a new wav file starts).
4. Reloading data for the PAMGUARD viewer.
5. Rewriting data files changed during offline analysis
For data streams: modify PamDataBlock, and also make it possible for other objects to send data to BSM.
They need to:
1. register with BSM
2. hand BSM settings data, which will be requested each time a new file is created (see format above).
3. send BSM references to all new data units
4. know how to recreate data units from data that have been read back in with PAMGUARD viewer.
PAMGUARD Data Units
1. Need functions to return their binary data object (which may just be a serialised version of themselves).
2. Need to be able to identify themselves in agreement with a central registry
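One possible shape for these requirements is sketched below: a small interface that a data unit implements to identify itself and return its binary data, plus a matching function to recreate the unit when the data are read back. The interface BinaryStorable, the class SimpleClick, its three fields and the identifier value 1000 are all hypothetical; the real click format has more fields (see the tables above).

```java
import java.nio.ByteBuffer;

/** Hypothetical contract for anything that can be written to the binary store. */
interface BinaryStorable {
    /** Identifier for this object type; positive, unique within its data stream. */
    int getObjectId();

    /** The object's data as a big-endian byte array. */
    byte[] getBinaryData();
}

/** Cut-down click with only three fields, for illustration. */
final class SimpleClick implements BinaryStorable {
    static final int OBJECT_ID = 1000; // arbitrary positive number (assumption)
    static final int BYTE_LENGTH = 8 + 8 + 4;

    final long timeMillis;
    final long startSample;
    final int channelMap;

    SimpleClick(long timeMillis, long startSample, int channelMap) {
        this.timeMillis = timeMillis;
        this.startSample = startSample;
        this.channelMap = channelMap;
    }

    @Override
    public int getObjectId() {
        return OBJECT_ID;
    }

    @Override
    public byte[] getBinaryData() {
        // ByteBuffer is big-endian by default, matching DataOutputStream.
        return ByteBuffer.allocate(BYTE_LENGTH)
                .putLong(timeMillis)
                .putLong(startSample)
                .putInt(channelMap)
                .array();
    }

    /** Recreates a unit from data read back in, e.g. by the PAMGUARD viewer. */
    static SimpleClick fromBinaryData(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        return new SimpleClick(buffer.getLong(), buffer.getLong(), buffer.getInt());
    }

    public static void main(String[] args) {
        SimpleClick click = new SimpleClick(1262304000000L, 48000L, 3);
        SimpleClick copy = fromBinaryData(click.getBinaryData());
        System.out.println(copy.timeMillis + " " + copy.startSample + " " + copy.channelMap);
    }
}
```

Keeping the pack and unpack code side by side in the same class makes it easier for a module developer to keep the two in step when the format changes.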
PAMGUARD Settings
PAMGUARD Settings are written to the binary store whenever PAMGUARD starts from the Start menu or from the Network controller. They are not written when PAMGUARD restarts due to a buffer overflow in acquisition. Settings are written to .psfx files. These encapsulate the current psf format used for more general settings, but individual serialised Java objects are wrapped up in a similar way to other binary data so that other programmes (e.g. Matlab) can at least read a list of modules.