Binary file structure for PAMGUARD detector output.

Douglas Gillespie, 2010

1        Introduction

The primary storage site for PAMGUARD output data is a relational database (currently either MS Access or MySQL although other types may be added in the future).

Before 2010, the only other storage solution came from the click detector, which writes binary files in the RainbowClick (*.clk) file format, since click data are often additionally processed offline using RainbowClick. Over the next year or so, this offline analysis functionality will be added to PAMGUARD, making RainbowClick redundant. Long-term support for the .clk format is therefore not required.

Current annoyances are:

1.     The database is not suitable for storing variable record length data (e.g. a snip of click waveform as used by RainbowClick, or the time/amplitude/frequency contour of a whistle).

2.     The RainbowClick binary file format is awful: it works OK from C, but is a nightmare in Java and is not practical to evolve.

3.     The click files need to be written with a random access file writer, which is more complicated than the simple data stream output that most Java output classes provide.

4.     Databases often have a limited maximum size.

Therefore:

1.     We need a replacement for the .clk file format.

2.     We need something similar for other PAMGUARD output tasks.

Therefore we need some new binary format storage solution which can be used by many different PAMGUARD modules for storing detector (and other) data.

2        Binary storage options

Java serialisation: Fast and easy to write out Java objects, but the files cannot be opened with anything but Java (e.g. one could not write a Matlab function to access the data).

Pure binary storage (like .clk files): Need to translate the data of each stored object from its Java form into a byte array prior to storage, and convert back the other way afterwards. However, the data would be readable in Matlab or any other program so long as you knew the format of each object.

3        Solution

A common file format for all PAMGUARD module output: pgdf, for PAMGUARD Data File.

Output (from PAMGUARD) will assume one way output and input streams rather than random access, although other programs could of course open the files in any way they wish. This means that the files will know their start time, which will be encoded in both the file name and the header, but the end time will only be accessible as the last object in the file – which may take time to read.

Smaller index files (.pgdx) will be written to accompany each pgdf file. These contain just the start header and end footer of each file, so that PAMGUARD can rapidly query a file repository to see what’s in it.

Files will have a common structure, which PAMGUARD will always be able to understand, although specific objects within the file will require information specific to a particular module.

For a specific PAMGUARD configuration, all output files will be stored in the same master directory, although it’s possible that this may contain multiple sub folders, e.g. one sub folder per day.

There will be a 1:1 correspondence between PAMGUARD data blocks and binary data streams (i.e. all objects in a given file should be of the same type).

Output will be a series of binary objects. Every object will start with a 32-bit integer (int32, i.e. a Java int) giving the size of that object in bytes. This number includes itself in the size calculation, so it will always be possible to skip through the file using the following pseudocode:

While not eof

    objectSize = ReadInt32()

    SkipForwardBytes(objectSize - 4)

End While
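The skip loop above might look as follows in Java. This is only a sketch: the class and method names (ObjectSkipper, countObjects, demoStream) are made up for illustration and are not part of PAMGUARD. It relies only on the rule that each object begins with a big-endian int32 length which counts itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ObjectSkipper {

    /** Counts objects by reading each length prefix and skipping the rest of the object. */
    static int countObjects(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int count = 0;
        while (true) {
            int objectSize;
            try {
                objectSize = data.readInt();   // big-endian int32, includes its own 4 bytes
            } catch (EOFException e) {
                break;                         // no further length prefix: stop
            }
            if (objectSize < 4) {
                throw new IOException("Corrupt object length: " + objectSize);
            }
            int toSkip = objectSize - 4;
            // skipBytes may skip fewer bytes than requested, so loop until done
            while (toSkip > 0) {
                int skipped = data.skipBytes(toSkip);
                if (skipped <= 0) {
                    throw new EOFException("File ends inside an object");
                }
                toSkip -= skipped;
            }
            count++;
        }
        return count;
    }

    /** Builds a small in-memory stream of length-prefixed dummy objects for testing. */
    static byte[] demoStream(int... payloadSizes) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int n : payloadSizes) {
            out.writeInt(n + 4);       // the length counts itself
            out.write(new byte[n]);    // dummy payload
        }
        return bytes.toByteArray();
    }
}
```

Because only the length prefix is interpreted, the same loop works for the header, footer and any module specific object, whatever their contents.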

The objects for the header and footer will be rigidly defined. However the data objects can be in any format (it being the responsibility of individual module developers to ensure backwards compatibility should anything change).

Following the standard PAMGUARD header is an optional control structure or module header. For example, a detector may wish to write out its detection parameters at this point, probably as a Java serialised object, although anything is allowed: it is the responsibility of the stream that writes the data to read it back in a sensible way.

Format for main files:

All numbers (short, int, long, float, double) are written in big-endian byte order, i.e. most significant byte first. This is the standard for the platform independent Java DataOutputStream class. Matlab can read these files by setting the machine format when opening the file, e.g. f = fopen(fileName, 'r', 'ieee-be.l64');

C programmes running on little-endian hardware (e.g. x86 PCs, whatever the operating system) would need to re-order bytes when processing the files.
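To make the byte order concrete, the sketch below writes one int with DataOutputStream and reads the same four bytes back in both orders using java.nio.ByteBuffer. The class name ByteOrderDemo and its methods are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {

    /** Returns the four bytes DataOutputStream produces for an int (most significant first). */
    static byte[] writeInt(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeInt(value);
        return bytes.toByteArray();
    }

    /** Interprets the bytes in the order they were written. */
    static int readBigEndian(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN).getInt();
    }

    /** What a reader gets if it forgets to re-order the bytes on little-endian hardware. */
    static int readLittleEndian(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

For the value 1 the bytes on disk are 00 00 00 01. Read in the wrong order they come back as 16777216, which is a quick way to recognise a byte order mistake when a length field looks absurdly large.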


 

 

 

 

Field | Format | Notes

File Header

Length of file header in bytes | Int32 | Every object will start with this number.
Object Identifier | Int32 | -1
Header / general file format | Int32 | Hope this changes very very rarely
“PAMGUARDDATA” | Char(12) | Just so it’s obvious that this really is a PAMGUARD file
PAMGUARD Version | CharUTF* | e.g. 1.8.00
PAMGUARD Branch | CharUTF* | e.g. Core, Beta, etc.
Data Date | Long (int64) | Data time at start of file in Java millis
Analysis Date | Long (int64) | Time at which analysis started (same as data time for real time)
File Start Sample | Long (int64) | Current sample number for this data stream
Module Type | CharUTF* | Module type
Module Name | CharUTF* | Module name
Stream Name | CharUTF* | Data stream name
Extra info length | Int32 | Length of additional data
Extra info | byte[] | Additional data

Module specific Control Structure

Length in File | Int32 | Length of this object = 16 + object binary length
Object Identifier | Int32 | -3
Module version Info | Int32 | Version info specific to the PAMGUARD module writing data to this stream.
Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!). Can be zero if there is no additional data
Object Data | Byte[] | Length = Object binary Length

Object 1

Length in File | Int32 | Length of this object
Object Identifier | Int32 | Class identifier, which must be unique to this data stream, not across PAMGUARD
Time milliseconds | Int32 | Timestamp in milliseconds relative to start of file “Data Date”
Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!)
Object Data | Byte[] | Length = Object binary Length

Object 2

Length in File | Int32 | Length of this object
Object Identifier | Int32 | Class identifier, which must be unique to this data stream, not across PAMGUARD
Time milliseconds | Int32 | Timestamp in milliseconds relative to start of file “Data Date”
Object binary Length | Int32 | = Length in File - 16 (a bit of redundancy!)
Object Data | Byte[] | Length = Object binary Length. Need not be same as Object 1.

Etc …

Module footer

Length in File | Int32 | Length of this object = 12 + object binary length
Object Identifier | Int32 | -4
Object binary Length | Int32 | = Length in File - 12 (a bit of redundancy!). Can be zero if there is no additional data
Object Data | Byte[] | Length = Object binary Length

File Footer

Length of footer in bytes | Int32 |
Object Identifier | Int32 | -2
Total number of objects in file | Int32 | Not counting header, control struct and footer (i.e. can be = 0)
Data Date | Long (int64) | Data time at end of file in Java millis
Analysis Date | Long (int64) | Time at which analysis ended (same as data time for real time)
File End Sample | Long (int64) | Sample number at end of file
File length | Long (int64) | Total length of the file (will be more use when this is repeated in an index file)
File End Reason | Int32 | Reason file ended

EOF

*Strings are often written with the DataOutputStream.writeUTF() function. For standard ASCII characters, this will simply be two bytes (written as a short) giving the length of the string in bytes, followed by one byte per character. Unicode characters are also supported in this format; for details see the Java documentation for DataOutputStream.writeUTF() and Wikipedia.
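The writeUTF() layout can be checked with a few lines of Java. The sketch below (class name UtfStringDemo is illustrative) encodes a string and returns the raw bytes, so the two-byte length prefix is visible, then reads it back with DataInputStream.readUTF().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class UtfStringDemo {

    /** Returns the bytes written by writeUTF: a 2-byte big-endian length, then the characters. */
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        return bytes.toByteArray();
    }

    /** Reads a string back from bytes produced by encode(). */
    static String decode(byte[] b) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(b)).readUTF();
    }
}
```

For the version string "1.8.00" this gives eight bytes: 00 06 followed by the six ASCII characters. A reader in another language can therefore read a big-endian 16-bit length and then that many bytes, at least for plain ASCII strings.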

Smaller indexing files (.pgdx) will contain the header, control structures and footer from the pgdf files, but none of the objects.
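As an illustration of the file header layout in the table above, the following sketch assembles a header in memory. It is not the PAMGUARD implementation: the class name FileHeaderSketch and the build() parameters are invented here, and it assumes that the Char(12) field is written as 12 plain ASCII bytes with no length prefix. The body is written to a temporary buffer first so that the leading length (which counts itself) is known before anything goes to the output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class FileHeaderSketch {

    static final int FILE_HEADER_ID = -1;

    /** Builds the bytes of a file header, fields in the order given in the format table. */
    static byte[] build(int fileFormat, String version, String branch,
                        long dataDate, long analysisDate, long startSample,
                        String moduleType, String moduleName, String streamName,
                        byte[] extraInfo) throws IOException {
        ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
        DataOutputStream body = new DataOutputStream(bodyBytes);
        body.writeInt(FILE_HEADER_ID);
        body.writeInt(fileFormat);
        // Assumption: Char(12) means 12 raw ASCII bytes, no length prefix
        body.write("PAMGUARDDATA".getBytes(StandardCharsets.US_ASCII));
        body.writeUTF(version);
        body.writeUTF(branch);
        body.writeLong(dataDate);       // Java millis at start of file
        body.writeLong(analysisDate);
        body.writeLong(startSample);
        body.writeUTF(moduleType);
        body.writeUTF(moduleName);
        body.writeUTF(streamName);
        body.writeInt(extraInfo.length);
        body.write(extraInfo);

        ByteArrayOutputStream all = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(all);
        out.writeInt(4 + bodyBytes.size());   // total length, including this int32
        bodyBytes.writeTo(out);
        return all.toByteArray();
    }
}
```

Writing the same bytes to both the pgdf file and its pgdx index file would keep the two headers identical by construction.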

Object identifiers

Object identifiers used by the file management system are negative:

-1 = Header

-2 = Footer

-3 = Module header

-4 = Module footer

Individual modules can use any positive number and these numbers need only be unique within the module reading and writing the data. 
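A reader could use the sign of the identifier to decide who should interpret an object. The sketch below shows one way to do this; the class name ObjectKinds and the returned strings are illustrative, and the treatment of identifier 0 (which the text above does not assign) is an assumption.

```java
final class ObjectKinds {

    static final int FILE_HEADER = -1;
    static final int FILE_FOOTER = -2;
    static final int MODULE_HEADER = -3;
    static final int MODULE_FOOTER = -4;

    /** Describes which part of the system an object identifier belongs to. */
    static String describe(int objectId) {
        switch (objectId) {
            case FILE_HEADER:   return "file header";
            case FILE_FOOTER:   return "file footer";
            case MODULE_HEADER: return "module header";
            case MODULE_FOOTER: return "module footer";
            default:
                if (objectId > 0) {
                    return "module data object";   // meaning is defined by the owning module
                }
                return "unassigned";                // 0 and other negatives are not defined
        }
    }
}
```

Anything reported as a module data object would be handed, unopened, to the module that owns the data stream.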

Example data formats for specific modules

Click Detector Version 0

Field | Format | Notes
Time millis | Int64 |
Start sample | Int64 |
Channel map | Int32 |
Triggered channels | Int32 |
Num delay measurements | Int16 |
Delay measurements | Float[] | Usually nChan*(nChan-1)/2
Num angle measurements | Int16 | (0, 1 or 2)
Angle measurements | Float[] |
Duration (samples) | Int16 |
WaveData | Int16[][] |

Click Detector Version 1

Field | Format | Notes
Time millis | Int64 |
Start sample | Int64 |
Channel map | Int32 |
Triggered channels | Int32 |
Click Type | Int16 |
Num delay measurements | Int16 | Usually nChan*(nChan-1)/2
Delay measurements | Float[] |
Num angle measurements | Int16 | (0, 1 or 2)
Angle measurements | Float[] |
Duration (samples) | Int16 |
Wave Max Amplitude | float | Max amplitude of wave data.
WaveData | Int8[][] | Wave data scaled by 127/max amplitude so that it uses the full dynamic range of 8 bit data
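The 8 bit scaling used for the Version 1 wave data can be sketched as below for a single channel. The class name WaveScaling and its methods are hypothetical, and the use of double[] for the unscaled samples is an assumption, since the in-memory type of the waveform is not stated here.

```java
final class WaveScaling {

    /** Largest absolute sample value; this is what would be stored as Wave Max Amplitude. */
    static double maxAmplitude(double[] wave) {
        double max = 0.0;
        for (double v : wave) {
            max = Math.max(max, Math.abs(v));
        }
        return max;
    }

    /** Scales samples by 127/maxAmplitude and rounds to int8. */
    static byte[] compress(double[] wave, double maxAmplitude) {
        byte[] out = new byte[wave.length];
        if (maxAmplitude == 0.0) {
            return out;                       // silent waveform: all zeros
        }
        double scale = 127.0 / maxAmplitude;
        for (int i = 0; i < wave.length; i++) {
            long scaled = Math.round(wave[i] * scale);
            // clamp in case maxAmplitude was not the true maximum
            out[i] = (byte) Math.max(-127, Math.min(127, scaled));
        }
        return out;
    }

    /** Recovers approximate sample values from the int8 data and the stored maximum. */
    static double[] expand(byte[] data, double maxAmplitude) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i] * maxAmplitude / 127.0;
        }
        return out;
    }
}
```

The largest sample always maps to +127 or -127, so quiet and loud clicks both use the full 8 bit range; the price is a quantisation error of up to about half of maxAmplitude/127 per sample.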

 

 

4        Binary Storage Manager

Binary Storage Manager (BSM) is a plugin module in the utilities group (next to the database, since the two perform a similar function). Only one BSM module will be allowed.

BSM will control:

1.     The folder data are stored in.

2.     Opening new files at program start and closing them at program end. Will set file names based on the data stream name and the time.

3.     Reopening files on the hour (or at some other set time interval; e.g. when analysing wav file data offline, it may make a new file every time a new wav file starts).

4.     Reloading data for the PAMGUARD viewer.

5.     Rewriting data files changed during offline analysis
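Two of the BSM tasks listed above, naming files from the stream name and time and reopening them on the hour, can be sketched as follows. Everything here is hypothetical: the class FileRollover, its methods, and in particular the file name pattern are invented for illustration, since the actual naming scheme is not defined in this document. Times are Java millis and hours are taken in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class FileRollover {

    static final long HOUR_MS = 3_600_000L;

    // Hypothetical timestamp pattern for file names
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss").withZone(ZoneOffset.UTC);

    /** Start of the next whole hour strictly after the given time, in Java millis. */
    static long nextHourBoundary(long timeMillis) {
        return Math.floorDiv(timeMillis, HOUR_MS) * HOUR_MS + HOUR_MS;
    }

    /** Hypothetical naming scheme: stream name, UTC start time, .pgdf extension. */
    static String fileName(String streamName, long startMillis) {
        return streamName.replace(' ', '_') + "_"
                + STAMP.format(Instant.ofEpochMilli(startMillis)) + ".pgdf";
    }
}
```

A writer would compare the time of each new data unit against the stored boundary and, once it is reached, write the footer, close the file and open a new one named from the new start time.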

 

5        PAMGUARD Modules output data

For data streams: modify PamDataBlock, and also make it possible for other objects to send data to the BSM.

Need to:

1.     register with BSM

2.     hand BSM settings data, which will be requested each time a new file is created (see format above).

3.     send BSM references to all new data units

4.     know how to recreate data units from data that have been read back in with PAMGUARD viewer.

PAMGUARD Data Units

1.     Need functions to return their binary data object (which may just be a serialised version of themselves).

2.     Need to be able to identify themselves in agreement with a central registry
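The requirements above amount to a small contract between each data stream and the BSM. One possible shape for it is sketched below. None of these names exist in PAMGUARD: BinaryDataSource, BinaryStoreSketch and all their methods are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract a data stream would fulfil in order to use the BSM. */
interface BinaryDataSource<T> {
    String streamName();
    byte[] moduleHeaderData();               // settings data, requested for every new file
    int objectId(T unit);                    // positive identifier, unique within this stream
    byte[] toBytes(T unit);                  // binary data object for one data unit
    T fromBytes(int objectId, byte[] data);  // recreate a data unit for the viewer
}

/** Hypothetical registry side of the BSM: one data source per stream name. */
final class BinaryStoreSketch {

    private final Map<String, BinaryDataSource<?>> sources = new LinkedHashMap<>();

    /** Registers a source; a second source with the same stream name is rejected. */
    void register(BinaryDataSource<?> source) {
        if (sources.containsKey(source.streamName())) {
            throw new IllegalArgumentException("Stream already registered: " + source.streamName());
        }
        sources.put(source.streamName(), source);
    }

    int registeredCount() {
        return sources.size();
    }
}
```

Keeping toBytes() and fromBytes() in the same interface puts reading and writing of a module's format in one place, which should help module developers keep old files readable when the format changes.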

 

PAMGUARD Settings

PAMGUARD Settings are written to the binary store whenever PAMGUARD starts from the Start menu or from the Network controller. They are not written when PAMGUARD restarts due to a buffer overflow in acquisition. Settings are written to .psfx files. These encapsulate the current psf format used for more general settings, but individual serialised Java objects are wrapped up in a similar way to other binary data so that other programmes (e.g. Matlab) can at least read a list of modules.