Binary file structure for PAMGUARD detector output.

Douglas Gillespie, 2010

1 Introduction

The primary storage site for PAMGUARD output data is a relational database (currently either MS Access or MySQL although other types may be added in the future).

Before 2010, the only other storage solution came from the click detector, which writes binary files in the RainbowClick (*.clk) file format, since click data is often additionally processed offline using RainbowClick. Over the next year or so, this offline analysis functionality will be added to PAMGUARD, making RainbowClick redundant. Long term support for the .clk format is therefore not required.

Current annoyances are

1. The database is not suitable for storage of variable record length data (e.g. a snip of click waveform as used by RainbowClick or the time/amplitude/frequency contour of a whistle)

2. The RainbowClick file binary format is awful – works OK from C, but is a nightmare in Java and is not practical to evolve.

3. The click files need to be written with a random access file writer which is more complicated than a simple data stream output (most Java output types)

4. Databases often have limited size

Therefore:

1. We need a replacement for the .clk file format.

2. We need something similar for other PAMGUARD output tasks.

Therefore we need some new binary format storage solution which can be used by many different PAMGUARD modules for storing detector (and other) data.

2 Binary storage options

Java serialisation: Fast and easy to write out Java objects, but the files cannot be opened with anything but Java (e.g. it would not be possible to write a Matlab function to access the data).

Pure binary storage (like .clk files): Need to translate the data in each stored object from its Java form into a byte array prior to storage and convert it back the other way afterwards. However, the data would be readable in Matlab or any other program so long as the format of each object is known.

3 Solution

A common file format for all PAMGUARD module output: pgdf, for PAMGUARD Data File.

Output (from PAMGUARD) will assume one way output and input streams rather than random access, although other programs could of course open the files in any way they wish. This means that the files will know their start time, which will be encoded in both the file name and the header, but the end time will only be accessible as the last object in the file – which may take time to read.

Smaller index files (.pgdx) will be written to accompany each pgdf file which contain just the start header and end footer of each file, so that PAMGUARD can rapidly query a file repository to see what’s in it.

Files will have a common structure, which PAMGUARD will always be able to understand, although specific objects within the file will require information specific to a particular module.

For a specific PAMGUARD configuration, all output files will be stored in the same master directory, although it’s possible that this may contain multiple sub folders, e.g. one sub folder per day.

There will be a 1:1 correspondence between PAMGUARD data blocks and binary data streams (i.e. all objects in a given file should be of the same type).

Output will be a series of binary objects. Every object will start with an int32 (a 4 byte integer, which is an int in Java) giving the size of that object in bytes. This number includes itself in the size calculation. So it will always be possible to skip through the file using the following pseudocode:

While not eof
    objectSize = ReadInt32()
    SkipForwardBytes(objectSize - 4)
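The pseudocode above could be written in Java roughly as follows. This is a minimal sketch, not PAMGUARD code: the class and method names (ObjectSkipper, countObjects) are made up for illustration, and the two objects built in main are dummy data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
class ObjectSkipper {

    /** Counts objects by reading each int32 length and skipping the rest of the object. */
    static int countObjects(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        int count = 0;
        while (true) {
            int objectSize;
            try {
                objectSize = in.readInt(); // big-endian int32; the size includes these 4 bytes
            } catch (EOFException endOfFile) {
                break; // no further length field could be read
            }
            if (objectSize < 4) {
                throw new IOException("Invalid object size " + objectSize);
            }
            int remaining = objectSize - 4;
            while (remaining > 0) {
                // skipBytes may skip fewer bytes than requested, so loop until done
                int skipped = in.skipBytes(remaining);
                if (skipped <= 0) {
                    throw new EOFException("File ends inside an object");
                }
                remaining -= skipped;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two dummy objects, 12 bytes and 8 bytes long
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(12); out.writeInt(-1); out.writeInt(0);
        out.writeInt(8);  out.writeInt(-2);
        System.out.println(countObjects(new ByteArrayInputStream(bytes.toByteArray())));
    }
}
```

Because only the leading length field is used, a reader written this way does not need to understand any module specific object in order to step over it.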

The objects for the header and footer will be rigidly defined. However the data objects can be in any format (it being the responsibility of individual module developers to ensure backwards compatibility should anything change).

Following the standard PAMGUARD header is an optional control structure or module header. For example, a detector may wish to write out its detection parameters at this point, probably as a Java serialised object, although anything is allowed: it is the responsibility of the stream that writes the data to read it back in a sensible way.

Format for main files:

All numbers (short, int, long, float, double) are written in big-endian byte order, i.e. most significant byte first. This is the standard for the platform independent Java DataOutputStream class. Matlab can read these files by setting the machine format, e.g. f = fopen(fileName, 'r', 'ieee-be.l64');

C programmes running on little-endian hardware (which includes Intel based Windows, Linux and Mac computers) would need to re-order bytes.
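To illustrate the byte order, the sketch below writes one int32 with DataOutputStream and then interprets the same four bytes in both byte orders using java.nio.ByteBuffer. The class and method names are hypothetical and chosen only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, for illustration only.
class ByteOrderDemo {

    /** Writes one int32 the way DataOutputStream does: most significant byte first. */
    static byte[] writeBigEndian(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        }
        return bytes.toByteArray();
    }

    /** Interprets the first four bytes of the array in the given byte order. */
    static int readInt(byte[] fourBytes, ByteOrder order) {
        return ByteBuffer.wrap(fourBytes).order(order).getInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = writeBigEndian(0x01020304);
        // Bytes as they appear in the file
        System.out.printf("%02x %02x %02x %02x%n", raw[0], raw[1], raw[2], raw[3]);
        // Correct interpretation, then what a reader sees if it forgets to re-order bytes
        System.out.println(Integer.toHexString(readInt(raw, ByteOrder.BIG_ENDIAN)));
        System.out.println(Integer.toHexString(readInt(raw, ByteOrder.LITTLE_ENDIAN)));
    }
}
```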

Object	Field	Format	Notes
File Header	Length of file header in bytes	Int32	Every object will start with this number.
	Object Identifier	Int32	-1
	Header / general file Format	Int32	Hope this changes very very rarely
	“PAMGUARDDATA”	Char(12)	Just so it’s obvious that this really is a PAMGUARD file
	PAMGUARD Version	CharUTF*	e.g. 1.8.00
	PAMGUARD Branch	CharUTF*	e.g. Core, Beta, etc.
	Data Date	Long (int64)	Data time at start of file in Java millis
	Analysis Date	Long (int64)	Time at which analysis started (same as data time for real time)
	File Start Sample	Long (int64)	Current sample number for this data stream
	Module type	CharUTF*	Module type
	Module Name	CharUTF*	Module name
	Stream Name	CharUTF*	Data stream name
	Extra info length	Int32	Length of additional data
	Extra info	byte[]	Additional data
Module specific Control Structure	Length in File	Int32	Length of this object = 16 + object binary length
	Object Identifier	Int32	-3
	Module version Info	Int32	Version info specific to the pamguard module writing data to this stream.
	Object binary Length	Int32	= Length in File – 16 (a bit of redundancy) ! Can be zero if there is no additional data
	Object Data	Byte[]	Length = Object binary Length
Object 1	Length in File	Int32	Length of this object
	Object Identifier	Int32	class identifier which must be unique to this data stream, not across PAMGUARD
	Time milliseconds	Int32	Timestamp in milliseconds relative to start of file “Data Date”
	Object binary Length	Int32	= Length in File – 12 (a bit of redundancy) !
	Object Data	Byte[]	Length = Object binary Length
Object 2	Length in File	Int32	Length of this object
	Object Identifier	Int32	class identifier which must be unique to this data stream, not across PAMGUARD
	Time milliseconds	Int32	Timestamp in milliseconds relative to start of file “Data Date”
	Object binary Length	Int32	= Length in File – 12 (a bit of redundancy) !
	Object Data	Byte[]	Length = Object binary Length. Need not be same as Object 1.
Etc …
Module footer	Length in File	Int32	Length of this object = 16 + object binary length
	Object Identifier	Int32	-4
	Object binary Length	Int32	= Length in File – 16 (a bit of redundancy) ! Can be zero if there is no additional data
	Object Data	Byte[]	Length = Object binary Length
File Footer	Length of footer in bytes	Int32
	Object identifier	Int32	-2
	Total number of objects in file	Int32	Not counting header, control struct and footer (i.e. can be = 0)
	Data Date	Long (int64)	Data time at end of file in Java millis
	Analysis Date	Long (int64)	Time at which analysis ended (same as data time for real time)
	File End Sample	Long (int64)	Sample number at end of file
	File length	Long (int64)	Total length of the file (will be more use when this is repeated in an index file)
	File End Reason	Int32	Reason file ended
EOF

*Strings are often written with the DataOutputStream.writeUTF() function. For standard ASCII characters, this will simply be two bytes (written as a short) giving the length of the string in bytes, followed by one byte per character. Unicode characters are also supported in this format (Java's modified UTF-8); for details see the Java documentation for DataOutput.writeUTF and Wikipedia.
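As an illustration of the file header layout in the table above, here is a minimal sketch that assembles a header with DataOutputStream, using writeUTF() for the CharUTF fields. The class name FileHeaderSketch, the method build and the argument values in main (header format 1, stream name "Clicks", etc.) are assumptions made for the example and are not taken from PAMGUARD. The body is written to a buffer first so that the leading length field can be filled in.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the file header layout, for illustration only.
class FileHeaderSketch {
    static final int FILE_HEADER_ID = -1;

    static byte[] build(int headerFormat, String version, String branch,
                        long dataDate, long analysisDate, long startSample,
                        String moduleType, String moduleName, String streamName,
                        byte[] extraInfo) throws IOException {
        // Everything after the length field, in table order
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(body);
        out.writeInt(FILE_HEADER_ID);
        out.writeInt(headerFormat);
        out.write("PAMGUARDDATA".getBytes(StandardCharsets.US_ASCII)); // Char(12)
        out.writeUTF(version);      // 2 byte length, then the characters
        out.writeUTF(branch);
        out.writeLong(dataDate);
        out.writeLong(analysisDate);
        out.writeLong(startSample);
        out.writeUTF(moduleType);
        out.writeUTF(moduleName);
        out.writeUTF(streamName);
        out.writeInt(extraInfo.length);
        out.write(extraInfo);
        out.flush();

        // Prepend the total length, which counts its own 4 bytes
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        DataOutputStream headerOut = new DataOutputStream(header);
        headerOut.writeInt(4 + body.size());
        headerOut.write(body.toByteArray());
        headerOut.flush();
        return header.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        long now = System.currentTimeMillis();
        byte[] header = build(1, "1.8.00", "Core", now, now, 0L,
                "Click Detector", "Click Detector", "Clicks", new byte[0]);
        System.out.println("Header length in bytes: " + header.length);
    }
}
```

With ASCII strings, each CharUTF field occupies two bytes plus one byte per character, so "1.8.00" takes 8 bytes in the file.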

Smaller indexing files (.pgdx) will contain the header, control structures and footer from the pgdf files, but none of the objects.

Object identifiers

Object identifiers used by the file management system are negative:

-1 = Header

-2 = Footer

-3 = Module header

-4 = Module footer

Individual modules can use any positive number and these numbers need only be unique within the module reading and writing the data.
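Since file management objects have negative identifiers and module data objects have positive ones, the content of an index file could in principle be derived from a data file by keeping only the objects with a negative identifier. The sketch below shows that idea on an in-memory byte array. It is an assumption about how such a filter might look, not a description of how PAMGUARD actually writes .pgdx files, and the names IndexSketch and extractIndex are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch, for illustration only.
class IndexSketch {

    /** Copies only the objects with negative identifiers (headers and footers). */
    static byte[] extractIndex(byte[] dataFile) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(dataFile));
        ByteArrayOutputStream index = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(index);
        int remaining = dataFile.length;
        while (remaining > 0) {
            int length = in.readInt();      // total object length, including this field
            int identifier = in.readInt();  // negative = file management object
            if (length < 8 || length > remaining) {
                throw new IOException("Bad object length " + length);
            }
            byte[] rest = new byte[length - 8];
            in.readFully(rest);
            if (identifier < 0) {
                out.writeInt(length);
                out.writeInt(identifier);
                out.write(rest);
            }
            remaining -= length;
        }
        return index.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Dummy file: header-like object, one data object, footer-like object
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(8);  out.writeInt(-1);
        out.writeInt(12); out.writeInt(1); out.writeInt(42);
        out.writeInt(8);  out.writeInt(-2);
        byte[] index = extractIndex(bytes.toByteArray());
        System.out.println("Index size in bytes: " + index.length);
    }
}
```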

Example data formats for specific modules

Click Detector Version 0

Time millis	Int64
Start sample	Int64
Channel map	Int32
Triggered channels	Int32
Num delay measurements	Int16
Delay measurements	Float[]	Usually nChan*(nChan-1)/2
Num angle measurements	Int16	(0, 1 or 2)
Angle measurements	Float[]
Duration (samples)	Int16
WaveData	Int16[][]

Click Detector Version 1

Time millis	Int64
Start sample	Int64
Channel map	Int32
Triggered channels	Int32
Click Type	Int16
Num delay measurements	Int16	Usually nChan*(nChan-1)/2
Delay measurements	Float[]
Num angle measurements	Int16	(0, 1 or 2)
Angle measurements	Float[]
Duration (samples)	Int16
Wave Max Amplitude	float	Max amplitude of wave data.
WaveData	Int8[][]	Wave data scaled by 127/max amplitude so that it uses the full dynamic range of 8 bit data
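The 8 bit scaling used in Version 1 can be sketched as follows for a single channel. This assumes the waveform is held as a double array before storage, which is not stated above, and the names WaveScaling, maxAbs, toInt8 and fromInt8 are hypothetical. The stored Wave Max Amplitude value is what allows a reader to undo the scaling.

```java
// Hypothetical sketch of the 8 bit wave scaling, for illustration only.
class WaveScaling {

    /** Largest absolute sample value in the waveform. */
    static double maxAbs(double[] wave) {
        double max = 0.0;
        for (double sample : wave) {
            max = Math.max(max, Math.abs(sample));
        }
        return max;
    }

    /** Scales samples by 127 / maxAmplitude so the peak maps to +/-127. */
    static byte[] toInt8(double[] wave, double maxAmplitude) {
        byte[] scaled = new byte[wave.length];
        if (maxAmplitude <= 0.0) {
            return scaled; // silent waveform: all zeros
        }
        double scale = 127.0 / maxAmplitude;
        for (int i = 0; i < wave.length; i++) {
            scaled[i] = (byte) Math.round(wave[i] * scale);
        }
        return scaled;
    }

    /** Reverses the scaling using the stored maximum amplitude. */
    static double[] fromInt8(byte[] scaled, double maxAmplitude) {
        double[] wave = new double[scaled.length];
        for (int i = 0; i < scaled.length; i++) {
            wave[i] = scaled[i] * maxAmplitude / 127.0;
        }
        return wave;
    }

    public static void main(String[] args) {
        double[] wave = {0.5, -1.0, 0.25};
        double max = maxAbs(wave);
        byte[] stored = toInt8(wave, max);
        double[] restored = fromInt8(stored, max);
        for (int i = 0; i < wave.length; i++) {
            System.out.println(wave[i] + " -> " + stored[i] + " -> " + restored[i]);
        }
    }
}
```

The round trip is lossy: each restored sample can differ from the original by up to half a quantisation step, i.e. max amplitude / 254.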

4 Binary Storage Manager

Binary Storage Manager (BSM) is a plugin module in the utilities group (next to the database, since the two perform a similar function). Only one BSM module will be allowed.

BSM will control:

1. The folder data are stored in.

2. Opening new files at program start and closing them at program end. Will set file names based on the data stream name and the time.

3. Reopening files on the hour (or at some other set time interval; e.g. when analysing wav file data offline, it may make a new file every time a new wav file starts).

4. Reloading data for the PAMGUARD viewer.

5. Rewriting data files changed during offline analysis
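Items 2 and 3 of this list could be sketched as below. The file name pattern (stream name, then a UTC time stamp, then .pgdf) and the names FileRolloverSketch, fileName and needsNewFile are purely illustrative assumptions; the actual naming scheme is not defined in this document.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of file naming and hourly rollover, for illustration only.
class FileRolloverSketch {
    static final long MILLIS_PER_HOUR = 3_600_000L;

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss").withZone(ZoneOffset.UTC);

    /** Builds a file name from the data stream name and the file start time (assumed pattern). */
    static String fileName(String streamName, long startMillis) {
        // Replace anything that is not a letter or digit so the name is safe on all platforms
        String safeName = streamName.replaceAll("[^A-Za-z0-9]+", "_");
        return safeName + "_" + STAMP.format(Instant.ofEpochMilli(startMillis)) + ".pgdf";
    }

    /** True when the current time has crossed into a different UTC hour than the file start. */
    static boolean needsNewFile(long fileStartMillis, long nowMillis) {
        return Math.floorDiv(nowMillis, MILLIS_PER_HOUR)
                != Math.floorDiv(fileStartMillis, MILLIS_PER_HOUR);
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        System.out.println(fileName("Click Detector Clicks", start));
        System.out.println(needsNewFile(start, start + MILLIS_PER_HOUR));
    }
}
```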

5 PAMGUARD Modules output data

For data streams, modify PamDataBlock, and also make it possible for other objects to send data to the BSM.

Modules need to:

1. register with BSM

2. hand BSM settings data, which will be requested each time a new file is created (see format above).

3. send BSM references to all new data units

4. know how to recreate data units from data that have been read back in with PAMGUARD viewer.

PAMGUARD Data Units

1. Need functions to return their binary data object (which may just be a serialised version of themselves).

2. Need to be able to identify themselves in agreement with a central registry
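One possible shape for the requirements listed in this section is an interface that each data stream implements and hands to the BSM. Everything in the sketch below is hypothetical: the interface BinaryDataSource, its method names and the toy ToneDetection data unit are invented for illustration and are not part of PAMGUARD.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a contract between a data stream and the BSM.
class BinarySourceSketch {

    interface BinaryDataSource<T> {
        String getStreamName();                 // used when naming files
        int getModuleVersion();                 // written in the module header
        byte[] getModuleHeaderData();           // settings, requested for each new file
        int getObjectId(T unit);                // positive, unique within this stream
        byte[] pack(T unit) throws IOException; // data unit to binary object data
        T unpack(int objectId, byte[] data) throws IOException; // used by the viewer
    }

    /** Toy data unit, for illustration only. */
    static class ToneDetection {
        final long startSample;
        final float frequencyHz;

        ToneDetection(long startSample, float frequencyHz) {
            this.startSample = startSample;
            this.frequencyHz = frequencyHz;
        }
    }

    static class ToneDataSource implements BinaryDataSource<ToneDetection> {
        static final int TONE_ID = 1;

        public String getStreamName() { return "Tones"; }
        public int getModuleVersion() { return 0; }
        public byte[] getModuleHeaderData() { return new byte[0]; } // no settings in this toy
        public int getObjectId(ToneDetection unit) { return TONE_ID; }

        public byte[] pack(ToneDetection unit) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(unit.startSample);
            out.writeFloat(unit.frequencyHz);
            return bytes.toByteArray();
        }

        public ToneDetection unpack(int objectId, byte[] data) throws IOException {
            if (objectId != TONE_ID) {
                throw new IOException("Unknown object id " + objectId);
            }
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return new ToneDetection(in.readLong(), in.readFloat());
        }
    }

    public static void main(String[] args) throws IOException {
        ToneDataSource source = new ToneDataSource();
        ToneDetection unit = new ToneDetection(48000L, 8000.5f);
        byte[] packed = source.pack(unit);
        ToneDetection restored = source.unpack(source.getObjectId(unit), packed);
        System.out.println(packed.length + " bytes, frequency " + restored.frequencyHz);
    }
}
```

Writing the fields explicitly, rather than serialising the Java object, keeps the data readable from Matlab or C at the cost of a hand written pack and unpack pair per data unit type.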

PAMGUARD Settings

PAMGUARD Settings are written to the binary store whenever PAMGUARD starts from the Start menu or from the Network controller. They are not written when PAMGUARD restarts due to a buffer overflow in acquisition. Settings are written to .psfx files. These encapsulate the current psf format used for more general settings, but individual serialised Java objects are wrapped up in a similar way to other binary data so that other programmes (e.g. Matlab) can at least read a list of modules.