3. Data loader

3.1. Loader Interface

Interface to select a specified loader

sed.loader.loader_interface.get_loader(loader_name, config=None)

Helper function to get the loader object from its given name.

Parameters:
  • loader_name (str) – Name of the loader

  • config (dict, optional) – Configuration dictionary. Defaults to None.

Raises:

ValueError – Raised if the loader cannot be found.

Returns:

The loader object.

Return type:

BaseLoader

sed.loader.loader_interface.get_names_of_all_loaders()

Helper function to populate a list of all available loaders.

Returns:

List of all detected loader names.

Return type:

List[str]
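The two helpers above can be pictured with a small self-contained sketch. It assumes (the reference does not state this) that loaders are discovered by scanning for sub-folders containing a loader.py file; find_loader_names and pick_loader are hypothetical stand-ins and are not part of sed.

```python
import tempfile
from pathlib import Path
from typing import List


def find_loader_names(package_dir: Path) -> List[str]:
    """Return the sorted names of all sub-folders that contain a loader.py."""
    return sorted(p.parent.name for p in package_dir.glob("*/loader.py"))


def pick_loader(loader_name: str, package_dir: Path) -> Path:
    """Return the path of the loader module, or raise ValueError if unknown."""
    if loader_name not in find_loader_names(package_dir):
        raise ValueError(f"Loader '{loader_name}' not found.")
    return package_dir / loader_name / "loader.py"


# Build a throwaway folder layout to exercise the two helpers.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("base", "generic", "mpes"):
        (root / name).mkdir()
        (root / name / "loader.py").touch()
    (root / "utils").mkdir()  # no loader.py inside, so it is not a loader

    names = find_loader_names(root)
    chosen = pick_loader("mpes", root).relative_to(root).as_posix()
    try:
        pick_loader("unknown", root)
        lookup_error = None
    except ValueError as exc:
        lookup_error = str(exc)
```

The unknown name ends in a ValueError, mirroring the behaviour documented for get_loader.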

3.2. Abstract BaseLoader

The abstract class off of which to implement loaders.

class sed.loader.base.loader.BaseLoader(config=None)

Bases: ABC

The abstract class off of which to implement loaders.

The reader’s folder name is the identifier. For this BaseLoader, with filename base/loader.py, the ID becomes ‘base’.

Parameters:
  • config (dict, optional) – Config dictionary. Defaults to None.

  • meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.

supported_file_types: typing.List[str] = []
abstract read_dataframe(files=None, folders=None, runs=None, ftype=None, metadata=None, collect_metadata=False, **kwds)

Reads data from given files, folder, or runs and returns a dask dataframe and corresponding metadata.

Parameters:
  • files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.

  • folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.

  • runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.

  • ftype (str, optional) – File type to read (‘parquet’, ‘json’, ‘csv’, etc). If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to None.

  • metadata (dict, optional) – Manual metadata dictionary. Auto-generated metadata will be added to it. Defaults to None.

  • collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.

  • **kwds – keyword arguments. See description in respective loader.

Returns:

Dask dataframe, timed dataframe and metadata read from specified files.

Return type:

Tuple[ddf.DataFrame, dict]

abstract get_files_from_run_id(run_id, folders=None, extension=None, **kwds)

Locate the files for a given run identifier.

Parameters:
  • run_id (str) – The run identifier to locate.

  • folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to None.

  • extension (str, optional) – The file extension. Defaults to None.

  • kwds – Keyword arguments

Returns:

List of files for the given run.

Return type:

List[str]

abstract get_count_rate(fids=None, **kwds)

Create count rate data for the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

Arrays containing countrate and seconds into the scan.

Return type:

Tuple[np.ndarray, np.ndarray]

abstract get_elapsed_time(fids=None, **kwds)

Return the elapsed time in the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

The elapsed time in the files in seconds.

Return type:

float

sed.loader.base.loader.LOADER

alias of BaseLoader
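As an illustration of the abstract-class pattern described in this section, the following sketch defines a stand-in abstract loader and a toy subclass. SketchBaseLoader and ToyLoader are hypothetical names and do not reproduce the real sed BaseLoader; only one of the abstract methods is mirrored to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class SketchBaseLoader(ABC):
    """Stand-in for an abstract loader; NOT the real sed BaseLoader."""

    supported_file_types: List[str] = []

    def __init__(self, config: Optional[dict] = None):
        self._config = config if config is not None else {}
        self.files: List[str] = []

    @abstractmethod
    def get_elapsed_time(self, fids: Optional[Sequence[int]] = None, **kwds) -> float:
        """Return the elapsed time in seconds for the selected files."""


class ToyLoader(SketchBaseLoader):
    """Hypothetical loader that reads per-file durations from its config."""

    supported_file_types = ["toy"]

    def get_elapsed_time(self, fids=None, **kwds) -> float:
        durations = self._config.get("durations", [])
        if fids is None:
            fids = range(len(durations))  # default: all file ids
        return float(sum(durations[i] for i in fids))


loader = ToyLoader(config={"durations": [1.5, 2.0, 0.5]})
total = loader.get_elapsed_time()
partial = loader.get_elapsed_time(fids=[0, 2])

try:
    SketchBaseLoader()  # abstract classes cannot be instantiated
    abstract_ok = True
except TypeError:
    abstract_ok = False
```

A concrete loader has to implement every abstract method before it can be instantiated, which is why the direct instantiation of the stand-in base class fails.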

3.3. GenericLoader

module sed.loader.mpes, code for loading hdf5 files delayed into a dask dataframe. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig

class sed.loader.generic.loader.GenericLoader(config=None)

Bases: BaseLoader

Dask implementation of the Loader. Reads from various file types using the utilities of Dask.

Parameters:
  • config (dict, optional) – Config dictionary. Defaults to None.

  • meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.

supported_file_types: typing.List[str] = ['parquet', 'csv', 'json']
read_dataframe(files=None, folders=None, runs=None, ftype='parquet', metadata=None, collect_metadata=False, **kwds)

Read stored files from a folder into a dataframe.

Parameters:
  • files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.

  • folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.

  • runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.

  • ftype (str, optional) – File type to read (‘parquet’, ‘json’, ‘csv’, etc). If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to “parquet”.

  • metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.

  • collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.

  • **kwds – Keyword arguments. See the keyword arguments for the specific file parser in the dask.dataframe module.

Raises:
  • ValueError – Raised if neither files nor folder is provided.

  • FileNotFoundError – Raised if the files or folder cannot be found.

  • ValueError – Raised if the file type is not supported.

Returns:

Dask dataframe, timed dataframe and metadata read from specified files.

Return type:

Tuple[ddf.DataFrame, dict]

get_files_from_run_id(run_id, folders=None, extension=None, **kwds)

Locate the files for a given run identifier.

Parameters:
  • run_id (str) – The run identifier to locate.

  • folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to None.

  • extension (str, optional) – The file extension. Defaults to None.

  • kwds – Keyword arguments

Returns:

Path to the location of run data.

Return type:

str

get_count_rate(fids=None, **kwds)

Create count rate data for the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

Arrays containing countrate and seconds into the scan.

Return type:

Tuple[np.ndarray, np.ndarray]

get_elapsed_time(fids=None, **kwds)

Return the elapsed time in the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

The elapsed time in the files in seconds.

Return type:

float

files: List[str]
runs: List[str]
metadata: Dict[Any, Any]
sed.loader.generic.loader.LOADER

alias of GenericLoader

3.4. MpesLoader

module sed.loader.mpes, code for loading hdf5 files delayed into a dask dataframe. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig

sed.loader.mpes.loader.hdf5_to_dataframe(files, group_names=None, alias_dict=None, time_stamps=False, time_stamp_alias='timeStamps', ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp', **kwds)

Function to read a selection of hdf5-files, and generate a delayed dask dataframe from provided groups in the files. Optionally, aliases can be defined.

Parameters:
  • files (List[str]) – A list of the file paths to load.

  • group_names (List[str], optional) – hdf5 group names to load. Defaults to load all groups containing “Stream”

  • alias_dict (Dict[str, str], optional) – Dictionary of aliases for the dataframe columns. Keys are the hdf5 groupnames, and values the aliases. If an alias is not found, its group name is used. Defaults to read the attribute “Name” from each group.

  • time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.

  • time_stamp_alias (str) – Alias name for the timestamp column. Defaults to “timeStamps”.

  • ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.

  • first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.

Returns:

The delayed Dask DataFrame

Return type:

ddf.DataFrame

sed.loader.mpes.loader.hdf5_to_timed_dataframe(files, group_names=None, alias_dict=None, time_stamps=False, time_stamp_alias='timeStamps', ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp', **kwds)

Function to read a selection of hdf5-files, and generate a delayed dask dataframe from provided groups in the files. Optionally, aliases can be defined. Returns a dataframe for evenly spaced time intervals.

Parameters:
  • files (List[str]) – A list of the file paths to load.

  • group_names (List[str], optional) – hdf5 group names to load. Defaults to load all groups containing “Stream”

  • alias_dict (Dict[str, str], optional) – Dictionary of aliases for the dataframe columns. Keys are the hdf5 groupnames, and values the aliases. If an alias is not found, its group name is used. Defaults to read the attribute “Name” from each group.

  • time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.

  • time_stamp_alias (str) – Alias name for the timestamp column. Defaults to “timeStamps”.

  • ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.

  • first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.

Returns:

The delayed Dask DataFrame

Return type:

ddf.DataFrame

sed.loader.mpes.loader.get_groups_and_aliases(h5file, seach_pattern=None, alias_key='Name')

Read groups and aliases from a provided hdf5 file handle

Parameters:
  • h5file (h5py.File) – The hdf5 file handle

  • seach_pattern (str, optional) – Search pattern to select groups. Defaults to include all groups.

  • alias_key (str, optional) – Attribute key where aliases are stored. Defaults to “Name”.

Returns:

The list of groupnames and the alias dictionary parsed from the file

Return type:

Tuple[List[str], Dict[str, str]]

sed.loader.mpes.loader.hdf5_to_array(h5file, group_names, data_type='float32', time_stamps=False, ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp')

Reads the content of the given groups in an hdf5 file, and returns a 2-dimensional array with the corresponding values.

Parameters:
  • h5file (h5py.File) – hdf5 file handle to read from

  • group_names (str) – group names to read

  • data_type (str, optional) – Data type of the output data. Defaults to “float32”.

  • time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.

  • ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.

  • first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.

Returns:

The 2-dimensional data array containing the values of the groups.

Return type:

np.ndarray

sed.loader.mpes.loader.hdf5_to_timed_array(h5file, group_names, data_type='float32', time_stamps=False, ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp')

Reads the content of the given groups in an hdf5 file, and returns a timed version of a 2-dimensional array with the corresponding values.

Parameters:
  • h5file (h5py.File) – hdf5 file handle to read from

  • group_names (str) – group names to read

  • data_type (str, optional) – Data type of the output data. Defaults to “float32”.

  • time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.

  • ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.

  • first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.

Returns:

The array of values at evenly spaced time intervals obtained from the ms_markers.

Return type:

np.ndarray

sed.loader.mpes.loader.get_attribute(h5group, attribute)

Reads, decodes and returns an attribute from an hdf5 group.

Parameters:
  • h5group (h5py.Group) – The hdf5 group to read from

  • attribute (str) – The name of the attribute

Returns:

The parsed attribute data

Return type:

str

sed.loader.mpes.loader.get_count_rate(h5file, ms_markers_group='msMarkers')

Create count rate in the file from the msMarker column.

Parameters:
  • h5file (h5py.File) – The h5file from which to get the count rate.

  • ms_markers_group (str, optional) – The hdf5 group where the millisecond markers are stored. Defaults to “msMarkers”.

Returns:

The count rate in Hz and the seconds into the scan.

Return type:

Tuple[np.ndarray, np.ndarray]

sed.loader.mpes.loader.get_elapsed_time(h5file, ms_markers_group='msMarkers')

Return the elapsed time in the file from the msMarkers wave

Parameters:
  • h5file (h5py.File) – The h5file from which to get the elapsed time.

  • ms_markers_group (str, optional) – The hdf5 group where the millisecond markers are stored. Defaults to “msMarkers”.

Returns:

The acquisition time of the file in seconds.

Return type:

float
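The relation between millisecond markers, count rate and elapsed time can be sketched with plain Python lists. This sketch assumes that the marker dataset holds one entry per millisecond, each storing the cumulative number of events recorded so far; that layout, the simple forward difference, and the names count_rate_sketch and elapsed_time_sketch are illustrative assumptions, not the library's implementation.

```python
from typing import List, Sequence, Tuple


def count_rate_sketch(ms_markers: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (count rate in Hz, seconds into the scan) for each marker."""
    n = len(ms_markers)
    if n < 2:
        raise ValueError("Need at least two markers to form a difference.")
    secs = [i / 1000 for i in range(n)]
    rates = []
    for i in range(n):
        lo = min(i, n - 2)  # the last marker reuses the previous interval
        events_per_ms = ms_markers[lo + 1] - ms_markers[lo]
        rates.append(float(events_per_ms * 1000))  # events per ms -> Hz
    return rates, secs


def elapsed_time_sketch(ms_markers: Sequence[int]) -> float:
    """Return the acquisition time in seconds: one marker per millisecond."""
    return len(ms_markers) / 1000


markers = [0, 5, 10, 20]  # assumed cumulative event counts, one per millisecond
rates, secs = count_rate_sketch(markers)
elapsed = elapsed_time_sketch(markers)
```

With four markers the sketch reports 4 ms of acquisition and a rate that doubles once the event count per millisecond doubles.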

class sed.loader.mpes.loader.MpesLoader(config=None)

Bases: BaseLoader

Mpes implementation of the Loader. Reads from h5 files or folders of the SPECS Metis 1000 (FHI Berlin)

Parameters:
  • config (dict, optional) – Config dictionary. Defaults to None.

  • meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.

supported_file_types: typing.List[str] = ['h5']
read_dataframe(files=None, folders=None, runs=None, ftype='h5', metadata=None, collect_metadata=False, time_stamps=False, **kwds)

Read stored hdf5 files from a list or from folder and returns a dask dataframe and corresponding metadata.

Parameters:
  • files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.

  • folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.

  • runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.

  • ftype (str, optional) – File extension to use. If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to “h5”.

  • metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.

  • collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.

  • time_stamps (bool, optional) – Option to create a time_stamp column in the dataframe from ms-Markers in the files. Defaults to False.

  • **kwds

    Keyword parameters.

    • hdf5_groupnames : List of groupnames to look for in the file.

    • hdf5_aliases: Dictionary of aliases for the groupnames.

    • time_stamp_alias: Alias for the timestamp column

    • ms_markers_group: Group name of the millisecond marker column.

    • first_event_time_stamp_key: Attribute name containing the start timestamp of the file.

    Additional keywords are passed to hdf5_to_dataframe.

Raises:
  • ValueError – Raised if neither files nor folder is provided.

  • FileNotFoundError – Raised if a file or folder is not found.

Returns:

Dask dataframe, timed Dask dataframe and metadata read from specified files.

Return type:

Tuple[ddf.DataFrame, ddf.DataFrame, dict]

get_files_from_run_id(run_id, folders=None, extension='h5', **kwds)

Locate the files for a given run identifier.

Parameters:
  • run_id (str) – The run identifier to locate.

  • folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to config[“core”][“base_folder”]

  • extension (str, optional) – The file extension. Defaults to “h5”.

  • kwds – Keyword arguments

Returns:

List of file path strings to the location of run data.

Return type:

List[str]

gather_metadata(files, metadata=None)

Collect meta data from files

Parameters:
  • files (Sequence[str]) – List of files loaded

  • metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.

Returns:

The completed metadata dictionary.

Return type:

dict

get_count_rate(fids=None, **kwds)

Create count rate from the msMarker column for the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds

    Keyword arguments:

    • ms_markers_group: Name of the hdf5 group containing the ms-markers

Returns:

Arrays containing countrate and seconds into the scan.

Return type:

Tuple[np.ndarray, np.ndarray]

get_elapsed_time(fids=None, **kwds)

Return the elapsed time in the files specified in fids from the msMarkers column.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds

    Keyword arguments:

    • ms_markers_group: Name of the hdf5 group containing the ms-markers

Returns:

The elapsed time in the files in seconds.

Return type:

float

files: List[str]
runs: List[str]
metadata: Dict[Any, Any]
sed.loader.mpes.loader.LOADER

alias of MpesLoader

3.5. FlashLoader

This module implements the flash data loader. This loader currently supports hextof, wespe and instruments with similar structure. The raw hdf5 data is combined and saved into buffer files and loaded as a dask dataframe. The dataframe is an amalgamation of all h5 files for a combination of runs, where the NaNs are automatically forward filled across different files. This can then be saved as a parquet file for out-of-sed processing and read back in to access other sed functionality.
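Forward filling across file boundaries means that a gap at the start of one file is filled with the last valid value of the previous file. The sketch below shows the idea on plain Python lists, one list per file; forward_fill_across_chunks is a hypothetical helper and not how the loader is implemented internally.

```python
import math
from typing import List


def forward_fill_across_chunks(chunks: List[List[float]]) -> List[List[float]]:
    """Forward fill NaNs, carrying the last valid value over chunk boundaries."""
    filled = []
    last_valid = math.nan  # nothing seen yet, so leading NaNs stay NaN
    for chunk in chunks:
        new_chunk = []
        for value in chunk:
            if not math.isnan(value):
                last_valid = value
            new_chunk.append(last_valid)
        filled.append(new_chunk)
    return filled


nan = math.nan
per_file = [[nan, 1.0, nan], [nan, nan, 2.0], [nan]]  # three "files"
result = forward_fill_across_chunks(per_file)
```

Only the very first value remains NaN, because no earlier valid value exists to fill it.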

class sed.loader.flash.loader.FlashLoader(config)

Bases: BaseLoader

The class generates multiindexed multidimensional pandas dataframes from the new FLASH dataformat resolved by both macro and microbunches alongside electrons. Only the read_dataframe (inherited and implemented) method is accessed by other modules.

Parameters:

config (dict) – Config dictionary.

supported_file_types: typing.List[str] = ['h5']
initialize_paths()

Initializes the paths based on the configuration.

Returns:

A tuple containing a list of raw data directories paths and the parquet data directory path.

Return type:

Tuple[List[Path], Path]

Raises:
  • ValueError – If required values are missing from the configuration.

  • FileNotFoundError – If the raw data directories are not found.

get_files_from_run_id(run_id, folders=None, extension='h5', **kwds)

Returns a list of filenames for a given run located in the specified directory for the specified data acquisition (daq).

Parameters:
  • run_id (str) – The run identifier to locate.

  • folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to config[“core”][“base_folder”].

  • extension (str, optional) – The file extension. Defaults to “h5”.

  • kwds

    Keyword arguments:

    • daq (str): The data acquisition identifier. Defaults to config[“dataframe”][“daq”].

Returns:

A list of path strings representing the collected file names.

Return type:

List[str]

Raises:

FileNotFoundError – If no files are found for the given run in the directory.

property available_channels: List

Returns the channel names that are available for use, excluding pulseId, defined by the json file

get_channels_by_format(formats)

Returns a list of channels with the specified format.

Parameters:

formats (List[str]) – The desired formats (‘per_pulse’, ‘per_electron’, or ‘per_train’).

Returns:

A list of channels with the specified format(s).

Return type:

List
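Selecting channels by format amounts to filtering the channel configuration. The sketch below assumes a configuration in which every channel entry carries a format key; the dictionary layout, the channel names and the function channels_by_format are hypothetical and only illustrate the filtering step.

```python
from typing import Dict, List

# Hypothetical channel configuration; the real layout comes from the loader config.
channels: Dict[str, dict] = {
    "dldPosX": {"format": "per_electron"},
    "dldPosY": {"format": "per_electron"},
    "bam": {"format": "per_pulse"},
    "delayStage": {"format": "per_train"},
}


def channels_by_format(config: Dict[str, dict], formats: List[str]) -> List[str]:
    """Return the names of all channels whose format is in the given list."""
    return [name for name, entry in config.items() if entry.get("format") in formats]


electron_channels = channels_by_format(channels, ["per_electron"])
slow_channels = channels_by_format(channels, ["per_pulse", "per_train"])
```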

reset_multi_index()

Resets the index per pulse and electron

Return type:

None

create_multi_index_per_electron(h5_file)

Creates an index per electron using pulseId for usage with the electron resolved pandas DataFrame.

Parameters:

h5_file (h5py.File) – The HDF5 file object.

Return type:

None

Notes

  • This method relies on the ‘pulseId’ channel to determine the macrobunch IDs.

  • It creates a MultiIndex with trainId, pulseId, and electronId as the index levels.

create_multi_index_per_pulse(train_id, np_array)

Creates an index per pulse using a pulse resolved channel’s macrobunch ID, for usage with the pulse resolved pandas DataFrame.

Parameters:
  • train_id (Series) – The train ID Series.

  • np_array (np.ndarray) – The numpy array containing the pulse resolved data.

Return type:

None

Notes

  • This method creates a MultiIndex with trainId and pulseId as the index levels.

create_numpy_array_per_channel(h5_file, channel)

Returns a numpy array for a given channel name for a given file.

Parameters:
  • h5_file (h5py.File) – The h5py file object.

  • channel (str) – The name of the channel.

Returns:

A tuple containing the train ID Series and the numpy array for the channel’s data.

Return type:

Tuple[Series, np.ndarray]

create_dataframe_per_electron(np_array, train_id, channel)

Returns a pandas DataFrame for a given channel name of type [per electron].

Parameters:
  • np_array (np.ndarray) – The numpy array containing the channel data.

  • train_id (Series) – The train ID Series.

  • channel (str) – The name of the channel.

Returns:

The pandas DataFrame for the channel’s data.

Return type:

DataFrame

Notes

The microbunch resolved data is exploded and converted to a DataFrame. The MultiIndex is set, and the NaN values are dropped, alongside the pulseId = 0 (meaningless).

create_dataframe_per_pulse(np_array, train_id, channel, channel_dict)

Returns a pandas DataFrame for a given channel name of type [per pulse].

Parameters:
  • np_array (np.ndarray) – The numpy array containing the channel data.

  • train_id (Series) – The train ID Series.

  • channel (str) – The name of the channel.

  • channel_dict (dict) – The dictionary containing channel parameters.

Returns:

The pandas DataFrame for the channel’s data.

Return type:

DataFrame

Notes

  • For auxiliary channels, the macrobunch resolved data is repeated 499 times to be compared to electron resolved data for each auxiliary channel. The data is then converted to a multicolumn DataFrame.

  • For all other pulse resolved channels, the macrobunch resolved data is exploded to a DataFrame and the MultiIndex is set.

create_dataframe_per_train(np_array, train_id, channel)

Returns a pandas DataFrame for a given channel name of type [per train].

Parameters:
  • np_array (np.ndarray) – The numpy array containing the channel data.

  • train_id (Series) – The train ID Series.

  • channel (str) – The name of the channel.

Returns:

The pandas DataFrame for the channel’s data.

Return type:

DataFrame

create_dataframe_per_channel(h5_file, channel)

Returns a pandas DataFrame for a given channel name from a given file.

This method takes an h5py.File object h5_file and a channel name channel, and returns a pandas DataFrame containing the data for that channel from the file. The format of the DataFrame depends on the channel’s format specified in the configuration.

Parameters:
  • h5_file (h5py.File) – The h5py.File object representing the HDF5 file.

  • channel (str) – The name of the channel.

Returns:

A pandas Series or DataFrame representing the channel’s data.

Return type:

Union[Series, DataFrame]

Raises:

ValueError – If the channel has an undefined format.

concatenate_channels(h5_file)

Concatenates the channels from the provided h5py.File into a pandas DataFrame.

This method takes an h5py.File object h5_file and concatenates the channels present in the file into a single pandas DataFrame. The concatenation is performed based on the available channels specified in the configuration.

Parameters:

h5_file (h5py.File) – The h5py.File object representing the HDF5 file.

Returns:

A concatenated pandas DataFrame containing the channels.

Return type:

DataFrame

Raises:

ValueError – If the group_name for any channel does not exist in the file.

create_dataframe_per_file(file_path)

Create pandas DataFrames for the given file.

This method loads an HDF5 file specified by file_path and constructs a pandas DataFrame from the datasets within the file. The order of datasets in the DataFrames is the opposite of the order specified by channel names.

Parameters:

file_path (Path) – Path to the input HDF5 file.

Returns:

pandas DataFrame

Return type:

DataFrame

create_buffer_file(h5_path, parquet_path)

Converts an HDF5 file to Parquet format to create a buffer file.

This method uses create_dataframe_per_file method to create dataframes from individual files within an HDF5 file. The resulting dataframe is then saved to a Parquet file.

Parameters:
  • h5_path (Path) – Path to the input HDF5 file.

  • parquet_path (Path) – Path to the output Parquet file.

Raises:

ValueError – If an error occurs during the conversion process.

Return type:

typing.Union[bool, Exception]

buffer_file_handler(data_parquet_dir, detector)

Handles the conversion of buffer files (h5 to parquet) and returns the filenames.

Parameters:
  • data_parquet_dir (Path) – Directory where the parquet files will be stored.

  • detector (str) – Detector name.

Returns:

Two lists, one for h5 file paths and one for corresponding parquet file paths.

Return type:

Tuple[List[Path], List[Path]]

Raises:

FileNotFoundError – If the conversion fails for any files or no data is available.

parquet_handler(data_parquet_dir, detector='', parquet_path=None, converted=False, load_parquet=False, save_parquet=False)

Handles loading and saving of parquet files based on the provided parameters.

Parameters:
  • data_parquet_dir (Path) – Directory where the parquet files are located.

  • detector (str, optional) – Adds an identifier for parquets to distinguish multidetector systems.

  • parquet_path (str, optional) – Path to the combined parquet file.

  • converted (bool, optional) – True if data is augmented by adding additional columns externally and saved into converted folder.

  • load_parquet (bool, optional) – Loads the entire parquet into the dd dataframe.

  • save_parquet (bool, optional) – Saves the entire dataframe into a parquet.

Returns:

A tuple containing two dataframes:

  • dataframe_electron: Dataframe containing the loaded/augmented electron data.

  • dataframe_pulse: Dataframe containing the loaded/augmented timed data.

Return type:

tuple

Raises:

FileNotFoundError – If the requested parquet file is not found.

parse_metadata()

Uses the MetadataRetriever class to fetch metadata from scicat for each run.

Returns:

Metadata dictionary

Return type:

dict

get_count_rate(fids=None, **kwds)

Create count rate data for the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

Arrays containing countrate and seconds into the scan.

Return type:

Tuple[np.ndarray, np.ndarray]

get_elapsed_time(fids=None, **kwds)

Return the elapsed time in the files specified in fids.

Parameters:
  • fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.

  • kwds – Keyword arguments

Returns:

The elapsed time in the files in seconds.

Return type:

float

read_dataframe(files=None, folders=None, runs=None, ftype='h5', metadata=None, collect_metadata=False, **kwds)

Read express data from the DAQ, generating a parquet in between.

Parameters:
  • files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.

  • folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.

  • runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.

  • ftype (str, optional) – The file extension type. Defaults to “h5”.

  • metadata (dict, optional) – Additional metadata. Defaults to None.

  • collect_metadata (bool, optional) – Whether to collect metadata. Defaults to False.

Returns:

A tuple containing the concatenated DataFrame and metadata.

Return type:

Tuple[dd.DataFrame, dict]

Raises:
  • ValueError – If neither ‘runs’ nor ‘files’/‘data_raw_dir’ is provided.

  • FileNotFoundError – If the conversion fails for some files or no data is available.

files: List[str]
runs: List[str]
metadata: Dict[Any, Any]
sed.loader.flash.loader.LOADER

alias of FlashLoader

3.6. Utilities

Utilities for loaders

sed.loader.utils.gather_files(folder, extension, f_start=None, f_end=None, f_step=1, file_sorting=True)

Collects and sorts files with specified extension from a given folder.

Parameters:
  • folder (str) – The folder to search

  • extension (str) – File extension used for glob.glob().

  • f_start (int, optional) – Start file id used to construct a file selector. Defaults to None.

  • f_end (int, optional) – End file id used to construct a file selector. Defaults to None.

  • f_step (int, optional) – Step of file id incrementation, used to construct a file selector. Defaults to 1.

  • file_sorting (bool, optional) – Option to sort the files by their names. Defaults to True.

Returns:

List of collected file names.

Return type:

List[str]
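A minimal sketch of the collect, sort and select steps follows. It assumes that the file selector built from f_start, f_end and f_step is a plain slice over the sorted file list and that sorting is lexicographic; both are assumptions for illustration, and gather_files_sketch is a hypothetical stand-in for the real function.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def gather_files_sketch(
    folder: str,
    extension: str,
    f_start: Optional[int] = None,
    f_end: Optional[int] = None,
    f_step: int = 1,
    file_sorting: bool = True,
) -> List[str]:
    """Collect files with the given extension, optionally sorted and sliced."""
    files = [str(path) for path in Path(folder).glob(f"*.{extension}")]
    if file_sorting:
        files = sorted(files)
    # Assumption: the file selector is a plain slice over the collected list.
    if f_start is not None and f_end is not None:
        files = files[f_start:f_end:f_step]
    return files


with tempfile.TemporaryDirectory() as tmp:
    for index in range(6):
        (Path(tmp) / f"scan_{index:03d}.h5").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: wrong extension

    all_names = [Path(f).name for f in gather_files_sketch(tmp, "h5")]
    every_second = [
        Path(f).name for f in gather_files_sketch(tmp, "h5", f_start=0, f_end=6, f_step=2)
    ]
```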

sed.loader.utils.parse_h5_keys(h5_file, prefix='')

Helper method which parses the channels present in the h5 file.

Parameters:
  • h5_file (h5py.File) – The H5 file object.

  • prefix (str, optional) – The prefix for the channel names. Defaults to an empty string.

Returns:

A list of channel names in the H5 file.

Return type:

List[str]

Raises:

Exception – If an error occurs while parsing the keys.
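Parsing the channels of a hierarchical file is a recursive walk that joins group names into paths. The sketch below uses nested dictionaries as a stand-in for HDF5 groups so that it runs without any file; the function parse_keys_sketch, the example layout and the exact form of the returned paths are assumptions made for illustration.

```python
from typing import List


def parse_keys_sketch(node: dict, prefix: str = "") -> List[str]:
    """Return the full path of every leaf entry below the given node."""
    keys = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            # A nested dict plays the role of a group: descend into it.
            keys.extend(parse_keys_sketch(child, prefix=path))
        else:
            # Anything else plays the role of a dataset: record its path.
            keys.append(path)
    return keys


# Hypothetical file layout with two levels of groups.
fake_file = {
    "detector": {"x": [1, 2], "y": [3, 4]},
    "timing": {"pulseId": [0, 1]},
    "comment": "leaf at top level",
}
channel_paths = parse_keys_sketch(fake_file)
prefixed = parse_keys_sketch({"x": 1}, prefix="/root")
```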

sed.loader.utils.split_channel_bitwise(df, input_column, output_columns, bit_mask, overwrite=False, types=None)

Splits a channel into two channels bitwise.

This function splits a channel into two channels by separating the first n bits from the remaining bits. The first n bits are stored in the first output column, the remaining bits are stored in the second output column.

Parameters:
  • df (dask.dataframe.DataFrame) – Dataframe to use.

  • input_column (str) – Name of the column to split.

  • output_columns (Sequence[str]) – Names of the columns to create.

  • bit_mask (int) – Bit mask to use for splitting.

  • overwrite (bool, optional) – Whether to overwrite existing columns. Defaults to False.

  • types (Sequence[type], optional) – Types of the new columns.

Returns:

Dataframe with the new columns.

Return type:

dask.dataframe.DataFrame
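The bitwise split can be illustrated on plain integers. The sketch assumes that the bit_mask argument gives the number n of low bits that go into the first output column; this reading of the parameter, and the helper split_bitwise, are assumptions and do not reproduce the dataframe-based implementation.

```python
from typing import Tuple


def split_bitwise(value: int, n_bits: int) -> Tuple[int, int]:
    """Split value into (first n bits, remaining bits shifted down)."""
    low = value & ((1 << n_bits) - 1)  # keep only the first n bits
    high = value >> n_bits  # shift the remaining bits down
    return low, high


raw_column = [0b10110101, 0b00000111, 0b00001000]
pairs = [split_bitwise(value, 3) for value in raw_column]
first_output = [low for low, _ in pairs]
second_output = [high for _, high in pairs]

# The split is lossless: shifting back and combining restores the input.
rebuilt = [(high << 3) | low for low, high in pairs]
```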

module sed.loader.mirrorutil, code for transparently mirroring file system trees to a second (local) location. This speeds up binning of data stored on network drives tremendously. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig

class sed.loader.mirrorutil.CopyTool(source, dest, **kwds)

Bases: object

File collecting and sorting class.

Parameters:
  • source (str) – Source path for the copy tool.

  • dest (str) – Destination path for the copy tool.

copy(source, force_copy=False, **compute_kwds)

Local file copying method.

Parameters:
  • source (str) – source path

  • force_copy (bool, optional) – re-copy all files. Defaults to False.

Raises:
  • FileNotFoundError – Raised if the source path is not found or empty.

  • OSError – Raised if the target disk is full.

Returns:

Path of the copied source directory mapped into the target tree

Return type:

str

size(sdir)

Calculate file size.

Parameters:

sdir (str) – Path to source directory

Returns:

Size of files in source path

Return type:

int

cleanup_oldest_scan(force=False, report=False)

Remove scans in old directories. Looks for the directory with the oldest ctime and asks the user to confirm its deletion.

Parameters:
  • force (bool, optional) – Forces to automatically remove the oldest scan. Defaults to False.

  • report (bool, optional) – Print a report with all directories in dest, sorted by age. Defaults to False.

Raises:

FileNotFoundError – Raised if no scans to remove are found.

sed.loader.mirrorutil.get_target_dir(sdir, source, dest, gid, mode, create=False)

Retrieve target directory.

Parameters:
  • sdir (str) – Source directory to copy

  • source (str) – source root path

  • dest (str) – destination root path

  • gid (int) – Group id

  • mode (int) – Unix mode

  • create (bool, optional) – Whether to create directories. Defaults to False.

Raises:
  • NotADirectoryError – Raised if sdir is not a directory

  • ValueError – Raised if sdir not inside of source

Returns:

The mapped target directory inside dest

Return type:

str
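Mapping a source directory into the destination tree can be sketched with os.path alone. The helper map_into_dest is hypothetical; it only shows the path arithmetic and the ValueError for directories outside of source, and leaves out the directory check, creation, group id and mode handling of the real function.

```python
import os


def map_into_dest(sdir: str, source: str, dest: str) -> str:
    """Return the path inside dest that mirrors sdir relative to source."""
    sdir_abs = os.path.abspath(sdir)
    source_abs = os.path.abspath(source)
    # sdir lies inside source only if source is their common parent path.
    if os.path.commonpath([sdir_abs, source_abs]) != source_abs:
        raise ValueError(f"{sdir} is not inside {source}")
    relative = os.path.relpath(sdir_abs, source_abs)
    return os.path.normpath(os.path.join(dest, relative))


mapped = map_into_dest("/net/data/2023/scan_12", "/net/data", "/local/mirror")
expected = os.path.normpath(os.path.join("/local/mirror", "2023", "scan_12"))

try:
    map_into_dest("/other/place", "/net/data", "/local/mirror")
    outside_rejected = False
except ValueError:
    outside_rejected = True
```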

sed.loader.mirrorutil.mymakedirs(path, mode, gid)

Creates a directory path iteratively from its root

Parameters:
  • path (str) – Path of the directory to create

  • mode (int) – Unix access mode of created directories

  • gid (int) – Group id of created directories

Returns:

Path of created directories

Return type:

str

sed.loader.mirrorutil.mycopy(source, dest, gid, mode, replace=False)

Copy function with option to delete the target file first (to take ownership).

Parameters:
  • source (str) – Path to the source file

  • dest (str) – Path to the destination file

  • gid (int) – Group id to be set for the destination file

  • mode (int) – Unix access mode to be set for the destination file

  • replace (bool, optional) – Option to replace an existing file. Defaults to False.