3. Data loader
3.1. Loader Interface
Interface to select a specified loader
- sed.loader.loader_interface.get_loader(loader_name, config=None)
Helper function to get the loader object from its given name.
- Parameters:
loader_name (str) – Name of the loader
config (dict, optional) – Configuration dictionary. Defaults to None.
- Raises:
ValueError – Raised if the loader cannot be found.
- Returns:
The loader object.
- Return type:
- sed.loader.loader_interface.get_names_of_all_loaders()
Helper function to populate a list of all available loaders.
- Returns:
List of all detected loader names.
- Return type:
List[str]
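The name-based lookup can be illustrated with a small stand-alone sketch. It assumes, in line with the BaseLoader description in section 3.2, that each loader lives in its own folder containing a loader.py file and that the folder name is the identifier. The helper get_loader_path, the loader_root argument and the directory layout are hypothetical and chosen for illustration only; this is not the sed implementation.

```python
import tempfile
from pathlib import Path
from typing import List


def get_names_of_all_loaders(loader_root: Path) -> List[str]:
    """List every sub-folder of loader_root that contains a loader.py file."""
    return sorted(p.parent.name for p in loader_root.glob("*/loader.py"))


def get_loader_path(loader_name: str, loader_root: Path) -> Path:
    """Return the loader.py path for loader_name, or raise ValueError."""
    path = loader_root / loader_name / "loader.py"
    if not path.is_file():
        raise ValueError(f"Loader '{loader_name}' not found.")
    return path


# Build a throw-away directory tree that imitates the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("base", "generic", "mpes", "flash"):
        (root / name).mkdir()
        (root / name / "loader.py").touch()
    (root / "utils.py").touch()  # not a loader folder, so it is ignored
    names = get_names_of_all_loaders(root)
    found = get_loader_path("mpes", root).name
    try:
        get_loader_path("unknown", root)
        missing_raises = False
    except ValueError:
        missing_raises = True
```

An unknown name fails with a ValueError, matching the Raises entry of get_loader.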
3.2. Abstract BaseLoader
The abstract class off of which to implement loaders.
- class sed.loader.base.loader.BaseLoader(config=None)
Bases: ABC
The abstract class off of which to implement loaders.
The reader’s folder name is the identifier. For this BaseLoader with filename base/loader.py the ID becomes ‘base’.
- Parameters:
config (dict, optional) – Config dictionary. Defaults to None.
meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.
- supported_file_types: typing.List[str] = []
- abstract read_dataframe(files=None, folders=None, runs=None, ftype=None, metadata=None, collect_metadata=False, **kwds)
Reads data from given files, folder, or runs and returns a dask dataframe and corresponding metadata.
- Parameters:
files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.
folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.
runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.
ftype (str, optional) – File type to read (‘parquet’, ‘json’, ‘csv’, etc). If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to None.
metadata (dict, optional) – Manual metadata dictionary. Auto-generated metadata will be added to it. Defaults to None.
collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.
**kwds – keyword arguments. See description in respective loader.
- Returns:
Dask dataframe, timed dataframe and metadata read from specified files.
- Return type:
Tuple[ddf.DataFrame, dict]
- abstract get_files_from_run_id(run_id, folders=None, extension=None, **kwds)
Locate the files for a given run identifier.
- Parameters:
run_id (str) – The run identifier to locate.
folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to None.
extension (str, optional) – The file extension. Defaults to None.
kwds – Keyword arguments
- Returns:
List of files for the given run.
- Return type:
List[str]
- abstract get_count_rate(fids=None, **kwds)
Create count rate data for the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
Arrays containing count rate and seconds into the scan.
- Return type:
Tuple[np.ndarray, np.ndarray]
- abstract get_elapsed_time(fids=None, **kwds)
Return the elapsed time in the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
The elapsed time in the files in seconds.
- Return type:
float
- sed.loader.base.loader.LOADER
alias of
BaseLoader
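How a concrete loader fills in the abstract methods can be sketched with Python's abc module. SketchBaseLoader below is a hypothetical stand-in that mirrors only two of the documented abstract methods, and ToyLoader with its "one second per file" rule is invented for illustration; neither is part of sed.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class SketchBaseLoader(ABC):
    """Hypothetical stand-in that mirrors part of the documented interface."""

    supported_file_types: List[str] = []

    def __init__(self, config: Optional[dict] = None):
        self.config = config if config is not None else {}
        self.files: List[str] = []

    @abstractmethod
    def get_files_from_run_id(self, run_id, folders=None, extension=None, **kwds):
        ...

    @abstractmethod
    def get_elapsed_time(self, fids: Optional[Sequence[int]] = None, **kwds):
        ...


class ToyLoader(SketchBaseLoader):
    """Toy loader: every file is assumed to cover exactly one second."""

    supported_file_types = ["txt"]

    def get_files_from_run_id(self, run_id, folders=None, extension="txt", **kwds):
        return [f"run{run_id}_{i}.{extension}" for i in range(2)]

    def get_elapsed_time(self, fids=None, **kwds):
        if fids is None:
            fids = range(len(self.files))  # default: all file ids
        return float(len(list(fids)))


loader = ToyLoader()
loader.files = loader.get_files_from_run_id("42")

try:
    SketchBaseLoader()  # abstract methods are missing, so this fails
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False
```

Because the methods are declared abstract, a subclass that omits one of them cannot be instantiated, which catches incomplete loaders early.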
3.3. GenericLoader
module sed.loader.mpes, code for loading hdf5 files delayed into a dask dataframe. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig
- class sed.loader.generic.loader.GenericLoader(config=None)
Bases: BaseLoader
Dask implementation of the Loader. Reads from various file types using the utilities of Dask.
- Parameters:
config (dict, optional) – Config dictionary. Defaults to None.
meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.
- supported_file_types: typing.List[str] = ['parquet', 'csv', 'json']
- read_dataframe(files=None, folders=None, runs=None, ftype='parquet', metadata=None, collect_metadata=False, **kwds)
Read stored files from a folder into a dataframe.
- Parameters:
files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.
folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.
runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.
ftype (str, optional) – File type to read (‘parquet’, ‘json’, ‘csv’, etc). If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to “parquet”.
metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.
collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.
**kwds – keyword arguments. See the keyword arguments for the specific file parser in the dask.dataframe module.
- Raises:
ValueError – Raised if neither files nor folders are provided.
FileNotFoundError – Raised if the files or folders cannot be found.
ValueError – Raised if the file type is not supported.
- Returns:
Dask dataframe, timed dataframe and metadata read from specified files.
- Return type:
Tuple[ddf.DataFrame, dict]
- get_files_from_run_id(run_id, folders=None, extension=None, **kwds)
Locate the files for a given run identifier.
- Parameters:
run_id (str) – The run identifier to locate.
folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to None.
extension (str, optional) – The file extension. Defaults to None.
kwds – Keyword arguments
- Returns:
Path to the location of run data.
- Return type:
str
- get_count_rate(fids=None, **kwds)
Create count rate data for the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
Arrays containing count rate and seconds into the scan.
- Return type:
Tuple[np.ndarray, np.ndarray]
- get_elapsed_time(fids=None, **kwds)
Return the elapsed time in the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
The elapsed time in the files in seconds.
- Return type:
float
- files: List[str]
- runs: List[str]
- metadata: Dict[Any, Any]
- sed.loader.generic.loader.LOADER
alias of
GenericLoader
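The precedence among the files, folders and runs arguments of read_dataframe can be summarised in a short sketch. The function select_source and its return format are hypothetical; only the ordering (runs over folders, folders over files, ValueError when nothing is given) is taken from the parameter descriptions above.

```python
from typing import List, Optional, Sequence, Tuple, Union

PathArg = Optional[Union[str, Sequence[str]]]


def as_list(value: PathArg) -> List[str]:
    """Normalise None, a single string, or a sequence of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def select_source(files: PathArg = None, folders: PathArg = None,
                  runs: PathArg = None) -> Tuple[str, List[str]]:
    """Return which argument is used and its normalised value."""
    if as_list(runs):
        return "runs", as_list(runs)        # highest priority
    if as_list(folders):
        return "folders", as_list(folders)  # files are ignored
    if as_list(files):
        return "files", as_list(files)
    raise ValueError("Either files, folders, or runs must be provided.")


only_files = select_source(files="a.parquet")
files_and_folder = select_source(files="a.parquet", folders="data")
everything = select_source(files=["a"], folders=["d"], runs="7")
```

When runs wins, the folders argument is still relevant in the documented API, since it is where the files of a run are looked up; the sketch leaves that step out.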
3.4. MpesLoader
module sed.loader.mpes, code for loading hdf5 files delayed into a dask dataframe. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig
- sed.loader.mpes.loader.hdf5_to_dataframe(files, group_names=None, alias_dict=None, time_stamps=False, time_stamp_alias='timeStamps', ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp', **kwds)
Function to read a selection of hdf5-files, and generate a delayed dask dataframe from provided groups in the files. Optionally, aliases can be defined.
- Parameters:
files (List[str]) – A list of the file paths to load.
group_names (List[str], optional) – hdf5 group names to load. Defaults to load all groups containing “Stream”
alias_dict (Dict[str, str], optional) – Dictionary of aliases for the dataframe columns. Keys are the hdf5 groupnames, and values the aliases. If an alias is not found, its group name is used. Defaults to read the attribute “Name” from each group.
time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.
time_stamp_alias (str) – Alias name for the timestamp column. Defaults to “timeStamps”.
ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.
first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.
- Returns:
The delayed Dask DataFrame
- Return type:
ddf.DataFrame
- sed.loader.mpes.loader.hdf5_to_timed_dataframe(files, group_names=None, alias_dict=None, time_stamps=False, time_stamp_alias='timeStamps', ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp', **kwds)
Function to read a selection of hdf5-files, and generate a delayed dask dataframe from provided groups in the files. Optionally, aliases can be defined. Returns a dataframe for evenly spaced time intervals.
- Parameters:
files (List[str]) – A list of the file paths to load.
group_names (List[str], optional) – hdf5 group names to load. Defaults to load all groups containing “Stream”
alias_dict (Dict[str, str], optional) – Dictionary of aliases for the dataframe columns. Keys are the hdf5 groupnames, and values the aliases. If an alias is not found, its group name is used. Defaults to read the attribute “Name” from each group.
time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.
time_stamp_alias (str) – Alias name for the timestamp column. Defaults to “timeStamps”.
ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.
first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.
- Returns:
The delayed Dask DataFrame
- Return type:
ddf.DataFrame
- sed.loader.mpes.loader.get_groups_and_aliases(h5file, seach_pattern=None, alias_key='Name')
Read groups and aliases from a provided hdf5 file handle
- Parameters:
h5file (h5py.File) – The hdf5 file handle
seach_pattern (str, optional) – Search pattern to select groups. Defaults to include all groups.
alias_key (str, optional) – Attribute key where aliases are stored. Defaults to “Name”.
- Returns:
The list of groupnames and the alias dictionary parsed from the file
- Return type:
Tuple[List[str], Dict[str, str]]
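The group selection and alias fallback can be sketched without an hdf5 file. Here a plain dictionary that maps group names to attribute dictionaries stands in for the h5py file handle, and substring matching is assumed for the search pattern; both are simplifications for illustration, not the library code.

```python
from typing import Dict, List, Optional, Tuple


def groups_and_aliases(groups: Dict[str, Dict[str, str]],
                       search_pattern: Optional[str] = None,
                       alias_key: str = "Name") -> Tuple[List[str], Dict[str, str]]:
    """Select group names and build the alias dictionary.

    groups maps a group name to its attribute dictionary and stands in for
    an open hdf5 file handle.
    """
    if search_pattern is None:
        names = list(groups)  # include all groups
    else:
        names = [name for name in groups if search_pattern in name]
    # Fall back to the group name when the alias attribute is missing.
    aliases = {name: groups[name].get(alias_key, name) for name in names}
    return names, aliases


fake_file = {
    "Stream_0": {"Name": "X"},
    "Stream_1": {"Name": "Y"},
    "Stream_2": {},
    "msMarkers": {},
}
names, aliases = groups_and_aliases(fake_file, search_pattern="Stream")
```

The group without a "Name" attribute keeps its own name as alias, which matches the fallback described for alias_dict in hdf5_to_dataframe.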
- sed.loader.mpes.loader.hdf5_to_array(h5file, group_names, data_type='float32', time_stamps=False, ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp')
Reads the content of the given groups in an hdf5 file, and returns a 2-dimensional array with the corresponding values.
- Parameters:
h5file (h5py.File) – hdf5 file handle to read from
group_names (str) – group names to read
data_type (str, optional) – Data type of the output data. Defaults to “float32”.
time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.
ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.
first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.
- Returns:
The 2-dimensional data array containing the values of the groups.
- Return type:
np.ndarray
- sed.loader.mpes.loader.hdf5_to_timed_array(h5file, group_names, data_type='float32', time_stamps=False, ms_markers_group='msMarkers', first_event_time_stamp_key='FirstEventTimeStamp')
Reads the content of the given groups in an hdf5 file, and returns a timed version of a 2-dimensional array with the corresponding values.
- Parameters:
h5file (h5py.File) – hdf5 file handle to read from
group_names (str) – group names to read
data_type (str, optional) – Data type of the output data. Defaults to “float32”.
time_stamps (bool, optional) – Option to calculate time stamps. Defaults to False.
ms_markers_group (str) – h5 column containing timestamp information. Defaults to “msMarkers”.
first_event_time_stamp_key (str) – h5 attribute containing the start timestamp of a file. Defaults to “FirstEventTimeStamp”.
- Returns:
The array of values at evenly spaced time intervals, obtained from the ms_markers.
- Return type:
np.ndarray
- sed.loader.mpes.loader.get_attribute(h5group, attribute)
Reads, decodes and returns an attribute from an hdf5 group.
- Parameters:
h5group (h5py.Group) – The hdf5 group to read from
attribute (str) – The name of the attribute
- Returns:
The parsed attribute data
- Return type:
str
- sed.loader.mpes.loader.get_count_rate(h5file, ms_markers_group='msMarkers')
Create count rate in the file from the msMarker column.
- Parameters:
h5file (h5py.File) – The h5file from which to get the count rate.
ms_markers_group (str, optional) – The hdf5 group where the millisecond markers are stored. Defaults to “msMarkers”.
- Returns:
The count rate in Hz and the seconds into the scan.
- Return type:
Tuple[np.ndarray, np.ndarray]
- sed.loader.mpes.loader.get_elapsed_time(h5file, ms_markers_group='msMarkers')
Return the elapsed time in the file from the msMarkers wave
- Parameters:
h5file (h5py.File) – The h5file from which to get the elapsed time.
ms_markers_group (str, optional) – The hdf5 group where the millisecond markers are stored. Defaults to “msMarkers”.
- Returns:
The acquisition time of the file in seconds.
- Return type:
float
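A rough sketch of how a count rate and an elapsed time could be derived from millisecond markers is given below. It rests on an assumption that is not stated in this reference: that the msMarkers dataset holds one entry per elapsed millisecond, each being the cumulative number of events recorded so far. The sketch uses a plain finite difference on Python lists, which may differ from the method used in sed.

```python
from typing import List, Sequence, Tuple


def count_rate_from_markers(ms_markers: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (count rate in Hz, seconds into the scan) for each marker."""
    seconds = [i / 1000 for i in range(len(ms_markers))]
    rates = []
    previous = 0
    for marker in ms_markers:
        # Events per millisecond, scaled to events per second (Hz).
        rates.append((marker - previous) * 1000.0)
        previous = marker
    return rates, seconds


def elapsed_time_from_markers(ms_markers: Sequence[int]) -> float:
    """One marker per millisecond, so the marker count gives the duration."""
    return len(ms_markers) / 1000


markers = [2, 5, 5, 9]  # assumed cumulative event counts, one per millisecond
rates, seconds = count_rate_from_markers(markers)
elapsed = elapsed_time_from_markers(markers)
```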
- class sed.loader.mpes.loader.MpesLoader(config=None)
Bases: BaseLoader
Mpes implementation of the Loader. Reads from h5 files or folders of the SPECS Metis 1000 (FHI Berlin).
- Parameters:
config (dict, optional) – Config dictionary. Defaults to None.
meta_handler (MetaHandler, optional) – MetaHandler object. Defaults to None.
- supported_file_types: typing.List[str] = ['h5']
- read_dataframe(files=None, folders=None, runs=None, ftype='h5', metadata=None, collect_metadata=False, time_stamps=False, **kwds)
Read stored hdf5 files from a list or from folder and returns a dask dataframe and corresponding metadata.
- Parameters:
files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.
folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.
runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.
ftype (str, optional) – File extension to use. If a folder path is given, all files with the specified extension are read into the dataframe in the reading order. Defaults to “h5”.
metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.
collect_metadata (bool) – Option to collect metadata from files. Requires a valid config dict. Defaults to False.
time_stamps (bool, optional) – Option to create a time_stamp column in the dataframe from ms-Markers in the files. Defaults to False.
**kwds – Keyword parameters:
hdf5_groupnames: List of groupnames to look for in the file.
hdf5_aliases: Dictionary of aliases for the groupnames.
time_stamp_alias: Alias for the timestamp column.
ms_markers_group: Group name of the millisecond marker column.
first_event_time_stamp_key: Attribute name containing the start timestamp of the file.
Additional keywords are passed to hdf5_to_dataframe.
- Raises:
ValueError – Raised if neither files nor folders are provided.
FileNotFoundError – Raised if a file or folder is not found.
- Returns:
Dask dataframe, timed Dask dataframe and metadata read from specified files.
- Return type:
Tuple[ddf.DataFrame, ddf.DataFrame, dict]
- get_files_from_run_id(run_id, folders=None, extension='h5', **kwds)
Locate the files for a given run identifier.
- Parameters:
run_id (str) – The run identifier to locate.
folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to config[“core”][“base_folder”]
extension (str, optional) – The file extension. Defaults to “h5”.
kwds – Keyword arguments
- Returns:
List of file path strings to the location of run data.
- Return type:
List[str]
- gather_metadata(files, metadata=None)
Collect meta data from files
- Parameters:
files (Sequence[str]) – List of files loaded
metadata (dict, optional) – Manual meta data dictionary. Auto-generated meta data are added to it. Defaults to None.
- Returns:
The completed metadata dictionary.
- Return type:
dict
- get_count_rate(fids=None, **kwds)
Create count rate from the msMarker column for the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds –
Keyword arguments:
ms_markers_group: Name of the hdf5 group containing the ms-markers
- Returns:
Arrays containing count rate and seconds into the scan.
- Return type:
Tuple[np.ndarray, np.ndarray]
- get_elapsed_time(fids=None, **kwds)
Return the elapsed time in the files specified in fids from the msMarkers column.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds –
Keyword arguments:
ms_markers_group: Name of the hdf5 group containing the ms-markers
- Returns:
The elapsed time in the files in seconds.
- Return type:
float
- files: List[str]
- runs: List[str]
- metadata: Dict[Any, Any]
- sed.loader.mpes.loader.LOADER
alias of
MpesLoader
3.5. FlashLoader
This module implements the flash data loader. This loader currently supports hextof, wespe and instruments with similar structure. The raw hdf5 data is combined and saved into buffer files and loaded as a dask dataframe. The dataframe is an amalgamation of all h5 files for a combination of runs, where the NaNs are automatically forward filled across different files. This can then be saved as a parquet file for out-of-sed processing and read back in to access other sed functionality.
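Forward filling across file boundaries means that a NaN at the start of one file takes the last valid value of the previous file. The sketch below shows the idea on plain Python lists, one list per file; the function name and the list representation are assumptions for illustration, and the loader itself works on dataframes.

```python
import math
from typing import List, Optional

NAN = float("nan")


def forward_fill_across_files(files: List[List[float]]) -> List[List[float]]:
    """Forward fill NaNs, carrying the last valid value into the next file."""
    last_valid: Optional[float] = None
    filled_files = []
    for values in files:
        filled = []
        for value in values:
            if math.isnan(value):
                # Keep NaN only if no valid value has been seen yet.
                value = NAN if last_valid is None else last_valid
            else:
                last_valid = value
            filled.append(value)
        filled_files.append(filled)
    return filled_files


file_a = [NAN, 1.0, NAN]
file_b = [NAN, 2.0, NAN]
filled_a, filled_b = forward_fill_across_files([file_a, file_b])
```

The leading NaN of the first file stays NaN because there is no earlier value to carry forward, while the leading NaN of the second file is filled from the first file.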
- class sed.loader.flash.loader.FlashLoader(config)
Bases: BaseLoader
The class generates multiindexed multidimensional pandas dataframes from the new FLASH dataformat resolved by both macro and microbunches alongside electrons. Only the read_dataframe (inherited and implemented) method is accessed by other modules.
- Parameters:
config (dict) –
- supported_file_types: typing.List[str] = ['h5']
- initialize_paths()
Initializes the paths based on the configuration.
- Returns:
A tuple containing a list of raw data directories paths and the parquet data directory path.
- Return type:
Tuple[List[Path], Path]
- Raises:
ValueError – If required values are missing from the configuration.
FileNotFoundError – If the raw data directories are not found.
- get_files_from_run_id(run_id, folders=None, extension='h5', **kwds)
Returns a list of filenames for a given run located in the specified directory for the specified data acquisition (daq).
- Parameters:
run_id (str) – The run identifier to locate.
folders (Union[str, Sequence[str]], optional) – The directory(ies) where the raw data is located. Defaults to config[“core”][“base_folder”].
extension (str, optional) – The file extension. Defaults to “h5”.
kwds – Keyword arguments: - daq (str): The data acquisition identifier. Defaults to config[“dataframe”][“daq”].
- Returns:
A list of path strings representing the collected file names.
- Return type:
List[str]
- Raises:
FileNotFoundError – If no files are found for the given run in the directory.
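Locating the files of a run amounts to a glob over the raw data folders followed by an emptiness check. The sketch below uses a hypothetical file naming scheme (`<daq>_run<run_id>_<n>.<extension>`) and a hypothetical function name; the real naming convention depends on the facility and configuration and is not specified in this reference.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence, Union


def files_for_run(run_id: str, folders: Union[str, Sequence[str]],
                  extension: str = "h5", daq: str = "daq") -> List[str]:
    """Collect the files of one run from one or several folders."""
    if isinstance(folders, str):
        folders = [folders]
    # Hypothetical naming scheme, chosen only for this sketch.
    pattern = f"{daq}_run{run_id}_*.{extension}"
    files: List[str] = []
    for folder in folders:
        files.extend(sorted(str(p) for p in Path(folder).glob(pattern)))
    if not files:
        raise FileNotFoundError(f"No files found for run {run_id}.")
    return files


with tempfile.TemporaryDirectory() as tmp:
    for name in ("daq_run7_0.h5", "daq_run7_1.h5", "daq_run8_0.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    run7 = [Path(f).name for f in files_for_run("7", tmp)]
    try:
        files_for_run("9", tmp)
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```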
- property available_channels: List
Returns the channel names that are available for use, excluding pulseId, defined by the json file
- get_channels_by_format(formats)
Returns a list of channels with the specified format.
- Parameters:
formats (List[str]) – The desired formats (‘per_pulse’, ‘per_electron’, or ‘per_train’).
- Returns:
A list of channels with the specified format(s).
- Return type:
List
- reset_multi_index()
Resets the index per pulse and electron
- Return type:
None
- create_multi_index_per_electron(h5_file)
Creates an index per electron using pulseId for usage with the electron resolved pandas DataFrame.
- Parameters:
h5_file (h5py.File) – The HDF5 file object.
- Return type:
None
Notes
- This method relies on the ‘pulseId’ channel to determine the macrobunch IDs.
- It creates a MultiIndex with trainId, pulseId, and electronId as the index levels.
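The three index levels can be illustrated with a sketch on plain Python data. It assumes that, for every train, the pulseId channel holds one pulse ID per detected electron and is padded with NaN, and that electronId simply counts the electrons within one pulse. Both the data layout and the function name are assumptions for illustration; the loader builds a pandas MultiIndex instead of a list of tuples.

```python
import math
from typing import List, Sequence, Tuple

NAN = float("nan")


def index_per_electron(train_ids: Sequence[int],
                       pulse_ids: Sequence[Sequence[float]]) -> List[Tuple[int, int, int]]:
    """Build (trainId, pulseId, electronId) tuples, skipping NaN padding."""
    index = []
    for train_id, row in zip(train_ids, pulse_ids):
        seen = {}  # number of electrons already found in each pulse
        for pulse in row:
            if math.isnan(pulse):
                continue
            pulse = int(pulse)
            electron_id = seen.get(pulse, 0)
            seen[pulse] = electron_id + 1
            index.append((train_id, pulse, electron_id))
    return index


index = index_per_electron(
    train_ids=[100, 101],
    pulse_ids=[[1, 1, 2, NAN], [3, NAN, NAN, NAN]],
)
```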
- create_multi_index_per_pulse(train_id, np_array)
Creates an index per pulse using a pulse resolved channel’s macrobunch ID, for usage with the pulse resolved pandas DataFrame.
- Parameters:
train_id (Series) – The train ID Series.
np_array (np.ndarray) – The numpy array containing the pulse resolved data.
- Return type:
None
Notes
This method creates a MultiIndex with trainId and pulseId as the index levels.
- create_numpy_array_per_channel(h5_file, channel)
Returns a numpy array for a given channel name for a given file.
- Parameters:
h5_file (h5py.File) – The h5py file object.
channel (str) – The name of the channel.
- Returns:
A tuple containing the train ID Series and the numpy array for the channel’s data.
- Return type:
Tuple[Series, np.ndarray]
- create_dataframe_per_electron(np_array, train_id, channel)
Returns a pandas DataFrame for a given channel name of type [per electron].
- Parameters:
np_array (np.ndarray) – The numpy array containing the channel data.
train_id (Series) – The train ID Series.
channel (str) – The name of the channel.
- Returns:
The pandas DataFrame for the channel’s data.
- Return type:
DataFrame
Notes
The microbunch resolved data is exploded and converted to a DataFrame. The MultiIndex is set, and the NaN values are dropped, alongside the pulseId = 0 (meaningless).
- create_dataframe_per_pulse(np_array, train_id, channel, channel_dict)
Returns a pandas DataFrame for a given channel name of type [per pulse].
- Parameters:
np_array (np.ndarray) – The numpy array containing the channel data.
train_id (Series) – The train ID Series.
channel (str) – The name of the channel.
channel_dict (dict) – The dictionary containing channel parameters.
- Returns:
The pandas DataFrame for the channel’s data.
- Return type:
DataFrame
Notes
For auxiliary channels, the macrobunch resolved data is repeated 499 times to be compared to electron resolved data for each auxiliary channel. The data is then converted to a multicolumn DataFrame.
For all other pulse resolved channels, the macrobunch resolved data is exploded to a DataFrame and the MultiIndex is set.
- create_dataframe_per_train(np_array, train_id, channel)
Returns a pandas DataFrame for a given channel name of type [per train].
- Parameters:
np_array (np.ndarray) – The numpy array containing the channel data.
train_id (Series) – The train ID Series.
channel (str) – The name of the channel.
- Returns:
The pandas DataFrame for the channel’s data.
- Return type:
DataFrame
- create_dataframe_per_channel(h5_file, channel)
Returns a pandas DataFrame for a given channel name from a given file.
This method takes an h5py.File object h5_file and a channel name channel, and returns a pandas DataFrame containing the data for that channel from the file. The format of the DataFrame depends on the channel’s format specified in the configuration.
- Parameters:
h5_file (h5py.File) – The h5py.File object representing the HDF5 file.
channel (str) – The name of the channel.
- Returns:
A pandas Series or DataFrame representing the channel’s data.
- Return type:
Union[Series, DataFrame]
- Raises:
ValueError – If the channel has an undefined format.
- concatenate_channels(h5_file)
Concatenates the channels from the provided h5py.File into a pandas DataFrame.
This method takes an h5py.File object h5_file and concatenates the channels present in the file into a single pandas DataFrame. The concatenation is performed based on the available channels specified in the configuration.
- Parameters:
h5_file (h5py.File) – The h5py.File object representing the HDF5 file.
- Returns:
A concatenated pandas DataFrame containing the channels.
- Return type:
DataFrame
- Raises:
ValueError – If the group_name for any channel does not exist in the file.
- create_dataframe_per_file(file_path)
Create pandas DataFrames for the given file.
This method loads an HDF5 file specified by file_path and constructs a pandas DataFrame from the datasets within the file. The order of datasets in the DataFrames is the opposite of the order specified by channel names.
- Parameters:
file_path (Path) – Path to the input HDF5 file.
- Returns:
pandas DataFrame
- Return type:
DataFrame
- create_buffer_file(h5_path, parquet_path)
Converts an HDF5 file to Parquet format to create a buffer file.
This method uses the create_dataframe_per_file method to create a dataframe from an individual HDF5 file. The resulting dataframe is then saved to a Parquet file.
- Parameters:
h5_path (Path) – Path to the input HDF5 file.
parquet_path (Path) – Path to the output Parquet file.
- Raises:
ValueError – If an error occurs during the conversion process.
- Return type:
typing.Union[bool,Exception]
- buffer_file_handler(data_parquet_dir, detector)
Handles the conversion of buffer files (h5 to parquet) and returns the filenames.
- Parameters:
data_parquet_dir (Path) – Directory where the parquet files will be stored.
detector (str) – Detector name.
- Returns:
Two lists, one for h5 file paths and one for corresponding parquet file paths.
- Return type:
Tuple[List[Path], List[Path]]
- Raises:
FileNotFoundError – If the conversion fails for any files or no data is available.
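The bookkeeping behind buffer files can be sketched as a mapping from each h5 path to a parquet path, plus a check for which buffers still need to be written. The naming scheme (`<h5 stem><detector>.parquet`), the function name, and the idea that existing buffers are skipped are assumptions made for this sketch; the actual conversion step is omitted.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_buffer_files(h5_paths: List[Path], parquet_dir: Path,
                      detector: str = "") -> Tuple[List[Path], List[Path]]:
    """Pair each h5 file with its parquet path and list the missing ones."""
    if not h5_paths:
        raise FileNotFoundError("No data available.")
    # Hypothetical naming scheme: <h5 stem><detector>.parquet
    parquet_paths = [parquet_dir / f"{p.stem}{detector}.parquet" for p in h5_paths]
    missing = [pq for pq in parquet_paths if not pq.exists()]
    return parquet_paths, missing


with tempfile.TemporaryDirectory() as tmp:
    buffer_dir = Path(tmp)
    raw = [Path("raw/run7_0.h5"), Path("raw/run7_1.h5")]
    (buffer_dir / "run7_0_det1.parquet").touch()  # already converted
    parquet_paths, missing = plan_buffer_files(raw, buffer_dir, "_det1")
    parquet_names = [p.name for p in parquet_paths]
    missing_names = [p.name for p in missing]
```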
- parquet_handler(data_parquet_dir, detector='', parquet_path=None, converted=False, load_parquet=False, save_parquet=False)
Handles loading and saving of parquet files based on the provided parameters.
- Parameters:
data_parquet_dir (Path) – Directory where the parquet files are located.
detector (str, optional) – Adds an identifier for parquets to distinguish multidetector systems.
parquet_path (str, optional) – Path to the combined parquet file.
converted (bool, optional) – True if data is augmented by adding additional columns externally and saved into converted folder.
load_parquet (bool, optional) – Loads the entire parquet into the dd dataframe.
save_parquet (bool, optional) – Saves the entire dataframe into a parquet.
- Returns:
A tuple containing two dataframes:
- dataframe_electron: Dataframe containing the loaded/augmented electron data.
- dataframe_pulse: Dataframe containing the loaded/augmented timed data.
- Return type:
tuple
- Raises:
FileNotFoundError – If the requested parquet file is not found.
- parse_metadata()
Uses the MetadataRetriever class to fetch metadata from scicat for each run.
- Returns:
Metadata dictionary
- Return type:
dict
- get_count_rate(fids=None, **kwds)
Create count rate data for the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
Arrays containing count rate and seconds into the scan.
- Return type:
Tuple[np.ndarray, np.ndarray]
- get_elapsed_time(fids=None, **kwds)
Return the elapsed time in the files specified in fids.
- Parameters:
fids (Sequence[int], optional) – The file ids to include. Defaults to list of all file ids.
kwds – Keyword arguments
- Returns:
The elapsed time in the files in seconds.
- Return type:
float
- read_dataframe(files=None, folders=None, runs=None, ftype='h5', metadata=None, collect_metadata=False, **kwds)
Read express data from the DAQ, generating a parquet in between.
- Parameters:
files (Union[str, Sequence[str]], optional) – File path(s) to process. Defaults to None.
folders (Union[str, Sequence[str]], optional) – Path to folder(s) where files are stored. Path has priority such that if it’s specified, the specified files will be ignored. Defaults to None.
runs (Union[str, Sequence[str]], optional) – Run identifier(s). Corresponding files will be located in the location provided by folders. Takes precedence over files and folders. Defaults to None.
ftype (str, optional) – The file extension type. Defaults to “h5”.
metadata (dict, optional) – Additional metadata. Defaults to None.
collect_metadata (bool, optional) – Whether to collect metadata. Defaults to False.
- Returns:
A tuple containing the concatenated DataFrame and metadata.
- Return type:
Tuple[dd.DataFrame, dict]
- Raises:
ValueError – If neither ‘runs’ nor ‘files’/‘data_raw_dir’ is provided.
FileNotFoundError – If the conversion fails for some files or no data is available.
- files: List[str]
- runs: List[str]
- metadata: Dict[Any, Any]
- sed.loader.flash.loader.LOADER
alias of
FlashLoader
3.6. Utilities
Utilities for loaders
- sed.loader.utils.gather_files(folder, extension, f_start=None, f_end=None, f_step=1, file_sorting=True)
Collects and sorts files with specified extension from a given folder.
- Parameters:
folder (str) – The folder to search
extension (str) – File extension used for glob.glob().
f_start (int, optional) – Start file id used to construct a file selector. Defaults to None.
f_end (int, optional) – End file id used to construct a file selector. Defaults to None.
f_step (int, optional) – Step of file id incrementation, used to construct a file selector. Defaults to 1.
file_sorting (bool, optional) – Option to sort the files by their names. Defaults to True.
- Returns:
List of collected file names.
- Return type:
List[str]
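A minimal sketch of the collect, sort and select steps is shown below. It assumes that the file selector built from f_start, f_end and f_step is an ordinary slice over the sorted file list and that it is applied only when both f_start and f_end are given. It also uses plain lexicographic sorting. These points, like the function name, are assumptions for illustration rather than the behaviour of gather_files.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def gather_files_sketch(folder: str, extension: str,
                        f_start: Optional[int] = None, f_end: Optional[int] = None,
                        f_step: int = 1, file_sorting: bool = True) -> List[str]:
    """Collect files with the given extension and optionally select a subset."""
    files = [str(p) for p in Path(folder).glob(f"*.{extension}")]
    if file_sorting:
        files = sorted(files)  # plain lexicographic sort
    # Assumption: the selector is a slice over the (sorted) file list.
    if f_start is not None and f_end is not None:
        files = files[f_start:f_end:f_step]
    return files


with tempfile.TemporaryDirectory() as tmp:
    for i in range(6):
        (Path(tmp) / f"scan_{i}.h5").touch()
    (Path(tmp) / "readme.txt").touch()
    all_files = [Path(f).name for f in gather_files_sketch(tmp, "h5")]
    subset = [Path(f).name for f in gather_files_sketch(tmp, "h5", 1, 5, 2)]
```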
- sed.loader.utils.parse_h5_keys(h5_file, prefix='')
Helper method which parses the channels present in the h5 file.
- Parameters:
h5_file (h5py.File) – The H5 file object.
prefix (str, optional) – The prefix for the channel names. Defaults to an empty string.
- Returns:
A list of channel names in the H5 file.
- Return type:
List[str]
- Raises:
Exception – If an error occurs while parsing the keys.
- sed.loader.utils.split_channel_bitwise(df, input_column, output_columns, bit_mask, overwrite=False, types=None)
Splits a channel into two channels bitwise.
This function splits a channel into two channels by separating the first n bits from the remaining bits. The first n bits are stored in the first output column, the remaining bits are stored in the second output column.
- Parameters:
df (dask.dataframe.DataFrame) – Dataframe to use.
input_column (str) – Name of the column to split.
output_columns (Sequence[str]) – Names of the columns to create.
bit_mask (int) – Bit mask to use for splitting.
overwrite (bool, optional) – Whether to overwrite existing columns. Defaults to False.
types (Sequence[type], optional) – Types of the new columns.
- Returns:
Dataframe with the new columns.
- Return type:
dask.dataframe.DataFrame
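The bitwise split can be shown on a single integer. The sketch assumes that "the first n bits" are the n least significant bits and that bit_mask gives this number n; the documented function operates on whole dataframe columns, whereas split_bitwise below is a hypothetical scalar helper.

```python
from typing import Tuple


def split_bitwise(value: int, n_bits: int) -> Tuple[int, int]:
    """Split value into its lowest n_bits bits and the remaining upper bits."""
    low = value & ((1 << n_bits) - 1)  # first n bits
    high = value >> n_bits             # remaining bits
    return low, high


low, high = split_bitwise(0b110101, 3)
recombined = (high << 3) | low  # the split loses no information
```

Shifting the upper part back and combining it with the lower part recovers the original value, which is a quick way to check a chosen bit count.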
module sed.loader.mirrorutil, code for transparently mirroring file system trees to a second (local) location. This speeds up binning of data stored on network drives tremendously. Mostly ported from https://github.com/mpes-kit/mpes. @author: L. Rettig
- class sed.loader.mirrorutil.CopyTool(source, dest, **kwds)
Bases: object
File collecting and sorting class.
- Parameters:
source (str) – Source path for the copy tool.
dest (str) – Destination path for the copy tool.
- copy(source, force_copy=False, **compute_kwds)
Local file copying method.
- Parameters:
source (str) – source path
force_copy (bool, optional) – re-copy all files. Defaults to False.
- Raises:
FileNotFoundError – Raised if the source path is not found or empty.
OSError – Raised if the target disk is full.
- Returns:
Path of the copied source directory mapped into the target tree
- Return type:
str
- size(sdir)
Calculate file size.
- Parameters:
sdir (str) – Path to source directory
- Returns:
Size of files in source path
- Return type:
int
- cleanup_oldest_scan(force=False, report=False)
Remove scans in old directories. Looks for the directory with the oldest ctime and asks the user to confirm its deletion.
- Parameters:
force (bool, optional) – Forces to automatically remove the oldest scan. Defaults to False.
report (bool, optional) – Print a report with all directories in dest, sorted by age. Defaults to False.
- Raises:
FileNotFoundError – Raised if no scans to remove are found.
- sed.loader.mirrorutil.get_target_dir(sdir, source, dest, gid, mode, create=False)
Retrieve target directory.
- Parameters:
sdir (str) – Source directory to copy
source (str) – source root path
dest (str) – destination root path
gid (int) – Group id
mode (int) – Unix mode
create (bool, optional) – Whether to create directories. Defaults to False.
- Raises:
NotADirectoryError – Raised if sdir is not a directory
ValueError – Raised if sdir not inside of source
- Returns:
The mapped target directory inside dest
- Return type:
str
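Mapping a source directory into the destination tree comes down to taking its path relative to the source root and appending it to the destination root. The sketch below follows the documented error cases but leaves out the gid and mode handling; map_to_target is a hypothetical name and the implementation is an illustration using only os.path, not the library code.

```python
import os
import tempfile


def map_to_target(sdir: str, source: str, dest: str, create: bool = False) -> str:
    """Map sdir, located below source, to the same relative path below dest."""
    if not os.path.isdir(sdir):
        raise NotADirectoryError(f"{sdir} is not a directory")
    sdir, source = os.path.realpath(sdir), os.path.realpath(source)
    if os.path.commonpath([sdir, source]) != source:
        raise ValueError(f"{sdir} is not inside {source}")
    target = os.path.join(dest, os.path.relpath(sdir, source))
    if create:
        os.makedirs(target, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "network")
    dest = os.path.join(tmp, "local")
    scan = os.path.join(source, "2023", "scan_001")
    os.makedirs(scan)
    os.makedirs(dest)
    target = map_to_target(scan, source, dest, create=True)
    relative_target = os.path.relpath(target, dest)
    target_created = os.path.isdir(target)
    try:
        map_to_target(dest, source, dest)  # dest is not below source
        outside_raises = False
    except ValueError:
        outside_raises = True
```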
- sed.loader.mirrorutil.mymakedirs(path, mode, gid)
Creates a directory path iteratively from its root
- Parameters:
path (str) – Path of the directory to create
mode (int) – Unix access mode of created directories
gid (int) – Group id of created directories
- Returns:
Path of created directories
- Return type:
str
- sed.loader.mirrorutil.mycopy(source, dest, gid, mode, replace=False)
Copy function with option to delete the target file first (to take ownership).
- Parameters:
source (str) – Path to the source file
dest (str) – Path to the destination file
gid (int) – Group id to be set for the destination file
mode (int) – Unix access mode to be set for the destination file
replace (bool, optional) – Option to replace an existing file. Defaults to False.