4. Binning
4.1. Main functions
sed.binning module easy access APIs
- sed.binning.bin_dataframe(df, bins=100, axes=None, ranges=None, hist_mode='numba', mode='fast', jitter=None, pbar=True, n_cores=1, threads_per_worker=4, threadpool_api='blas', return_partitions=False, **kwds)
Computes the n-dimensional histogram on columns of a dataframe, parallelized.
- Parameters:
df (dask.dataframe.DataFrame) – a dask.DataFrame on which to perform the histogram.
bins (int, dict, Sequence[int], Sequence[np.ndarray], Sequence[tuple], optional) – Definition of the bins. Can be any of the following cases:
an integer describing the number of bins for all dimensions. This requires “ranges” to be defined as well.
A sequence containing one entry of the following types for each dimension:
an integer describing the number of bins. This requires “ranges” to be defined as well.
an np.ndarray defining the bin centers.
a tuple of 3 numbers describing start, end and step of the binning range.
a dictionary made of the axes as keys and any of the above as values.
The last option takes priority over the axes and range arguments. Defaults to 100.
axes (Sequence[str], optional) – Sequence containing the names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array. Required unless bins is provided as a dictionary containing the axis names. Defaults to None.
ranges (Sequence[Tuple[float, float]], optional) – Sequence of tuples containing the start and end point of the binning range. Required if bins given as int or Sequence[int]. Defaults to None.
hist_mode (str, optional) –
Histogram calculation method.
"numpy": use numpy.histogramdd.
"numba": use a similar method powered by numba.
Defaults to "numba".
mode (str, optional) –
Defines how the results from each partition are combined.
'fast': Uses parallelized recombination of results.
'lean': Stores all partition results in a list and recombines them at the end.
'legacy': Single-core recombination of partition results.
Defaults to “fast”.
jitter (Union[list, dict], optional) – a list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise), a dictionary can be passed. This should look like jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, which is used when None is passed in the dictionary or when jitter is a list of strings. Warning: this is not the most performant approach. Applying jitter on the dataframe before calling the binning is much faster. Defaults to None.
pbar (bool, optional) – Option to show the tqdm progress bar. Defaults to True.
n_cores (int, optional) – Number of CPU cores to use for parallelization. Defaults to all but one of the available cores (N_CPU-1).
threads_per_worker (int, optional) – Limit the number of threads that multiprocessing can spawn. Defaults to 4.
threadpool_api (str, optional) – The API to use for multiprocessing. Defaults to “blas”.
return_partitions (bool, optional) – Option to return a hypercube of dimension n+1, where the last dimension corresponds to the dataframe partitions. Defaults to False.
**kwds – Keyword arguments passed to dask.compute().
- Raises:
Warning – Warns if there are unimplemented features the user is trying to use.
ValueError – Raised when there is a mismatch in dimensions between the binning parameters.
- Returns:
The result of the n-dimensional binning represented in an xarray object, combining the data with the axes (bin centers).
- Return type:
xr.DataArray
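The accepted forms of bins are easier to compare side by side. The sketch below builds three argument sets that follow the parameter description above. The column names "X", "Y" and "t" and all numeric values are illustrative assumptions, not taken from the package. The call itself is wrapped in a function with a lazy import, so the argument sets can be inspected without sed installed.

```python
# Three argument sets for sed.binning.bin_dataframe, following the parameter
# description above. Column names and numbers are illustrative assumptions.

# 1) One integer for all dimensions: "axes" and "ranges" must be given.
args_int = {
    "bins": 100,
    "axes": ["X", "Y", "t"],
    "ranges": [(0.0, 1800.0), (0.0, 1800.0), (0.0, 150000.0)],
}

# 2) One integer per dimension: "axes" and "ranges" are still required.
args_seq = {
    "bins": [80, 80, 200],
    "axes": ["X", "Y", "t"],
    "ranges": [(0.0, 1800.0), (0.0, 1800.0), (0.0, 150000.0)],
}

# 3) Dictionary keyed by axis name: the axes come from the keys, so no
#    "axes" argument is passed. Integer values still need "ranges".
args_dict = {
    "bins": {"X": 80, "Y": 80, "t": 200},
    "ranges": [(0.0, 1800.0), (0.0, 1800.0), (0.0, 150000.0)],
}


def run_binning(df, args):
    """Bin a dask dataframe `df` with one of the argument sets above."""
    from sed.binning import bin_dataframe  # lazy import; requires sed

    return bin_dataframe(df, **args)
```

According to the return description above, each call would give an xr.DataArray whose dimensions follow the order of the axes.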
- sed.binning.bin_partition(part, bins=100, axes=None, ranges=None, hist_mode='numba', jitter=None, return_edges=False, skip_test=False)
Compute the n-dimensional histogram of a single dataframe partition.
- Parameters:
part (Union[dask.dataframe.DataFrame, pd.DataFrame]) – dataframe on which to perform the histogram. Usually a partition of a dask DataFrame.
bins (int, dict, Sequence[int], Sequence[np.ndarray], Sequence[tuple], optional) –
Definition of the bins. Can be any of the following cases:
an integer describing the number of bins for all dimensions. This requires “ranges” to be defined as well.
A sequence containing one entry of the following types for each dimension:
an integer describing the number of bins. This requires “ranges” to be defined as well.
an np.ndarray defining the bin centers.
a tuple of 3 numbers describing start, end and step of the binning range.
a dictionary made of the axes as keys and any of the above as values.
The last option takes priority over the axes and range arguments. Defaults to 100.
axes (Sequence[str], optional) – Sequence containing the names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array. Required unless bins is provided as a dictionary containing the axis names. Defaults to None.
ranges (Sequence[Tuple[float, float]], optional) – Sequence of tuples containing the start and end point of the binning range. Required if bins given as int or Sequence[int]. Defaults to None.
hist_mode (str, optional) –
Histogram calculation method.
"numpy": use numpy.histogramdd.
"numba": use a similar method powered by numba.
Defaults to "numba".
jitter (Union[list, dict], optional) – a list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise), a dictionary can be passed. This should look like jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, which is used when None is passed in the dictionary or when jitter is a list of strings. Warning: this is not the most performant approach. Applying jitter on the dataframe before calling the binning is much faster. Defaults to None.
return_edges (bool, optional) – If True, returns a list of D arrays describing the bin edges for each dimension, similar to the behaviour of np.histogramdd. Defaults to False.
skip_test (bool, optional) – Turns off input check and data transformation. Defaults to False as it is intended for internal use only. Warning: setting this to True might make error tracking difficult.
- Raises:
ValueError – When the method requested is not available.
AttributeError – If bins, axes and ranges are not congruent in dimensionality.
KeyError – When the columns along which to compute the histogram are not present in the dataframe.
- Returns:
2-element tuple returned only when return_edges is True. Otherwise only hist is returned.
hist: The result of the n-dimensional binning
edges: A list of D arrays describing the bin edges for each dimension.
- Return type:
Union[np.ndarray, Tuple[np.ndarray, list]]
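The jitter warning above recommends adding the noise to the data once, before binning. A minimal sketch of uniform jitter on a plain list of values follows. It assumes that an amplitude of 0.5 means noise drawn from the interval [-0.5, 0.5]; that interpretation and the name add_uniform_jitter are assumptions made for illustration, not the package's implementation.

```python
import random


def add_uniform_jitter(values, amplitude=0.5, seed=None):
    """Return a copy of `values` with uniform noise in [-amplitude, amplitude] added."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable noise
    return [v + rng.uniform(-amplitude, amplitude) for v in values]


# Integer channel numbers, as a digitizing detector might produce them.
channels = [10, 10, 11, 12, 12, 12]
jittered = add_uniform_jitter(channels, amplitude=0.5, seed=0)
```

Spreading each integer value over a unit-wide interval is one way to avoid artefacts when digitized data is binned on a grid that does not line up with the integer steps.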
4.2. Used helper functions
This file contains code for binning using numba precompiled code for the sed.binning module
- sed.binning.numba_bin.binsearch(bins, val)
Bisection index search function.
Finds the index of the bin with the highest value below val, i.e. the left edge. Returns -1 when the value is outside the bin range.
- Parameters:
bins (np.ndarray) – the array on which to perform the search
val (float) – value to search for
- Returns:
index of the bin array, returns -1 when value is outside the bins range
- Return type:
int
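The bisection search described above can be sketched in plain Python as follows. This is an illustration of the technique under stated assumptions, not the package's Numba code: the name binsearch_sketch is hypothetical, the edges are assumed to be sorted in ascending order, and a value equal to the last edge is treated as outside the range.

```python
def binsearch_sketch(bins, val):
    """Return i such that bins[i] <= val < bins[i + 1], or -1 if val is outside."""
    low, high = 0, len(bins) - 1
    if val < bins[low] or val >= bins[high]:
        return -1
    # Invariant: bins[low] <= val < bins[high]; halve the interval each step.
    while high - low > 1:
        mid = (low + high) // 2
        if val < bins[mid]:
            high = mid
        else:
            low = mid
    return low


edges = [0.0, 1.0, 2.0, 4.0]
inside = binsearch_sketch(edges, 3.9)   # falls in the bin starting at 2.0
outside = binsearch_sketch(edges, 5.0)  # beyond the last edge
```

Because the interval is halved on every iteration, the search needs on the order of log2(len(bins)) comparisons, which also works for unevenly spaced edges.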
- sed.binning.numba_bin.numba_histogramdd(sample, bins, ranges=None)
Multidimensional histogramming function, powered by Numba.
Behaves much like numpy.histogramdd, but returns uint32 arrays. This type was chosen because it gives a significant performance improvement over uint64 for large binning volumes. Be aware that this can cause overflows for very large sample sets that put more than about 4E9 counts in a single bin (the uint32 maximum is 2^32 - 1, roughly 4.29E9). This should never happen in a realistic photoemission experiment with useful bin sizes.
- Parameters:
sample (np.ndarray) – The data to be histogrammed with shape N,D
bins (Union[int, Sequence[int], Sequence[np.ndarray], np.ndarray]) – The number of bins for each dimension D, or a sequence of bin edges on which to calculate the histogram.
ranges (Sequence, optional) – The range(s) to use for binning when bins is a sequence of integers or sequence of arrays. Defaults to None.
- Raises:
ValueError – In case of dimension mismatch.
TypeError – Wrong type for bins.
ValueError – In case of wrong shape of bins
RuntimeError – Internal shape error after binning
- Returns:
2-element tuple of the computed histogram and a list of D arrays describing the bin edges for each dimension.
hist: The computed histogram
edges: A list of D arrays describing the bin edges for each dimension.
- Return type:
Tuple[np.ndarray, List[np.ndarray]]
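To make the input and output layout concrete, here is a plain-Python sketch of a multidimensional histogram on a uniform grid. It is not the Numba implementation: the name histogramdd_sketch is hypothetical, only integer bin counts with explicit ranges are handled, the histogram is returned as a flat list in row-major order, and values equal to the upper range limit are skipped (numpy.histogramdd would count them in the last bin).

```python
def histogramdd_sketch(sample, bins, ranges):
    """Count rows of `sample` (N rows, D columns) on a uniform D-dimensional grid.

    Returns (hist, edges): hist is a flat list in row-major (C) order,
    edges is a list of D lists holding bins[d] + 1 edges each.
    """
    ndim = len(bins)
    if len(ranges) != ndim:
        raise ValueError("bins and ranges must have one entry per dimension")
    edges = [
        [lo + i * (hi - lo) / n for i in range(n + 1)]
        for n, (lo, hi) in zip(bins, ranges)
    ]
    size = 1
    for n in bins:
        size *= n
    hist = [0] * size
    for row in sample:
        flat = 0
        for d in range(ndim):
            lo, hi = ranges[d]
            if not lo <= row[d] < hi:
                break  # row lies outside the grid: skip it
            idx = int((row[d] - lo) / (hi - lo) * bins[d])
            idx = min(idx, bins[d] - 1)  # guard against float rounding at the top edge
            flat = flat * bins[d] + idx
        else:
            hist[flat] += 1  # reached only if no dimension was out of range
    return hist, edges


sample = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.95, 0.8), (1.5, 0.5)]
hist, edges = histogramdd_sketch(sample, bins=[2, 2], ranges=[(0.0, 1.0), (0.0, 1.0)])
# hist is [1, 0, 1, 2]; the last row is outside the range and is not counted
```

With a uniform grid the bin index follows from one division per dimension; a bisection search becomes necessary only when the edges are unevenly spaced.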
This file contains helper functions for the sed.binning module
- sed.binning.utils.simplify_binning_arguments(bins, axes=None, ranges=None)
Convert the flexible input for defining bins into a simple “axes”, “bins”, “ranges” tuple.
This brings the flexibility of the numpy.histogramdd input to the binning functions defined here.
- Parameters:
bins (int, dict, Sequence[int], Sequence[np.ndarray], Sequence[tuple]) –
Definition of the bins. Can be any of the following cases:
an integer describing the number of bins for all dimensions. This requires “ranges” to be defined as well.
A sequence containing one entry of the following types for each dimension:
an integer describing the number of bins. This requires “ranges” to be defined as well.
an np.ndarray defining the bin centers.
a tuple of 3 numbers describing start, end and step of the binning range.
a dictionary made of the axes as keys and any of the above as values.
The last option takes priority over the axes and range arguments.
axes (Sequence[str], optional) – Sequence containing the names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array. Required unless bins is provided as a dictionary containing the axis names. Defaults to None.
ranges (Sequence[Tuple[float, float]], optional) – Sequence of tuples containing the start and end point of the binning range. Required if bins given as int or Sequence[int]. Defaults to None.
- Raises:
ValueError – Wrong shape of bins
TypeError – Wrong type of bins
AttributeError – Axes not defined
AttributeError – Shape mismatch
- Returns:
Tuple containing lists of bin centers, axes, and ranges.
- Return type:
Tuple[Union[List[int], List[np.ndarray]], List[Tuple[float, float]]]
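The normalization described above can be illustrated for the integer and dictionary cases. The following is a reduced sketch, not the package's code: simplify_bins_sketch is a hypothetical name, array and tuple entries are not handled, and the error messages are invented for the example. The exception types follow the Raises list above.

```python
def simplify_bins_sketch(bins, axes=None, ranges=None):
    """Reduce int, list-of-int or dict-of-int `bins` to (bins, axes, ranges) lists."""
    if isinstance(bins, dict):
        # Dictionary keys take priority over the `axes` argument.
        axes = list(bins.keys())
        bins = list(bins.values())
    if axes is None:
        raise AttributeError("axes must be given unless bins is a dictionary")
    if isinstance(bins, int):
        bins = [bins] * len(axes)  # same number of bins on every axis
    bins = list(bins)
    if len(bins) != len(axes):
        raise AttributeError("bins and axes must have the same length")
    if all(isinstance(b, int) for b in bins):
        if ranges is None or len(ranges) != len(axes):
            raise AttributeError("integer bins need one (start, end) range per axis")
        ranges = [tuple(r) for r in ranges]
    return bins, axes, ranges


from_int = simplify_bins_sketch(50, axes=["X", "Y"], ranges=[(0, 1), (0, 2)])
from_dict = simplify_bins_sketch({"X": 10, "t": 20}, axes=["ignored"], ranges=[(0, 1), (0, 5)])
```

In the second call the axes argument is overridden by the dictionary keys, which mirrors the priority rule stated for the dictionary form.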
- sed.binning.utils.bin_edges_to_bin_centers(bin_edges)
Converts a list of bin edges into corresponding bin centers
- Parameters:
bin_edges (numpy.ndarray) – 1d array of bin edges
- Returns:
1d array of bin centers
- Return type:
bin_centers
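As an illustration of the conversion, each bin center is the midpoint of two neighbouring edges, so N + 1 edges give N centers. The sketch below uses plain lists and a hypothetical function name; it is not the package's implementation.

```python
def edges_to_centers_sketch(bin_edges):
    """Return the midpoints between consecutive edges (one fewer than the edges)."""
    return [(left + right) / 2 for left, right in zip(bin_edges, bin_edges[1:])]


edges = [0.0, 1.0, 2.0, 4.0]
centers = edges_to_centers_sketch(edges)  # [0.5, 1.5, 3.0]
```

With NumPy arrays the same midpoints can be written as (edges[:-1] + edges[1:]) / 2.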
- sed.binning.utils.bin_centers_to_bin_edges(bin_centers)
Converts a list of bin centers into corresponding bin edges
- Parameters:
bin_centers (numpy.ndarray) – 1d array of bin centers
- Returns:
1d array of bin edges
- Return type:
bin_edges
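Going from centers to edges needs an assumption about the two outermost edges, because N centers only fix the N - 1 inner edges. The sketch below extrapolates each outer edge by half the spacing of the nearest pair of centers. That choice and the function name are assumptions made for illustration, not necessarily what the package does.

```python
def centers_to_edges_sketch(bin_centers):
    """Return len(bin_centers) + 1 edges; needs at least two centers."""
    if len(bin_centers) < 2:
        raise ValueError("at least two bin centers are required")
    # Inner edges sit halfway between neighbouring centers.
    inner = [(a + b) / 2 for a, b in zip(bin_centers, bin_centers[1:])]
    # Outer edges are extrapolated by half the adjacent spacing.
    first = bin_centers[0] - (bin_centers[1] - bin_centers[0]) / 2
    last = bin_centers[-1] + (bin_centers[-1] - bin_centers[-2]) / 2
    return [first] + inner + [last]


centers = [0.5, 1.5, 2.5]
edges = centers_to_edges_sketch(centers)  # [0.0, 1.0, 2.0, 3.0]
```

For evenly spaced centers this is the exact inverse of the edge-to-center conversion; for uneven spacing the original edges are generally not recovered.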