2. Binning
sed.binning module easy access APIs
- sed.binning.bin_dataframe(df, bins=100, axes=None, ranges=None, hist_mode='numba', mode='fast', jitter=None, pbar=True, n_cores=1, threads_per_worker=4, threadpool_api='blas', return_partitions=False, **kwds)
Computes the n-dimensional histogram on columns of a dataframe, parallelized.
- Parameters
df (dask.dataframe.core.DataFrame) – A dask.DataFrame on which to perform the histogram.
bins (typing.Union[int, dict, tuple, typing.List[int], typing.List[numpy.ndarray], typing.List[tuple]], default: 100) – Definition of the bins. Can be any of the following:
  - an integer describing the number of bins on all dimensions
  - a tuple of 3 numbers describing start, end and step of the binning range
  - a np.ndarray defining the binning edges
  - a list (NOT a tuple) of any of the above (int, tuple or np.ndarray)
  - a dictionary with the axes as keys and any of the above as values. This takes priority over the axes and ranges arguments.
axes (typing.Union[str, typing.Sequence[str], None], default: None) – The names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array.
ranges (typing.Optional[typing.Sequence[typing.Tuple[float, float]]], default: None) – List of tuples containing the start and end point of the binning range.
hist_mode (str, default: 'numba') – Histogram calculation method. Choose between "numpy", which uses numpy.histogramdd, and "numba", which uses a similar numba-powered method.
mode (str, default: 'fast') – Defines how the results from each partition are combined. Available modes are 'fast', 'lean' and 'legacy'.
jitter (typing.Union[list, dict, None], default: None) – A list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise), a dictionary can be passed instead, for example jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, which applies when None is passed in the dictionary or when jitter is a list of strings.
pbar (bool, default: True) – Set to False to deactivate the tqdm progress bar.
n_cores (int, default: 1) – Number of CPU cores to use for parallelization. Defaults to all but one of the available cores.
threads_per_worker (int, default: 4) – Limit on the number of threads that multiprocessing can spawn.
threadpool_api (str, default: 'blas') – The API to use for multiprocessing.
return_partitions (bool, default: False) – Option to return a hypercube of dimension n+1, where the last dimension corresponds to the dataframe partitions.
kwds – Passed to dask.compute().
- Raises
Warning – Warns if there are unimplemented features the user is trying to use.
ValueError – Raised when there is a mismatch in dimensions between the binning parameters.
- Returns
xarray.core.dataarray.DataArray – The result of the n-dimensional binning, represented in an xarray object that combines the data with the axes.
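The relationship between per-partition histograms, the n+1 dimensional hypercube described for return_partitions, and the combined result can be sketched with plain NumPy. This is a minimal illustration, not the sed implementation: it does not call sed.binning, the data and variable names are hypothetical, and summing over the partition axis is an assumption about how partial results are combined. It relies only on numpy.histogramdd, which the "numpy" hist_mode is documented to use.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 1000 events with two columns, split into 4 partitions.
events = rng.uniform(0.0, 1.0, size=(1000, 2))
partitions = np.array_split(events, 4)

bins = [10, 20]                     # number of bins per axis
ranges = [(0.0, 1.0), (0.0, 1.0)]   # (start, end) per axis

# Histogram every partition on identical bin edges.
partial = [np.histogramdd(p, bins=bins, range=ranges)[0] for p in partitions]

# Stack along a new last axis: an (n+1)-dimensional array whose last
# dimension indexes the partitions.
stacked = np.stack(partial, axis=-1)

# Assumption: the combined histogram is the sum over the partition axis.
total = stacked.sum(axis=-1)

# Reference: histogram of the full data set in one call.
direct, edges = np.histogramdd(events, bins=bins, range=ranges)
```

Because every partition is binned on the same edges, the summed result equals the histogram computed over all events at once.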
- sed.binning.bin_partition(part, bins=100, axes=None, ranges=None, hist_mode='numba', jitter=None, return_edges=False, skip_test=False)
Compute the n-dimensional histogram of a single dataframe partition.
- Parameters
part (typing.Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) – Dataframe on which to perform the histogram. Usually a partition of a dask DataFrame.
bins (typing.Union[int, dict, tuple, typing.List[int], typing.List[numpy.ndarray], typing.List[tuple]], default: 100) – Definition of the bins. Can be any of the following:
  - an integer describing the number of bins on all dimensions
  - a tuple of 3 numbers describing start, end and step of the binning range
  - a np.ndarray defining the binning edges
  - a list (NOT a tuple) of any of the above (int, tuple or np.ndarray)
  - a dictionary with the axes as keys and any of the above as values. This takes priority over the axes and ranges arguments.
axes (typing.Union[str, typing.Sequence[str], None], default: None) – The names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array.
ranges (typing.Optional[typing.Sequence[typing.Tuple[float, float]]], default: None) – List of tuples containing the start and end point of the binning range.
hist_mode (str, default: 'numba') – Histogram calculation method. Choose between "numpy", which uses numpy.histogramdd, and "numba", which uses a similar numba-powered method.
jitter (typing.Union[list, dict, None], default: None) – A list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise), a dictionary can be passed instead, for example jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, which applies when None is passed in the dictionary or when jitter is a list of strings. Warning: this is not the most performant approach; applying jitter to the dataframe before calling the binning is much faster.
return_edges (bool, default: False) – If True, returns a list of D arrays describing the bin edges for each dimension, similar to the behaviour of np.histogramdd.
skip_test (bool, default: False) – Turns off input checks and data transformation. Defaults to False, as it is intended for internal use only. Warning: setting this to True might make error tracking difficult.
- Raises
ValueError – When the method requested is not available.
AttributeError – If bins, axes and ranges are not congruent in dimensionality.
KeyError – When the columns along which to compute the histogram are not present in the dataframe.
- Returns
typing.Union[numpy.ndarray, typing.Tuple[numpy.ndarray, list]] – Only hist is returned, unless return_edges is True, in which case a 2-element tuple (hist, edges) is returned.
- hist: The result of the n-dimensional binning.
- edges: A list of D arrays describing the bin edges for each dimension.
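The jitter warning above suggests adding the noise to the data before binning. The following sketch shows that idea on a single hypothetical, integer-valued column, using NumPy only. It does not call sed.binning; the column, the bin count and the range are made up for illustration, and interpreting an amplitude of 0.5 with uniform mode as noise drawn from [-0.5, 0.5) is an assumption. The (hist, edges) pair returned by numpy.histogramdd mirrors the return shape described for return_edges=True.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical digitized column: integer values 0..9 stored as floats.
x = rng.integers(0, 10, size=5000).astype(float)

# Assumption: uniform jitter with amplitude 0.5 means noise in [-0.5, 0.5).
amplitude = 0.5
x_jittered = x + rng.uniform(-amplitude, amplitude, size=x.shape)

# Bin the jittered column; histogramdd expects a sample of shape (N, D).
hist, edges = np.histogramdd(
    x_jittered[:, None],
    bins=[25],
    range=[(-0.5, 9.5)],
)
```

Here 25 bins over 10 integer values would leave artificial gaps and spikes in the unjittered data; spreading each value over its unit interval first distributes the counts across neighbouring bins.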