2. Binning

sed.binning module easy access APIs

sed.binning.bin_dataframe(df, bins=100, axes=None, ranges=None, hist_mode='numba', mode='fast', jitter=None, pbar=True, n_cores=1, threads_per_worker=4, threadpool_api='blas', return_partitions=False, **kwds)

Computes the n-dimensional histogram on columns of a dataframe, parallelized.

Parameters
  • df (dask.dataframe.core.DataFrame) – a dask.DataFrame on which to perform the histogram.

  • bins (typing.Union[int, dict, tuple, typing.List[int], typing.List[numpy.ndarray], typing.List[tuple]], default: 100) – Definition of the bins. Can be any of the following:

      • an integer describing the number of bins on all dimensions

      • a tuple of 3 numbers describing start, end and step of the binning range

      • a np.ndarray defining the binning edges

      • a list (NOT a tuple) of any of the above (int, tuple or np.ndarray)

      • a dictionary with the axes as keys and any of the above as values. This takes priority over the axes and ranges arguments.

  • axes (typing.Union[str, typing.Sequence[str], None], default: None) – The names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array.

  • ranges (typing.Optional[typing.Sequence[typing.Tuple[float, float]]], default: None) – list of tuples containing the start and end point of the binning range.

  • hist_mode (str, default: 'numba') – Histogram calculation method. Choose between “numpy”, which uses numpy.histogramdd, and “numba”, which uses a similar numba-powered method.

  • mode (str, default: 'fast') – Defines how the results from each partition are combined. Available modes are ‘fast’, ‘lean’ and ‘legacy’.

  • jitter (typing.Union[list, dict, None], default: None) – a list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise) a dictionary can be passed. This should look like jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, in case None is passed in the dictionary, or jitter is a list of strings.

  • pbar (bool, default: True) – Allows to deactivate the tqdm progress bar.

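  • n_cores (int, default: 1) – Number of CPU cores to use for parallelization. Defaults to all but one of the available cores; the value 1 in the signature above appears to be that default as evaluated on the machine that built this documentation.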

  • threads_per_worker (int, default: 4) – Limit the number of threads that multiprocessing can spawn.

  • threadpool_api (str, default: 'blas') – The API to use for multiprocessing.

  • return_partitions (bool, default: False) – Option to return a hypercube of dimension n+1, where the last dimension corresponds to the dataframe partitions.

  • kwds – passed to dask.compute()

Raises
  • Warning – Warns if there are unimplemented features the user is trying to use.

  • ValueError – Raised when there is a mismatch in dimensions between the binning parameters.

Returns

xarray.core.dataarray.DataArray – The result of the n-dimensional binning represented in an xarray object, combining the data with the axes.
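
The partition-wise strategy described above can be illustrated with plain NumPy. The sketch below is not sed's implementation. It assumes that each partition is histogrammed on shared bin edges with numpy.histogramdd and that the partial results are summed; stacking them along a trailing axis mimics what return_partitions=True is described to return. The helper name bin_partitions_sketch is hypothetical, and np.array_split stands in for the partitions of a dask DataFrame.

```python
import numpy as np


def bin_partitions_sketch(partitions, bins, ranges, return_partitions=False):
    """Histogram each partition on shared edges and combine the results.

    Hypothetical helper for illustration only; not part of sed.
    """
    # Identical bins and ranges for every partition keep the arrays compatible.
    hists = [np.histogramdd(p, bins=bins, range=ranges)[0] for p in partitions]
    if return_partitions:
        # (n+1)-dimensional result: the last axis indexes the partitions.
        return np.stack(hists, axis=-1)
    # Counts are additive, so the sum is the histogram of the full dataset.
    return np.sum(hists, axis=0)


rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(1000, 2))  # 1000 events, 2 axes
partitions = np.array_split(data, 4)          # stand-in for dask partitions
bins, ranges = [10, 20], [(0.0, 1.0), (0.0, 1.0)]

combined = bin_partitions_sketch(partitions, bins, ranges)
stacked = bin_partitions_sketch(partitions, bins, ranges, return_partitions=True)
full, _ = np.histogramdd(data, bins=bins, range=ranges)
```

Because every partition uses the same edges, combined matches the histogram computed on the whole array in one call, and stacked carries one extra trailing dimension of length 4.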

sed.binning.bin_partition(part, bins=100, axes=None, ranges=None, hist_mode='numba', jitter=None, return_edges=False, skip_test=False)

Compute the n-dimensional histogram of a single dataframe partition.

Parameters
  • part (typing.Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) – dataframe on which to perform the histogram. Usually a partition of a dask DataFrame.

  • bins (typing.Union[int, dict, tuple, typing.List[int], typing.List[numpy.ndarray], typing.List[tuple]], default: 100) – Definition of the bins. Can be any of the following:

      • an integer describing the number of bins on all dimensions

      • a tuple of 3 numbers describing start, end and step of the binning range

      • a np.ndarray defining the binning edges

      • a list (NOT a tuple) of any of the above (int, tuple or np.ndarray)

      • a dictionary with the axes as keys and any of the above as values. This takes priority over the axes and ranges arguments.

  • axes (typing.Union[str, typing.Sequence[str], None], default: None) – The names of the axes (columns) on which to calculate the histogram. The order will be the order of the dimensions in the resulting array.

  • ranges (typing.Optional[typing.Sequence[typing.Tuple[float, float]]], default: None) – list of tuples containing the start and end point of the binning range.

  • hist_mode (str, default: 'numba') – Histogram calculation method. Choose between “numpy”, which uses numpy.histogramdd, and “numba”, which uses a similar numba-powered method.

  • jitter (typing.Union[list, dict, None], default: None) – a list of the axes on which to apply jittering. To specify the jitter amplitude or method (normal or uniform noise) a dictionary can be passed. This should look like jitter={'axis': {'amplitude': 0.5, 'mode': 'uniform'}}. This example also shows the default behaviour, in case None is passed in the dictionary, or jitter is a list of strings. Warning: this is not the most performant approach. Applying jitter on the dataframe before calling the binning is much faster.

  • return_edges (bool, default: False) – If True, returns a list of D arrays describing the bin edges for each dimension, similar to the behaviour of np.histogramdd.

  • skip_test (bool, default: False) – Turns off input check and data transformation. Defaults to False as it is intended for internal use only. Warning: setting this to True might make error tracking difficult.

Raises
  • ValueError – When the method requested is not available.

  • AttributeError – If bins, axes and ranges are not congruent in dimensionality.

  • KeyError – When the columns along which to compute the histogram are not present in the dataframe.

Returns

typing.Union[numpy.ndarray, typing.Tuple[numpy.ndarray, list]] – 2-element tuple returned only when return_edges is True. Otherwise only hist is returned.

  • hist: The result of the n-dimensional binning.

  • edges: A list of D arrays describing the bin edges for each dimension.
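
As a rough illustration of the jitter and return_edges descriptions above, the following sketch uses only NumPy. It is not sed's code. It assumes that 'amplitude' is the half-width of a uniform noise interval, and it applies the noise to the data before binning, as the warning on the jitter parameter recommends. The column values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

# Integer-valued coordinate (e.g. a detector pixel index) as a stand-in column.
x = rng.integers(0, 10, size=500).astype(float)

# Uniform jitter, assuming 'amplitude' is the half-width of the noise interval.
amplitude = 0.5
x_jittered = x + rng.uniform(-amplitude, amplitude, size=x.shape)

# np.histogramdd returns the counts and a list of D edge arrays (here D = 1).
sample = x_jittered[:, np.newaxis]  # shape (500, 1): one binning axis
hist, edges = np.histogramdd(sample, bins=[20], range=[(-0.5, 9.5)])
```

Here hist plays the role of the array returned by default, while the (hist, edges) pair corresponds to the tuple described for return_edges=True. Jittering spreads the integer values over neighbouring bins, which avoids artifacts when the bin width is not a multiple of the original digitization step.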