2. Dataframe Operations

This module contains dataframe operation functions for the sed package.

sed.core.dfops.apply_jitter(df, cols, cols_jittered=None, amps=0.5, jitter_type='uniform')

Add jittering to one or more dataframe columns.

Parameters:
  • df (Union[pd.DataFrame, dask.dataframe.DataFrame]) – Dataframe to add noise/jittering to.

  • cols (Union[str, Sequence[str]]) – Names of the columns to add jittering to.

  • cols_jittered (Union[str, Sequence[str]], optional) – Names of the columns with added jitter. Defaults to None.

  • amps (Union[float, Sequence[float]], optional) – Amplitude scalings for the jittering noise. If one number is given, the same is used for all axes. For normal noise, amp is the standard deviation of the added noise; for uniform noise, the noise covers the interval [-amp, +amp]. Defaults to 0.5.

  • jitter_type (str, optional) – The type of jitter to add: ‘uniform’ or ‘normal’ distributed noise. Defaults to ‘uniform’.

Returns:

dataframe with added columns.

Return type:

Union[pd.DataFrame, dask.dataframe.DataFrame]
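The difference between the two noise types can be illustrated with a minimal pandas/NumPy sketch. It is a stand-in written for this page, not the sed implementation: the helper name jitter_sketch, the seed argument, and the output column names X_j and Y_j are hypothetical, and the output names are passed explicitly instead of relying on any default.

```python
import numpy as np
import pandas as pd


def jitter_sketch(df, cols, cols_jittered, amps=0.5, jitter_type="uniform", seed=0):
    """Illustrative stand-in (hypothetical helper), not sed.core.dfops.apply_jitter."""
    cols = [cols] if isinstance(cols, str) else list(cols)
    cols_jittered = [cols_jittered] if isinstance(cols_jittered, str) else list(cols_jittered)
    # A single amplitude is reused for every column.
    amps = [amps] * len(cols) if np.isscalar(amps) else list(amps)
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col, new_col, amp in zip(cols, cols_jittered, amps):
        if jitter_type == "uniform":
            # Noise is confined to the interval [-amp, +amp].
            noise = rng.uniform(-amp, amp, size=len(out))
        elif jitter_type == "normal":
            # Noise has standard deviation amp and is unbounded.
            noise = rng.normal(0.0, amp, size=len(out))
        else:
            raise ValueError(f"Unknown jitter_type: {jitter_type}")
        out[new_col] = out[col] + noise
    return out


df = pd.DataFrame({"X": np.arange(1000), "Y": np.arange(1000)})
jittered = jitter_sketch(df, ["X", "Y"], ["X_j", "Y_j"], amps=0.5)
```

Jittering integer-valued detector coordinates in this way is typically done to avoid binning artifacts; with uniform noise of amplitude 0.5, each value stays within its original integer bin width.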

sed.core.dfops.drop_column(df, column_name)

Delete columns.

Parameters:
  • df (Union[pd.DataFrame, dask.dataframe.DataFrame]) – Dataframe to use.

  • column_name (Union[str, Sequence[str]]) – Name or list of names of the columns to be dropped.

Returns:

Dataframe with dropped columns.

Return type:

Union[pd.DataFrame, dask.dataframe.DataFrame]

sed.core.dfops.apply_filter(df, col, lower_bound=-inf, upper_bound=inf)

Application of bound filters to a specified column (can be used consecutively).

Parameters:
  • df (Union[pd.DataFrame, dask.dataframe.DataFrame]) – Dataframe to use.

  • col (str) – Name of the column to filter.

  • lower_bound (float, optional) – The lower bound used in the filtering. Defaults to -np.inf.

  • upper_bound (float, optional) – The upper bound used in the filtering. Defaults to np.inf.

Returns:

The filtered dataframe.

Return type:

Union[pd.DataFrame, dask.dataframe.DataFrame]
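Consecutive use means that the output of one filter call is fed into the next, each call restricting a different column. The following pandas sketch shows the idea; filter_sketch is a hypothetical stand-in, and treating both bounds as exclusive is an assumption made here, not something stated above.

```python
import numpy as np
import pandas as pd


def filter_sketch(df, col, lower_bound=-np.inf, upper_bound=np.inf):
    """Illustrative stand-in (hypothetical helper), not sed.core.dfops.apply_filter."""
    # Assumption: both bounds are exclusive.
    return df[(df[col] > lower_bound) & (df[col] < upper_bound)]


df = pd.DataFrame(
    {"t": [0.0, 1.0, 2.0, 3.0, 4.0], "E": [10.0, 20.0, 30.0, 40.0, 50.0]}
)
# Filter on one column, then apply a second filter to the result.
step1 = filter_sketch(df, "t", lower_bound=0.5)
step2 = filter_sketch(step1, "E", upper_bound=45.0)
```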

sed.core.dfops.map_columns_2d(df, map_2d, x_column, y_column, **kwds)

Apply a 2-dimensional mapping simultaneously to two dimensions.

Parameters:
  • df (Union[pd.DataFrame, dask.dataframe.DataFrame]) – Dataframe to use.

  • map_2d (Callable) – 2D mapping function.

  • x_column (np.ndarray) – The X column of the dataframe to apply mapping to.

  • y_column (np.ndarray) – The Y column of the dataframe to apply mapping to.

  • **kwds – Additional arguments for the 2D mapping function.

Returns:

Dataframe with mapped columns.

Return type:

Union[pd.DataFrame, dask.dataframe.DataFrame]
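A 2D mapping takes both columns at once and returns both transformed columns, which is needed whenever the new X depends on the old Y and vice versa. The sketch below uses a rotation as the mapping; rotate, its angle keyword, and map_2d_sketch are hypothetical names chosen for illustration, and writing the result back into the same two columns is an assumption.

```python
import numpy as np
import pandas as pd


def rotate(x, y, angle=0.0):
    """Example 2D mapping: rotate the points (x, y) by angle (in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return c * x - s * y, s * x + c * y


def map_2d_sketch(df, map_2d, x_column, y_column, **kwds):
    """Illustrative stand-in (hypothetical helper), not sed.core.dfops.map_columns_2d."""
    out = df.copy()
    # Both inputs are read before either output is written, so the
    # mapping sees the original X and Y values.
    out[x_column], out[y_column] = map_2d(df[x_column], df[y_column], **kwds)
    return out


df = pd.DataFrame({"X": [1.0, 0.0], "Y": [0.0, 1.0]})
rotated = map_2d_sketch(df, rotate, "X", "Y", angle=np.pi / 2)
```

Extra keyword arguments (here angle) are forwarded unchanged to the mapping function, mirroring the role of **kwds in the signature above.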

sed.core.dfops.forward_fill_lazy(df, columns=None, before='max', compute_lengths=False, iterations=2)

Forward fill the specified columns multiple times in a dask dataframe.

Allows forward filling between partitions. This is useful for dataframes that have sparse data, such as those with many NaNs. Running the forward filling multiple times can fix the issue of having entire partitions consisting of NaNs. By default we run this twice, which is enough to fix the issue for dataframes with no consecutive partitions of NaNs.

Parameters:
  • df (dask.dataframe.DataFrame) – The dataframe to forward fill.

  • columns (list) – The columns to forward fill. If None, fills all columns.

  • before (int, str, optional) – The number of rows to include before the current partition. If ‘max’, it takes as many rows as possible from the previous partition, which is the size of the smallest partition in the dataframe. Defaults to ‘max’.

  • compute_lengths (bool, optional) – Whether to compute the length of each partition.

  • iterations (int, optional) – The number of times to forward fill the dataframe.

Returns:

The dataframe with the specified columns forward filled.

Return type:

dask.dataframe.DataFrame
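Why more than one iteration can be needed is easiest to see in a small simulation. The sketch below models partitions as a list of pandas Series and assumes that, within one pass, each partition only sees the previous partition as it was before that pass; ffill_pass is a hypothetical helper and not the dask-based implementation.

```python
import numpy as np
import pandas as pd


def ffill_pass(partitions):
    """One simulated pass of forward filling across partition boundaries."""
    filled = [partitions[0].ffill()]
    for prev, cur in zip(partitions[:-1], partitions[1:]):
        # Prepend the (unfilled) previous partition as context, fill,
        # then keep only the rows belonging to the current partition.
        combined = pd.concat([prev, cur]).ffill()
        filled.append(combined.iloc[len(prev):])
    return filled


nan = np.nan
parts = [
    pd.Series([1.0, nan]),
    pd.Series([nan, nan]),  # a partition consisting entirely of NaNs
    pd.Series([nan, 3.0]),
]
once = ffill_pass(parts)   # last partition still starts with NaN
twice = ffill_pass(once)   # second pass propagates the value through
```

After the first pass, the all-NaN partition is filled from its predecessor, but the partition after it only saw NaNs as context. The second pass closes that gap. With two consecutive all-NaN partitions, this model would need a third pass, which matches the caveat in the description above.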

sed.core.dfops.backward_fill_lazy(df, columns=None, after='max', compute_lengths=False, iterations=1)

Backward fill the specified columns multiple times in a dask dataframe.

Allows backward filling between partitions. Similar to forward fill, but backwards. This helps to fill the initial values of a dataframe, which are often NaNs. Use with care as the assumption of the values being the same in the past is often not true.

Parameters:
  • df (dask.dataframe.DataFrame) – The dataframe to backward fill.

  • columns (list) – The columns to backward fill. If None, fills all columns.

  • after (int, str, optional) – The number of rows to include after the current partition. If ‘max’, it takes as many rows as possible from the following partition, which is the size of the smallest partition in the dataframe. Defaults to ‘max’.

  • compute_lengths (bool, optional) – Whether to compute the length of each partition.

  • iterations (int, optional) – The number of times to backward fill the dataframe.

Returns:

The dataframe with the specified columns backward filled.

Return type:

dask.dataframe.DataFrame
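The effect on leading NaNs can be shown with plain pandas on a single, unpartitioned series. This is only an illustration of the fill direction and does not use the dask-based function documented here.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 2.0, np.nan, 5.0])

# Forward filling alone cannot fill the leading NaNs,
# because there is no earlier value to propagate.
forward = s.ffill()

# A subsequent backward fill copies the first valid value into the past.
both = s.ffill().bfill()
```

The first two entries of both are set to 2.0 although no measurement exists for them, which is why backward filling should be used with care.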

sed.core.dfops.offset_by_other_columns(df, target_column, offset_columns, signs, reductions=None, preserve_mean=False, inplace=True, rename=None)

Apply an offset to a column based on the values of other columns.

Parameters:
  • df (dask.dataframe.DataFrame) – Dataframe to use. Currently supports only dask dataframes.

  • target_column (str) – Name of the column to apply the offset to.

  • offset_columns (str) – Name of the column(s) to use for the offset.

  • signs (int) – Sign of the offset.

  • reductions (str, optional) – Reduction function to use for the offset. Defaults to “mean”. Currently, only mean is supported.

  • preserve_mean (bool, optional) – Whether to subtract the mean of the offset column. Defaults to False. If a list is given, it must have the same length as offset_columns. Otherwise the value passed is used for all columns.

  • inplace (bool, optional) – Whether to apply the offset inplace. If False, the new column is named as given by rename, or, if rename is None, target_column with the suffix _offset. Defaults to True.

  • rename (str, optional) – Name of the new column if inplace is False. Defaults to None.

Returns:

Dataframe with the new column.

Return type:

dask.dataframe.DataFrame
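The interplay of signs, reductions and preserve_mean can be sketched in pandas as follows. This is an interpretation of the parameter descriptions above, not the sed implementation: offset_sketch is a hypothetical helper, it works on pandas rather than dask dataframes, and it assumes that a "mean" reduction replaces the offset column by its mean while preserve_mean subtracts the mean from the offset column.

```python
import pandas as pd


def offset_sketch(df, target_column, offset_columns, signs,
                  reductions=None, preserve_mean=False):
    """Illustrative stand-in (hypothetical helper), not offset_by_other_columns."""
    if isinstance(offset_columns, str):
        offset_columns = [offset_columns]
    n = len(offset_columns)
    # Scalar arguments are broadcast to all offset columns.
    signs = [signs] * n if isinstance(signs, int) else list(signs)
    if reductions is None or isinstance(reductions, str):
        reductions = [reductions] * n
    if isinstance(preserve_mean, bool):
        preserve_mean = [preserve_mean] * n

    out = df.copy()
    for col, sign, red, keep in zip(offset_columns, signs, reductions, preserve_mean):
        offset = out[col]
        if red == "mean":
            # Assumption: the whole column is reduced to a single value.
            offset = offset.mean()
        elif keep:
            # Assumption: removing the mean leaves the target's mean unchanged.
            offset = offset - offset.mean()
        out[target_column] = out[target_column] + sign * offset
    return out


df = pd.DataFrame({"energy": [10.0, 11.0, 12.0], "bias": [1.0, 2.0, 3.0]})
shifted = offset_sketch(df, "energy", "bias", signs=-1, preserve_mean=True)
```

In this example the bias column, with its mean removed, is subtracted from energy, so the mean of energy stays at 11.0 while the row-by-row variation is compensated.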