API Reference¶
- Airpeak.baseline_detection(df, pollutant, base_lambda=1000000.0, base_p=0.001)¶
Detects baseline from pollutant concentration data using Asymmetric Least Squares smoothing.
This function processes a DataFrame containing pollutant measurements and calculates the baseline signal using the baseline_als algorithm. It adds padding to the beginning and end of the data to improve edge detection. Knowing the baseline (e.g., the outdoor concentration) is essential for estimating pollutant loss rates using mass balance models.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing pollutant measurements
pollutant (str) – Column name in the DataFrame containing the pollutant measurements
base_lambda (float, optional) – Smoothing parameter for baseline_als algorithm (default is 1×10⁶)
base_p (float, optional) – Asymmetry parameter for baseline_als algorithm (default is 0.001)
- Returns:
A copy of the input DataFrame with an additional ‘baseline’ column containing the calculated baseline values
- Return type:
pandas.DataFrame
Notes
The function adds padding (100000 points) at the beginning and end of the data to improve baseline detection at the edges. These padding points are removed from the final output.
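The ALS smoother itself is not shown in this reference; below is a minimal, self-contained sketch of the asymmetric least squares algorithm it names (after Eilers and Boelens), with `lam` and `p` playing the roles of base_lambda and base_p. The standalone function below is an illustration, not the library's implementation, and it omits the padding step described above.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam=1e6, p=1e-3, n_iter=10):
    """Asymmetric least squares baseline (Eilers and Boelens).

    Minimizes sum(w_i * (y_i - z_i)^2) + lam * sum((second diff of z)^2),
    with asymmetric weights so the fit hugs the lower envelope of y.
    """
    L = len(y)
    # Second-difference penalty matrix
    D = sparse.diags([1.0, -2.0, 1.0], [0, -1, -2], shape=(L, L - 2))
    P = lam * D.dot(D.T)
    w = np.ones(L)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + P).tocsc(), w * y)
        # Points above the current baseline (peaks) get tiny weight p;
        # points below get weight 1 - p.
        w = p * (y > z) + (1 - p) * (y < z)
    return z
```

With a large base_lambda the baseline is stiff (nearly straight); lowering it lets the baseline track slower variations such as a drifting outdoor concentration.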
- Airpeak.dbscan(df, timestamp, eps=0.01, ms=2, transformer=RobustScaler())¶
Apply DBSCAN clustering algorithm to time series data grouped by date. This function processes time series data by applying DBSCAN clustering to identify groups of individual decay events in data labeled as “decay” by the k_means_diff function. The data is first split by date to make hyperparameter selection more universal.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing time series data with status labels
timestamp (str) – Name of the timestamp column in the DataFrame
eps (float, optional (default=0.01)) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
ms (int, optional (default=2)) – The number of samples in a neighborhood for a point to be considered as a core point.
transformer (sklearn.preprocessing object, optional (default=RobustScaler())) – Scaler to use for data preprocessing before DBSCAN clustering
- Returns:
A DataFrame containing the original data with an additional ‘decay_group’ column where:
- -1 represents noise points or non-decay events
- Other integers represent different decay event clusters
- Return type:
pandas.DataFrame
Notes
The function:
1. Splits data by date
2. Creates a cumulative sum of non-decay events
3. Applies DBSCAN clustering on decay events (status_label=1)
4. Merges results back with original data
Any errors in processing a specific date are silently ignored (the function continues with the next date).
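The clustering step applied within each date can be sketched as follows. The helper name and the ‘t’/‘value’ column names are illustrative assumptions; only ‘status_label’ and ‘decay_group’ come from this reference, and the per-date split is omitted.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import RobustScaler

def label_decay_groups(df, eps=0.01, ms=2, transformer=RobustScaler()):
    """Cluster rows flagged as decay (status_label == 1) into
    individual decay events; all other rows keep decay_group = -1."""
    out = df.copy()
    out["decay_group"] = -1
    decay = out[out["status_label"] == 1]
    if decay.empty:
        return out
    # Scale time and concentration so one eps works across features
    X = transformer.fit_transform(decay[["t", "value"]])
    labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
    out.loc[decay.index, "decay_group"] = labels
    return out
```

Because eps is applied after scaling, it is dimensionless; two decay runs separated by a gap larger than eps (in scaled units) end up in different groups.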
- Airpeak.decay_regress(df_date, pollutant, timestamp, portion=[0, 1])¶
Performs linear regression analysis on individual decay periods in time series data.
This function analyzes identified individual decay periods in pollutant concentration data, calculating decay rates and associated statistics through linear regression of log-transformed concentrations. The analysis is based on mass balance.
- Parameters:
df_date (pandas.DataFrame) – DataFrame containing the time series data with decay periods identified
pollutant (str) – Name of the column containing pollutant concentrations
timestamp (str) – Name of the column containing timestamp data
portion (list of float, optional) – Two-element list specifying the start and end fractions of each decay period to use, [start_fraction, end_fraction]; the default [0, 1] uses the entire period
- Returns:
DataFrame containing regression results with columns:
- pollutant: name of analyzed pollutant
- time: midpoint timestamp of decay period
- decay_start: start time of decay period
- decay_end: end time of decay period
- decay_rate: calculated decay rate coefficient
- r2: R-squared value of regression
- ste: standard error of regression
- num_of_point: number of points in regression
- base_value: baseline concentration at start
- median_ele: median concentration above baseline of decay period
- max_diff: maximum concentration difference of decay period
- group: decay period identifier
- method: analysis method identifier (‘decay’)
- Return type:
pandas.DataFrame
Notes
- Requires decay periods to be pre-identified in ‘decay_group’ column
- Performs log-linear regression on concentration differences from baseline
- Excludes decay periods with fewer than 3 points
- Uses natural logarithm for decay rate calculation
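The core regression step follows from the mass balance: during a decay period, C(t) - C_base = C0 * exp(-k t), so ln(C - C_base) is linear in time with slope -k. A hedged sketch of that fit (the helper name is hypothetical, and time is assumed to be in hours so the rate comes out per hour):

```python
import numpy as np
from scipy import stats

def fit_decay_rate(t_hours, conc, baseline):
    """Log-linear fit of one decay period.

    Returns (decay_rate, r2, standard_error); the decay rate is
    minus the slope of ln(C - C_base) versus time.
    """
    y = np.log(np.asarray(conc, dtype=float) - baseline)
    res = stats.linregress(np.asarray(t_hours, dtype=float), y)
    return -res.slope, res.rvalue ** 2, res.stderr
```

The r2 and standard error returned here correspond to the ‘r2’ and ‘ste’ columns of the result DataFrame, which are useful for filtering out poorly defined decay events.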
- Airpeak.feature_generation(df, pollutant, timestamp, diff_ma_window=5, diff_rhl_window=5)¶
Generate additional features from time series data of pollutant measurements. This function calculates several features based on the difference between pollutant measurements and their baseline values, including moving averages, gradients, and relative high-low metrics.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing pollutant measurements and baseline values
pollutant (str) – Column name for the pollutant measurements
timestamp (str) – Column name for the timestamp data
diff_ma_window (int, optional) – Window size for moving average calculation (default is 5)
diff_rhl_window (int, optional) – Window size for relative high-low calculation (default is 5)
- Returns:
DataFrame with original data and additional features:
- min_diff: time difference between adjacent measurements in minutes
- diff: difference between pollutant and baseline concentrations
- diff_ma: moving average of the concentration difference
- diff_gd: gradient of concentration difference
- diff_gd_ln: natural log of concentration difference gradient
- diff_gd_sign: binary indicator of positive/negative gradient
- diff_gd_abs: absolute value of the gradient
- diff_rhl: relative high-low metric (see Ref section)
- Return type:
pandas.DataFrame
Notes
The function assumes input DataFrame contains ‘baseline’ column.
Ref¶
Anghinoni, L.; Zhao, L.; Ji, D.; Pan, H. Time series trend detection and forecasting using complex network topology analysis. Neural Networks 2019, 117, 295– 306, DOI: 10.1016/J.NEUNET.2019.05.018
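A sketch of how a subset of these features can be derived (the helper name and demo column names are hypothetical; only the feature names and the required ‘baseline’ column come from this reference, and diff_rhl is omitted):

```python
import numpy as np
import pandas as pd

def add_diff_features(df, pollutant, timestamp, ma_window=5):
    """Compute difference-based features from pollutant and baseline."""
    out = df.copy()
    # Minutes between adjacent measurements (NaN for the first row)
    out["min_diff"] = out[timestamp].diff().dt.total_seconds() / 60.0
    # Elevation above baseline and its smoothed version
    out["diff"] = out[pollutant] - out["baseline"]
    out["diff_ma"] = out["diff"].rolling(ma_window, center=True,
                                         min_periods=1).mean()
    # Per-sample gradient of the elevation, plus sign/magnitude splits
    out["diff_gd"] = np.gradient(out["diff"].to_numpy())
    out["diff_gd_sign"] = (out["diff_gd"] > 0).astype(int)
    out["diff_gd_abs"] = out["diff_gd"].abs()
    return out
```

These derived columns are what the downstream clustering steps (k_means_ele, k_means_diff) consume.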
- Airpeak.k_means_diff(df, timestamp, n_clusters=2, scaler=MinMaxScaler(), transformer=QuantileTransformer())¶
Performs K-means clustering on concentration gradient data to classify different states in time series.
This function applies K-means clustering to identify different states (build-up, decay, and optionally plateau) in the data labeled as elevated in the previous step. It processes only the elevated periods and labels them according to their gradient characteristics.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing concentration gradient data
timestamp (str) – Column name for timestamp values
n_clusters (int, optional (default=2)) – Number of clusters for K-means (2 for build-up and decay states; 3 to also identify a plateau state)
scaler (sklearn.preprocessing object, optional (default=MinMaxScaler())) – Scaler object for data normalization
transformer (sklearn.preprocessing object, optional (default=QuantileTransformer())) – Transformer object for data transformation
- Returns:
DataFrame with added ‘status_label’ column indicating state classification:
- 0: baseline (non-elevated periods, copied from k_means_ele)
- 1: decay
- 2: buildup (for n_clusters=2)
- 2: plateau, 3: buildup (for n_clusters=3)
- Return type:
pandas.DataFrame
Notes
The function expects the input DataFrame to have ‘elevated’, ‘diff_gd_ln’, and ‘diff_gd_abs’ columns. The ‘elevated’ column should be binary (0/1) indicating elevated periods.
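The idea of separating build-up from decay by gradient sign can be sketched with a 1-D K-means on the gradient alone (the helper name is hypothetical; the label convention 1 = decay, 2 = build-up follows the table above, and the scaler/transformer preprocessing is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_buildup_decay(diff_gd):
    """Cluster gradient values into two states; the cluster with
    the lower mean gradient is taken to be 'decay'."""
    X = np.asarray(diff_gd, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    means = [X[km.labels_ == i].mean() for i in range(2)]
    decay_cluster = int(np.argmin(means))
    # 1 = decay (falling concentration), 2 = build-up (rising)
    return np.where(km.labels_ == decay_cluster, 1, 2)
```

With n_clusters=3, an intermediate near-zero-gradient cluster would become the plateau state.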
- Airpeak.k_means_ele(df, scaler=MinMaxScaler(), transformer=QuantileTransformer())¶
Performs K-means clustering on concentration data to identify significant concentration elevations.
This function applies the K-means clustering algorithm with 2 clusters to the provided dataframe, using the ‘diff_ma’ (moving average difference) and ‘diff_gd_abs’ (absolute gradient difference) features. The data is normalized using both scaling and quantile transformation before clustering.
- Parameters:
df (pandas.DataFrame) – Input dataframe containing columns ‘diff_ma’, ‘diff_gd_abs’, and ‘diff’
scaler (sklearn.preprocessing object, optional) – Scaler for data normalization (default is MinMaxScaler())
transformer (sklearn.preprocessing object, optional) – Transformer for data distribution (default is QuantileTransformer())
- Returns:
A copy of input dataframe with an additional boolean column ‘elevated’ indicating cluster membership. True (1) indicates elevated measurements, False (0) indicates normal measurements. Elevated measurements will be further differentiated into build-up, plateau, and decay events.
- Return type:
pandas.DataFrame
Notes
The cluster with the higher average concentration value is automatically labeled as the ‘elevated’ cluster (1), while the other cluster is labeled as normal (0).
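A hedged sketch of this step, including the automatic labeling of the higher-concentration cluster (the helper name is hypothetical; the feature columns and the elevated/normal convention are from this reference):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

def flag_elevated(df):
    """Add an 'elevated' (0/1) column via 2-cluster K-means on
    'diff_ma' and 'diff_gd_abs', normalized before clustering."""
    X = df[["diff_ma", "diff_gd_abs"]].to_numpy()
    X = MinMaxScaler().fit_transform(X)
    X = QuantileTransformer(n_quantiles=min(len(df), 1000)).fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    out = df.copy()
    # The cluster with the higher mean 'diff' is the elevated one
    means = [df.loc[labels == i, "diff"].mean() for i in range(2)]
    out["elevated"] = (labels == int(np.argmax(means))).astype(int)
    return out
```

Because K-means assigns arbitrary cluster indices, comparing the mean concentration of each cluster is what makes the ‘elevated’ label deterministic.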
- Airpeak.outlier_removal(df, pollutant, outlier_neighbors)¶
Remove outliers from a DataFrame using Local Outlier Factor (LOF) algorithm.
This function applies the LOF algorithm to identify and remove outliers in a specified pollutant column, considering the temporal sequence of measurements through the index.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing the pollutant measurements
pollutant (str) – Name of the column containing the pollutant data to check for outliers
outlier_neighbors (int) – Number of neighbors to consider when determining if a point is an outlier
- Returns:
A new DataFrame with outliers removed, maintaining the same structure as the input DataFrame but excluding the identified outlier rows
- Return type:
pandas.DataFrame
Notes
This is an optional step. Consider applying it if the raw data contains obvious outliers. The function uses both the temporal index and pollutant values to identify outliers. Points identified as outliers (labeled as -1 by LOF) are removed from the dataset.
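The index-plus-value approach described above can be sketched as follows (the helper name is hypothetical; the use of the row index as the temporal coordinate and the -1 outlier label follow this reference):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def remove_outliers(df, pollutant, n_neighbors):
    """Drop rows LOF flags as outliers, using (index, value) pairs
    so isolated spikes stand out against their temporal neighbors."""
    X = np.column_stack([df.index.to_numpy(float),
                         df[pollutant].to_numpy(float)])
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)
    return df[labels != -1].copy()
```

A single spike far from its temporal neighbors gets a large LOF score and is dropped, while genuine multi-point peaks (which have nearby neighbors at similar values) survive.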
- Airpeak.smoothing(df, pollutant, smoothing_window, smoothing_order)¶
Apply Savitzky-Golay filter to smooth pollutant data in a DataFrame.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing pollutant measurements
pollutant (str) – Name of the column containing pollutant data to be smoothed
smoothing_window (int) – Length of the filter window (must be odd number and greater than smoothing_order)
smoothing_order (int) – Order of the polynomial used to fit the samples
- Returns:
A new DataFrame with smoothed pollutant data
- Return type:
pandas.DataFrame
Notes
This step is optional. Consider applying it if the raw data is noisy. Uses scipy.signal.savgol_filter for smoothing, which fits a polynomial of order ‘smoothing_order’ to windows of size ‘smoothing_window’.
Example
>>> smoothed_df = smoothing(df, "CO2", 5, 3)