Big Data Preparation ~ Python for Transportation Engineers

To feed the proposed network with inputs of a fixed dimension and to improve the accuracy of the model, the raw data collected by the motion sensors must be pre-processed as follows.

1. Linear Interpolation:

The datasets mentioned above were recorded under realistic conditions, and the sensors worn by the subjects are wireless. As a result, some data may be lost during collection, and the lost samples are usually recorded as NaN or 0. To overcome this problem, linear interpolation was used to fill the missing values in this paper.
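As a minimal sketch of this step, the function below fills missing samples in a single sensor channel with NumPy. The function name `interpolate_missing` and the `missing_value` parameter are illustrative assumptions, not part of the original pipeline; treating 0 as missing is optional because 0 can also be a valid reading.

```python
import numpy as np

def interpolate_missing(signal, missing_value=None):
    """Fill missing samples of a 1-D channel by linear interpolation.

    NaN always counts as missing. If missing_value is given (for example 0.0),
    samples equal to it are treated as missing too (an assumption that only
    holds when that value cannot be a real reading).
    """
    x = np.asarray(signal, dtype=float).copy()
    missing = np.isnan(x)
    if missing_value is not None:
        missing |= (x == missing_value)
    valid = ~missing
    if not valid.any():
        raise ValueError("signal has no valid samples to interpolate from")
    idx = np.arange(x.size)
    # np.interp interpolates linearly between valid neighbours and
    # repeats the nearest valid value at the two ends of the signal.
    x[missing] = np.interp(idx[missing], idx[valid], x[valid])
    return x

# Hypothetical accelerometer channel with two dropped samples.
acc_x = [0.1, np.nan, 0.3, 0.4, np.nan]
filled = interpolate_missing(acc_x)
```

Gaps at the start or end of a recording have only one valid neighbour, so this sketch simply repeats the nearest valid value there.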

2. Scaling and Normalization:

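Training a model directly on channels with large values may lead to training bias, so it is necessary to normalize the input data to the range 0 to 1 (min-max scaling):

$$X_i' = \frac{X_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}}, \quad i = 1, 2, \ldots, n$$

where $n$ denotes the number of channels, and $x_i^{\max}$ and $x_i^{\min}$ are the maximum and minimum values of the $i$-th channel, respectively.

The data can also be standardized to zero mean and unit variance:

$$X' = \frac{X - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $X$, respectively.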
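Both scalings can be sketched in a few lines of NumPy. The example assumes the data is arranged as a 2-D array with one row per sample and one column per channel; the function names and the handling of constant channels are illustrative choices, not something the original text specifies.

```python
import numpy as np

def min_max_scale(X):
    """Scale every channel (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Assumption: a constant channel is mapped to 0 instead of dividing by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

def standardize(X):
    """Shift every channel to zero mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # same guard for constant channels
    return (X - mu) / sigma

# Hypothetical recording: 3 samples, 2 channels on very different scales.
X = np.array([[0.0, 100.0],
              [5.0, 200.0],
              [10.0, 300.0]])
X_scaled = min_max_scale(X)
X_standard = standardize(X)
```

In practice the minimum, maximum, mean, and standard deviation would normally be computed on the training split only and then reused for validation and test data.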

Batch Normalization:

Batch normalization is a technique commonly used in deep learning to normalize the inputs of a neural network layer. It aims to stabilize and improve the training process by normalizing the activations of the previous layer within a mini-batch. This technique helps to mitigate the internal covariate shift problem, which refers to the change in the distribution of layer inputs during training.
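The training-time computation can be sketched with NumPy as below. This is a simplified illustration, not a framework implementation: it assumes a 2-D mini-batch of shape (batch size, features), uses the usual learnable scale `gamma` and shift `beta`, and leaves out the running statistics that deep learning libraries keep for inference.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x:     array of shape (batch_size, n_features)
    gamma: per-feature scale, shape (n_features,)
    beta:  per-feature shift, shape (n_features,)
    eps:   small constant that keeps the division numerically stable
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)              # per-feature mean over the batch
    var = x.var(axis=0)                # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical mini-batch of 3 samples with 2 features.
batch = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 10.0]])
out = batch_norm_forward(batch, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma` set to ones and `beta` set to zeros, each output feature has approximately zero mean and unit variance within the batch.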

3. Resampling:

* Undersampling Method:

Undersampling is a resampling technique that reduces the number of majority class samples to balance the class distribution in imbalanced datasets. It randomly removes instances from the majority class to match the number of instances in the minority class. The aim is to prevent the model from being biased towards the majority class. The most common approach is:

-Random undersampling: Randomly select a subset of samples from the majority class.
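Random undersampling could be sketched as follows. The function name, the fixed random seed, and the choice to reduce every class to the size of the smallest one are assumptions made for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so all classes match the smallest one."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]

# Hypothetical imbalanced data: 8 samples of class 0, 2 samples of class 1.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_undersample(X, y)
```

The discarded majority samples are lost to training, so this approach suits datasets where the majority class is large enough to spare them.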

* Oversampling Method:

Oversampling is a resampling technique that increases the number of minority class samples to balance the class distribution. It duplicates or synthesizes new instances from the minority class to match the number of instances in the majority class. The goal is to provide more training examples for the minority class and avoid biased predictions. Two common approaches are:

-Random oversampling: Randomly duplicate instances from the minority class.

-Synthetic Minority Over-sampling Technique (SMOTE): Create synthetic instances by interpolating between existing minority class samples.
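The two approaches above can be sketched with NumPy. Both functions are simplified illustrations with hypothetical names: `random_oversample` assumes a binary problem, and `smote_like` only captures the core interpolation idea of SMOTE rather than reproducing a library implementation.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    n_extra = np.sum(y != minority_label) - minority_idx.size
    if n_extra <= 0:
        return X, y
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances; a sample is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)         # random base sample
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                           # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Each synthetic point lies on the line segment between a minority sample and one of its `k` nearest minority neighbours, so no point falls outside the region already covered by the minority class.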

* Regression Method:

Regression methods are used in resampling to estimate or predict missing values in a dataset based on the relationship between other variables. This is commonly used when there are missing values or to fill in gaps in time series data. Regression models can be trained on existing data points to predict the missing values. The formula for regression methods depends on the specific regression model being used.
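As one possible illustration, the sketch below fits an ordinary least-squares line on the complete observations and uses it to predict the missing ones. The choice of a simple linear model, the function name, and the example variables are assumptions; any other regression model could take its place.

```python
import numpy as np

def regression_impute(predictor, target):
    """Fill NaN entries of target using a line fitted on the complete pairs.

    predictor: 1-D array with no missing values (for example a time index)
    target:    1-D array of the same length that may contain NaN
    """
    predictor = np.asarray(predictor, dtype=float)
    target = np.asarray(target, dtype=float).copy()
    missing = np.isnan(target)
    # Design matrix [predictor, 1] for the model target = slope * predictor + intercept.
    A = np.column_stack([predictor[~missing], np.ones(np.sum(~missing))])
    (slope, intercept), *_ = np.linalg.lstsq(A, target[~missing], rcond=None)
    target[missing] = slope * predictor[missing] + intercept
    return target

# Hypothetical series: a measurement recorded every minute with one gap.
minutes = np.array([0, 1, 2, 3, 4])
values = np.array([1.0, 3.0, np.nan, 7.0, 9.0])
completed = regression_impute(minutes, values)
```

Unlike linear interpolation, which only looks at the two neighbouring samples, the regression uses every complete observation to estimate the missing value.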

4. Segmentation:

The data is divided into segments according to the spatial and temporal requirements of the further analysis.
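For the temporal case, a common way to segment a sensor recording is a fixed-length sliding window. The sketch below assumes that approach; the window length, the step, and the function name are illustrative, and spatial segmentation is not covered.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Split a recording into fixed-length windows along the time axis.

    signal: array of shape (n_samples,) or (n_samples, n_channels)
    Returns an array whose first axis indexes the windows. Trailing samples
    that do not fill a complete window are dropped.
    """
    signal = np.asarray(signal)
    if len(signal) < window_size:
        raise ValueError("signal is shorter than one window")
    starts = range(0, len(signal) - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

# Hypothetical recording of 10 samples, windows of 4 samples with 50% overlap.
recording = np.arange(10)
segments = sliding_windows(recording, window_size=4, step=2)
```

A step smaller than the window size produces overlapping segments, which increases the number of training examples at the cost of correlated windows.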

Saturday, July 8, 2023
