
Data Processing – All Methods & Functions You Should Know

In the realm of AI, data plays a critical role. Developing models is impossible without a sufficient amount of training data, so it is essential to understand how to collect and process that data. This article explores the fundamental concepts and capabilities of PyTorch that enable us to work with data efficiently.

What is Data Processing?

Data processing transforms raw data into a more meaningful and useful form for decision-making in AI models. It typically involves several steps: data collection, preprocessing, analysis, interpretation, and visualization.

In this article, we will focus on using PyTorch to store and manipulate data for preprocessing, and how to read and use that stored data.

Data Storing

Working with data and performing mathematical operations on it is the foundation of AI. To achieve this, a proper data storage method is necessary. This is where n-dimensional arrays, also known as tensors, become essential: they allow data to be stored efficiently and effectively. In the programming world, the NumPy library is the go-to tool for working with tensors.

However, PyTorch and other deep learning frameworks offer their own tensor type, which mirrors NumPy's functionality but adds features that make it well suited to AI tasks. PyTorch supports automatic differentiation and uses GPUs to accelerate numerical computation, giving it a significant advantage over NumPy, which runs only on CPUs. As a result, coding neural networks is both quick and easy.

Throughout this article, I use Python together with PyTorch. You can follow our installation guide to set up your PC properly!

Let’s see how we use tensors to store data.

A tensor represents an array of numerical values that may have several dimensions. When a tensor has just one axis or dimension, it is commonly referred to as a vector. Similarly, when a tensor has two axes, it is often called a matrix. If the tensor has k>2 axes, we just refer to the object as a kth-order tensor.

If you’re unfamiliar with these linear algebra concepts, check our comprehensive Linear Algebra guide for a deeper understanding.
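To make these terms concrete, here is a small sketch using PyTorch's .dim() method, which returns the number of axes (the "order") of a tensor:

```python
import torch

# .dim() returns the number of axes of a tensor
v = torch.arange(3)                # 1 axis  -> vector
M = torch.arange(6).reshape(2, 3)  # 2 axes  -> matrix
T = torch.zeros(2, 2, 3)           # 3 axes  -> 3rd-order tensor

print(v.dim(), M.dim(), T.dim())   # 1 2 3
```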

There are five primary methods for creating a PyTorch tensor to store data:

First of all, we should import the PyTorch library:

import torch

torch.arange()

torch.arange() is used to create a vector.

For example, torch.arange(n) creates a vector of values starting at 0 (included) and ending at n (excluded). By default, the step size is 1.

x = torch.arange(6)

Output:

tensor([0, 1, 2, 3, 4, 5])

Each value is an element of the tensor (the tensor x above contains 6 elements). We can use .numel() to see the total number of elements in a tensor.

x.numel()

Output:

6

We can see the tensor’s shape (the length along each axis) by using the shape attribute. As x is a vector, the shape contains just a single element.

x.shape

Output:

torch.Size([6])

.reshape() is used to change the shape of a tensor. For example, we can transform vector x to a matrix X with shape (2, 3) like this:

X = x.reshape(2,3)

Output:

tensor([[0, 1, 2],
        [3, 4, 5]])

Since the tensor's total size is fixed, we only need to specify all but one component of the shape; the remaining component can be inferred automatically. To do this, we place -1 in the position of the shape component that should be inferred.

So instead of calling x.reshape(2, 3), we can use x.reshape(-1, 3) or x.reshape(2, -1).

X = x.reshape(2,-1)

Output:

tensor([[0, 1, 2],
        [3, 4, 5]])

torch.zeros()

torch.zeros() is used to create a tensor with all elements set to zero.

For example, we can create a (2, 3) matrix containing only zeros using torch.zeros(2, 3):

X = torch.zeros(2,3)

Output:

tensor([[0., 0., 0.],
        [0., 0., 0.]])

Similarly, we can create a (4, 2, 3) tensor with 3 dimensions (3 axes) and all-zero values like this:

X = torch.zeros(4,2,3)

Output:

tensor([[[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.]]])

torch.ones()

We use torch.ones() to create a tensor with all ones.

X = torch.ones(2,3)

Output:

tensor([[1., 1., 1.],
        [1., 1., 1.]])

torch.randn()

We use torch.randn() to get a tensor whose elements are drawn at random from a probability distribution.

This method is mainly used for initializing the model’s parameters randomly.

For example, we can create a (2,3) shape of a tensor (matrix) with elements drawn from a standard Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1 like this:

X = torch.randn(2,3)

The output is a (2, 3) tensor of random values, and it changes from run to run.
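Because torch.randn() draws fresh random values on every call, results differ between runs. If you need reproducible draws (for debugging, say), you can fix the random seed first; a minimal sketch:

```python
import torch

torch.manual_seed(0)      # fix the generator state
a = torch.randn(2, 3)

torch.manual_seed(0)      # reset to the same state
b = torch.randn(2, 3)

print(torch.equal(a, b))  # True: identical seeds give identical draws
```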

torch.tensor()

torch.tensor() is used to create a tensor from values that we supply for each element.

For example, we can create a (3, 2) tensor (matrix) with exactly the values we want like this:

X = torch.tensor([[-1,14],[32,4],[-7,10]])

Output:

tensor([[-1, 14],
        [32,  4],
        [-7, 10]])

Also, we can create a (2, 2, 3) 3D tensor like this (the values here are illustrative):

X = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

Output:

tensor([[[ 1,  2,  3],
         [ 4,  5,  6]],

        [[ 7,  8,  9],
         [10, 11, 12]]])

Data Manipulation

Indexing and Slicing

We can access tensor elements by indexing.

As with Python lists, index 0 accesses the first element and -1 accesses the last; negative indices count backward from the end.

For example, with x = torch.arange(6) and X = x.reshape(2, 3):

print(X[0])
print(X[-1])

Output:

tensor([0, 1, 2])
tensor([3, 4, 5])

We can access whole ranges of indices by slicing, like X[start:stop]; the returned value includes the first index (start) but not the last (stop).

print(x[0:2])
print(X[0:2])

Output:

tensor([0, 1])
tensor([[0, 1, 2],
        [3, 4, 5]])

All the above indexing and slicing operations are applied along axis 0 (row axis). Furthermore, we can also specify the column range we want.

X[0:2, 0:1]

Output:

tensor([[0],
        [3]])

We can get a single element of a matrix by specifying both indices, like this:

X[1,1]

Output:

tensor(4)

We can also write new values into a tensor by assigning to an index, for example:

X[1, 1] = 20
print(X)

Output:

tensor([[ 0,  1,  2],
        [ 3, 20,  5]])

Mathematical Operations

Here are some of the common elementwise mathematical operations we need:

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2.0, 2, 2, 2])
print(x + y)
print(x - y)
print(x * y)
print(x / y)
print(x ** y)

Output:

tensor([ 3.,  4.,  6., 10.])
tensor([-1.,  0.,  2.,  6.])
tensor([ 2.,  4.,  8., 16.])
tensor([0.5000, 1.0000, 2.0000, 4.0000])
tensor([ 1.,  4., 16., 64.])

Remember, since these are elementwise operations, the tensors involved must have the same shape.

X = torch.arange(6, dtype=torch.float32).reshape(2, 3)
Y = torch.ones(2, 3)
print(X + Y)
print(X * Y)

Output:

tensor([[1., 2., 3.],
        [4., 5., 6.]])
tensor([[0., 1., 2.],
        [3., 4., 5.]])

Under certain conditions, however, we can still perform operations on tensors of different shapes using the broadcasting mechanism.

a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)
print(a + b)

Output:

tensor([[0, 1],
        [1, 2],
        [2, 3]])

Here a is expanded along the columns and b along the rows, so both behave as (3, 2) tensors before the elementwise addition.

We can concatenate two tensors using torch.cat():

X = torch.arange(6).reshape(2, 3)
Y = torch.ones(2, 3, dtype=torch.long)
print(torch.cat((X, Y), dim=0))

Output:

tensor([[0, 1, 2],
        [3, 4, 5],
        [1, 1, 1],
        [1, 1, 1]])

Here dim indicates the axis along which the tensors are joined: dim=0 means axis 0 (the row axis) and dim=1 means axis 1 (the column axis).

print(torch.cat((X, Y), dim=1))

Output:

tensor([[0, 1, 2, 1, 1, 1],
        [3, 4, 5, 1, 1, 1]])

Saving Memory

When we run operations, new memory is allocated to store the results.

For example, if we write Y = X + Y, the result is bound to the name Y, but it is stored in newly allocated memory. We can verify this using Python's id() function, which gives us the address of the referenced object in memory.

before = id(Y)
Y = X + Y
print(id(Y) == before)

Output:

False

You can see that the new Y lives at a different memory location.

In AI, we often update many parameters many times, and allocating new memory for every update is wasteful and can exhaust memory. Moreover, other references may point to the same tensor; if we do not update it in place, those references will still point to the stale values.

To solve these problems, we perform the update in place using slice notation ([:]). For example:

Z = torch.zeros_like(Y)
before = id(Z)
Z[:] = X + Y
print(id(Z) == before)

Output:

True

Z[:] writes the result into Z's existing memory location instead of allocating a new one.

If you don’t use the value of X again, you can simply use X[:] = X + Y to reduce the memory overhead of the operation.
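Besides slice assignment, PyTorch also offers in-place variants of its operations (methods with a trailing underscore), and augmented assignment on tensors updates in place as well. A minimal sketch:

```python
import torch

X = torch.ones(2, 3)
Y = torch.ones(2, 3)

before = id(X)
X += Y       # augmented assignment updates the tensor in place
X.add_(Y)    # trailing-underscore methods are the in-place variants

print(id(X) == before)  # True: X still lives at the same address
print(X)                # every element is now 3
```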

Conversion

If you want to convert a tensor to a NumPy array, you can do it like this:

Y = X.numpy()
print(type(Y))

Output:

<class 'numpy.ndarray'>

Also, if you want to convert a NumPy array back to a tensor:

Y = torch.from_numpy(X)
print(type(Y))

Output:

<class 'torch.Tensor'>
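Note that for CPU tensors these conversions do not copy the data: the tensor and the NumPy array share the same underlying memory, so changing one changes the other. A quick sketch:

```python
import torch

X = torch.arange(3)
Y = X.numpy()   # Y is a view on X's memory, not a copy

X[0] = 99       # modify the tensor...
print(Y[0])     # ...and the change is visible through the NumPy array: 99
```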

Dataset Reading

When it comes to storing datasets, we commonly use the CSV (comma-separated values) file format for tabular data, such as spreadsheets and database exports. In a CSV file, each line represents a row of data (one record), and the values within each row are separated by commas.


By default, the first row contains the headers (column names), and subsequent rows contain the data for those columns. When read into pandas, each data row also gets an index number (starting from 0).

We can create such a results.csv file like this:

import os

os.makedirs('D:/Templates/data', exist_ok=True)   # create the data folder
data_file = 'D:/Templates/data/results.csv'

# The column names and values below are illustrative
with open(data_file, 'w') as f:
    f.write('Name,Marks\n')   # header row: column names
    f.write('Amal,76\n')      # each following line is one record
    f.write('Nimal,82\n')

os.makedirs( ) – creates a folder at the location you give, in this example D:/Templates/data. If the data folder already exists, it would normally raise an error; to prevent this we pass exist_ok=True.

data_file – this variable holds the file name and its location. If you want to save results.csv locally (in the same location as the .py file), you can use data_file = 'results.csv'.

with open( ) – this is one way to open CSV files. Here we pass two parameters: the file location and the access mode we need. The common modes are:

'r' – read (the default), 'w' – write (overwrites the file), 'a' – append, 'x' – create (fails if the file already exists). Add 'b' for binary mode or '+' to allow both reading and writing.

.write( ) – This is used to write data to the file.

To read CSV files we use a Python library called pandas, whose read_csv function reads the CSV format into a DataFrame.

import pandas as pd

data = pd.read_csv(data_file)

Output:

print(data)

When we work with supervised AI models, we have to separate the dataset's fields (columns) into input values and target values. For that, we use indexing in pandas.

To demonstrate this, I use a house_prices dataset here.

import pandas as pd

# Illustrative values; the original dataset's exact entries are not shown
prices = pd.DataFrame({
    'TotalRooms': [None, 2.0, 4.0, None, 3.0],
    'RoofType': [None, None, 'Slate', None, 'Slate'],
    'Price': [127500, 106000, 178100, 140000, 195000],
})
prices.to_csv('prices.csv', index=False)
print(prices)

In this dataset, Price is our target value and the other fields are the input values for the AI model. We can select these columns/fields using .iloc[] like this:

data = pd.read_csv('prices.csv')
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

After this separation, inputs holds the TotalRooms and RoofType columns, while targets holds the Price column.

If you want to split the input values into training data and test data, you can use training, test = inputs.iloc[0:3, :], inputs.iloc[3:5, :].

For categorical input fields, we can treat NaN as a category. Since the RoofType column takes the values Slate and NaN, pandas can convert this column into two columns, RoofType_Slate and RoofType_nan. A row whose roof type is Slate will set the values of RoofType_Slate and RoofType_nan to True and False, respectively; the converse holds for a row with a missing RoofType value:

inputs = pd.get_dummies(inputs, dummy_na=True)

Output:

print(inputs)

For missing numerical values, one common method is to replace the NaN entries with the mean value of the corresponding column, for example in the TotalRooms column:

inputs = inputs.fillna(inputs.mean())

Output:

print(inputs)

You can now see that the NaN values in the TotalRooms field have been filled with the column mean by pandas.
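Once every entry is numerical, the final step is usually to convert the pandas columns into PyTorch tensors for training. A sketch, using illustrative stand-ins for the preprocessed inputs and targets (the column names follow the example above):

```python
import pandas as pd
import torch

# Illustrative stand-ins for the preprocessed DataFrames
inputs = pd.DataFrame({'TotalRooms': [3.0, 2.0, 4.0],
                       'RoofType_Slate': [True, False, True],
                       'RoofType_nan': [False, True, False]})
targets = pd.Series([127500, 106000, 178100])

# DataFrame -> NumPy array -> tensor
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))

print(X.shape, y.shape)  # torch.Size([3, 3]) torch.Size([3])
```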

When working with data, it’s crucial to understand the fundamental methods and concepts of data processing. To ensure your AI model performs as intended, it’s essential to use appropriate techniques for extracting and preprocessing the data. Neglecting these steps can have a negative impact on the accuracy and effectiveness of your model.
