DATA PREPARATION FOR MACHINE LEARNING

Preamble

You have probably noticed that artificial intelligence and machine learning are currently popular topics. Absolutely, as it is a very significant issue that will undoubtedly influence how we live in the future. However, after looking at the majority of the machine learning-related code that is available online, it occurs to me how much “coding” people actually use to prepare the data. Often, the input data is “just a matrix,” created over a long period of time by numerous people (and occasionally from memory).

The obvious question in my situation is: Why not prepare the data in SQL? It is much easier, more flexible, and demands less effort.

Creating some sample data

I’ll create a straightforward table with just 20 random values to show what we can do with PostgreSQL. Let’s define a table first:

 test=# CREATE TABLE inputdata ( 	id int, 	data numeric 	DEFAULT random() * 1000 ); CREATE TABLE The table is populated: test=# INSERT INTO inputdata 	SELECT * FROM generate_series(1, 20); INSERT 0 20

Although we currently only have a table with 20 randomly chosen values, you actually already have some data available, such as:

 test=# SELECT * FROM inputdata; id | data ----+------------------ 1 | 542.76927607134 2 | 813.954454381019 3 | 215.18046176061 4 | 989.989245776087 5 | 142.890753224492 6 | 326.086463406682 7 | 24.8975520953536 8 | 266.512574627995 9 | 86.0621216706932 10 | 801.756543107331 11 | 790.149183012545 12 | 317.997705657035 13 | 975.230060052127 14 | 385.490739252418 15 | 746.592517476529 16 | 621.084009762853 17 | 208.689162041992 18 | 529.119417071342 19 | 260.399237740785 20 | 563.285110052675 (20 rows)

Thoughts on sampling, training, and verification

You will always begin by dividing the data into different parts if you are training an AI model (perhaps a Support Vector Machine or SVM, a neural network, or whatever).

• Data needed to train your AI model
• Data for testing your AI model

To train your model, use the training data. The performance of your model is then evaluated using the test data. Data segmentation is crucial and, in a sense, the secret to success.

Table sampling in PostgreSQL

PostgreSQL core already includes sampling as of version 9.5. As an illustration:

 test=# SELECT * 	FROM inputdata TABLESAMPLE BERNOULLI (50) 				REPEATABLE (87); id | data ----+------------------ 3 | 215.18046176061 4 | 989.989245776087 5 | 142.890753224492 10 | 801.756543107331 11 | 790.149183012545 12 | 317.997705657035 14 | 385.490739252418 15 | 746.592517476529 16 | 621.084009762853 19 | 260.399237740785 20 | 563.285110052675 (11 rows)

The table must be amended to include the TABLESAMPLE clause and guidelines for how it should function. I decided to use the repeatable Bernoulli table sampling method (TSM) in this case. In order to ensure that we can train our model repeatedly using the same input data, it makes sense to use the REPEATABLE clause in the context of machine learning. No matter how many times we run the code, PostgreSQL will consistently return the same sample (as long as the underlying data stays the same, of course).

Machine learning: Lots and lots of data …

So far, everything seems okay. This tactic does have a small disadvantage, though. To function properly, a state-of-the-art model requires a LOT of input data. The number of rows is in the millions. The following issue (expressed in pseudo code) is one we will eventually encounter:

 SELECT	* FROM 	inputdata WHERE	id NOT IN (SELECT id FROM test_data_set);

This raises the following two issues:

• It requires a lot of storage space to keep the test data set.
• The price of the large NOT IN statement is reasonable.

Is there another way to carry out that task then? What I came up with is a fairly easy-to-implement tactic that works.

Because PostgreSQL lacks a “NOT IN TABLESAMPLE”-clause and because we don’t want to duplicate our data, the idea is to use a view that can be used to extract the sample:

 test=# CREATE VIEW preparation_step1 AS 	SELECT 	*, abs(hashtext(id::text) % 100) 	FROM 	inputdata ; CREATE VIEW

The hashtext function is used to first generate a hash from the input data. The hashtext yields evenly spaced numbers, which is exactly what we need in this situation. It will produce 100 slices of data, each containing 1% of the total. Be mindful that the hashtext function is capable of returning negative values. Positive values will be generated by the “abs” function.

 test=# \x Expanded display is on. test=# \df *hashtext* List of functions -[ RECORD 1 ]-------+----------- Schema | pg_catalog Name | hashtext Result data type | integer Argument data types | text Type | normal

The outcome will already be quite helpful when considering the view:

 test=# SELECT * FROM preparation_step1 LIMIT 10; id | data | abs ----+------------------+----- 1 | 542.76927607134 | 47 2 | 813.954454381019 | 26 3 | 215.18046176061 | 4 4 | 989.989245776087 | 92 5 | 142.890753224492 | 58 6 | 326.086463406682 | 12 7 | 24.8975520953536 | 95 8 | 266.512574627995 | 88 9 | 86.0621216706932 | 36 10 | 801.756543107331 | 81 (10 rows)

We can now exclude certain data. An illustration: “abs 50” may be training data, and the remaining information may be used to validate and examine our models.

If your dataset is extremely huge, this strategy is generally acceptable (xxx-million rows or so). If your dataset is too small, it might not be the best. It would be preferable in this situation to utilize the sample techniques offered by your chosen library (TensorFlow, sklearn, etc.). What makes that so? You incur the danger of producing a biased sample if your data collection is small (e.g., only a few hundred rows). Why does that matter? Consider you have a dataset with details about men and women. The sample should have the same distribution as the original data set, which means that there should be an equal number of males and women. Libraries like sklearn and others support the “stratification” approach to solving the issue. In my straightforward SQL example, I’m assuming that the model will receive a significant amount of data, so stratification won’t be a problem.

About Enteros

Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.

The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.

Are you interested in writing for Enteros’ Blog? Please send us a pitch!

How Predictive Database Analytics Helps Optimize Cloud Resource Utilization

23 June 2026
Database Performance Management

As enterprises continue migrating workloads to the cloud, optimizing resource utilization has become a critical business priority. Cloud infrastructure offers scalability, flexibility, and operational agility, but it also introduces new cost and performance challenges. Without proper visibility into workload behavior, organizations often struggle to balance application performance with infrastructure efficiency. At the center of this … Continue reading “How Predictive Database Analytics Helps Optimize Cloud Resource Utilization”

Why Proactive SQL Performance Monitoring Is Essential for Enterprise Growth

Database Performance Management

In today’s digital economy, enterprise growth depends heavily on application speed, scalability, and reliability. As businesses expand their digital services, customer interactions, transactions, analytics, and operational workloads grow exponentially. Behind nearly every business-critical application lies SQL-driven databases that process and manage massive amounts of structured data in real time. From financial transactions and e-commerce purchases … Continue reading “Why Proactive SQL Performance Monitoring Is Essential for Enterprise Growth”

How to Enable Data-Driven Media Growth with Enteros Cost Attribution and Software Management

22 June 2026
Software Engineering

Introduction The media industry is experiencing one of the most significant transformations in its history. Streaming services, digital publishing platforms, online advertising ecosystems, video-on-demand applications, and content distribution networks have fundamentally changed how audiences consume content. Modern media organizations now operate highly complex digital ecosystems that support: Streaming platforms Digital publishing systems Video content delivery … Continue reading “How to Enable Data-Driven Media Growth with Enteros Cost Attribution and Software Management”

How to Enable Intelligent Wealth Management Operations with Enteros Database Software, AIOps Platform, and Gen AI

Database Performance Management

Introduction The wealth management industry is undergoing a major transformation. As investors demand personalized financial services, real-time portfolio visibility, and digital-first experiences, wealth management firms are increasingly relying on technology to drive operational efficiency, improve client engagement, and accelerate business growth. Modern wealth management organizations now support: Portfolio management platforms Wealth advisory applications Digital client … Continue reading “How to Enable Intelligent Wealth Management Operations with Enteros Database Software, AIOps Platform, and Gen AI”