What concepts should a data and AI expert master

May 30, 2024

José Luis Marina

At taniwa, we are experts in data processing and applying AI models to solve business problems.

But hundreds of companies in the sector say the same thing, so showing what differentiates us is key to working with the right clients.

Our discourse doesn’t differentiate us.
Deep knowledge of certain concepts does differentiate us.
The projects we’ve done and the results we’ve obtained differentiate us.

Working with ChatGPT, connecting to an AI API, or building a Machine Learning model is within anyone’s reach. That has value and takes some knowledge, but the fine work of solving real business problems is in the details.

You need to know:

How to curate data. Augment it if necessary.
How to orient information to business objectives.
How to design processes and architecture to be useful for users.
Tools and concepts that allow you to do it. And there are many.

Only references and deep knowledge of necessary concepts differentiate us.

What concepts should a data and AI expert master

This is our view of what we should know in order to contribute, drawing more on one area or another depending on the project.

Knowledge iceberg of a data and AI expert: Where are you?

Going from the most ambiguous and easiest layers down to the foundations that support everything:

Basic LLM

We’re talking about a person who knows how to use assistants as copilots for generating texts, excerpts, work plans, etc. Here we must know the limits of LLMs, where they are accurate and where they are useful.

Intermediate LLM

This means being able to train a language model to generate specific texts, such as answers to frequently asked questions or text summaries. It requires a deeper understanding of how language models work and how they can be trained, and of how to combine agents and language models to create more complex systems.

Advanced LLM

We’re talking about distilled models and how they can be used to create recommendation systems, real-time text generation systems and translation systems. Also fine-tuning language models for specific tasks, and how to lighten models so they can be run without going broke in the process.
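One common way to lighten a model is quantization: storing weights as small integers instead of floats. The following is a minimal sketch of symmetric 8-bit quantization on a plain list of numbers. The function names and the sample weights are hypothetical, and real toolchains work on tensors with more refined schemes.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.82, -1.27, 0.05, 0.4, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 4), max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small, bounded rounding error. Whether that error is acceptable has to be measured on the task itself.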

Machine Learning Models

Training Machine Learning models, MLOps lifecycle, language models like BERT, Deep Learning, how to design quality tests for Machine Learning models.
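As an illustration of a quality test, one simple gate is to require that a model beats a trivial baseline on held-out data by some margin. This is only a sketch: `toy_model`, the labels, and the `min_gain` threshold are all made up for the example.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_class(labels):
    """Most frequent label, used as a trivial baseline predictor."""
    return Counter(labels).most_common(1)[0][0]


def quality_gate(predict, test_x, test_y, train_y, min_gain=0.10):
    """Pass only if the model beats the majority baseline by min_gain."""
    model_acc = accuracy([predict(x) for x in test_x], test_y)
    baseline_label = majority_class(train_y)
    baseline_acc = accuracy([baseline_label] * len(test_y), test_y)
    return {
        "model": model_acc,
        "baseline": baseline_acc,
        "passed": model_acc >= baseline_acc + min_gain,
    }


def toy_model(score):
    """Hypothetical model: a fixed threshold on a single feature."""
    return "churn" if score > 0.5 else "stay"


train_y = ["stay", "stay", "stay", "churn"]
test_x = [0.1, 0.9, 0.4, 0.7, 0.2]
test_y = ["stay", "churn", "stay", "churn", "stay"]
report = quality_gate(toy_model, test_x, test_y, train_y)
print(report)
```

A gate like this can run in an MLOps pipeline before a model is promoted, so that a model that only learned the majority class never reaches production.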

ML Supervised/Unsupervised | Clustering | Regressions | Reinforcement Learning | Python | Kaggle

Machine Learning

How supervised, unsupervised, clustering, regression models work and which to use according to the problem. Special understanding of how reinforcement learning works and how to leverage it.
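To make the reinforcement learning idea concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit: it mostly exploits the option that looks best so far and occasionally explores another. The reward probabilities, step count and seed are assumptions chosen for illustration.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: learn which arm pays best from rewards alone."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    estimates = [0.0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the arm with the highest estimated reward.
            arm = max(range(len(estimates)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical environment: three options with hidden success rates.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=estimates.__getitem__)
print(best_arm, counts)
```

Unlike supervised learning, nobody tells the agent the right answer. It only sees the reward of the action it took, which is why the balance between exploring and exploiting matters.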

Image Processing

Image classification, object detection, image segmentation. How Deep Learning models can be used for these tasks and how these models can be trained. “Augmentation” process to improve training data.
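As a small sketch of augmentation, the functions below produce flipped and rotated variants of an image stored as a list of pixel rows. The helper names are hypothetical, and real projects would normally use an image library and more transformations such as crops, noise or color shifts.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original plus three label-preserving variants."""
    flipped = flip_horizontal(image)
    return [
        image,
        flipped,
        rotate_90_clockwise(image),
        rotate_90_clockwise(flipped),
    ]


# A tiny 2x3 grayscale "image", for illustration only.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
variants = augment(image)
for variant in variants:
    print(variant)
```

Each variant keeps the same label as the original, so one annotated image yields several training examples. This only holds when the transformation does not change the meaning of the image; flipping a handwritten digit, for instance, may not be safe.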

Data Analysis

Lots of content:

Basic statistics: How data can be analyzed to understand its distribution, trends and relationships between variables.
Data profiling: How data can be analyzed to understand its structure, content and relationships between data objects.
Data cleaning and curation: How data can be cleaned to remove errors and outliers.
Feature extraction: PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), t-SNE (t-distributed Stochastic Neighbor Embedding), UMAP (Uniform Manifold Approximation and Projection).
Data clustering: K-means, DBSCAN, Agglomerative Clustering.
Regression: Linear, Logistic, Polynomial, Ridge, Lasso.
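As an example of the statistics and cleaning items above, here is a minimal sketch that flags outliers with the interquartile range (IQR) rule and shows how a single bad value distorts the mean. The sensor readings and the factor `k=1.5` are assumptions; the right rule depends on the data and the business question.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper limits outside which a value counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values inside the IQR bounds."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical sensor readings with one obvious error.
readings = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 55.0]
clean = remove_outliers(readings)

print("mean before:", statistics.mean(readings))
print("median before:", statistics.median(readings))
print("mean after:", statistics.mean(clean))
```

The median barely moves with the bad reading while the mean does, which is a quick way to spot that something needs curating before any model sees the data.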

Time Series Analysis

A whole world in itself. With all the IoT (or stock market) data coming, it is a field that will grow a lot in the coming years. Here we must know how temporal data can be analyzed to understand trends, seasonality and relationships between variables, and to detect patterns.
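Trend and seasonality can be sketched with very little code: a moving average over one full cycle smooths out the seasonal pattern and leaves the trend, and averaging by position in the cycle gives the seasonal profile. The daily counts below are invented, with a weekend peak and a slow weekly increase.

```python
def moving_average(series, window):
    """Average of each full window; returns len(series) - window + 1 values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def seasonal_profile(series, period):
    """Mean deviation from the overall mean at each position in the cycle."""
    overall = sum(series) / len(series)
    profile = []
    for phase in range(period):
        values = series[phase::period]
        profile.append(sum(values) / len(values) - overall)
    return profile


# Hypothetical data: three weeks of daily counts, weekend peak, upward trend.
week_shape = [10, 10, 10, 10, 10, 20, 20]
series = [base + 2 * week for week in range(3) for base in week_shape]

trend = moving_average(series, window=7)
profile = seasonal_profile(series, period=7)
print([round(t, 2) for t in trend])
print([round(p, 2) for p in profile])
```

Using a window equal to the period (7 days here) is what cancels the weekly pattern out of the trend. With real data the period usually has to be discovered first.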

Text Analysis

The foundation of the first layers of language models. Lots of text mining, NLP and, of course, embeddings, which are the basis of language models. Here we must know how text data can be analyzed to understand its content, structure and relationships between words.
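The core idea behind embeddings is to turn text into vectors and compare them. The sketch below uses plain word counts as a stand-in: real embeddings are dense vectors learned by a model, but the comparison step, cosine similarity, is the same. The documents and the query are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Sparse count vector: token -> number of occurrences."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = [
    "The invoice was paid late",
    "Late payment of the invoice",
    "The cat sleeps on the sofa",
]
query = bag_of_words("invoice paid late")
scores = [cosine_similarity(query, bag_of_words(d)) for d in documents]
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked:
    print(round(score, 3), doc)
```

Word counts only match exact words, so "paid" and "payment" count as unrelated here. Closing that gap is precisely what learned embeddings are for.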

The foundation of everything

Files: CSV, JSON, XML, Parquet, Avro, ORC. Data lakes, Cloud Storage.
SQL: Basic, advanced, query optimization.
Cloud data warehouses: BigQuery, Redshift, Snowflake.
NoSQL: MongoDB, Cassandra, Redis, Elasticsearch.
ETL: Airflow, Luigi, Prefect. Lots of Python.
Architecture: Data lakes, data warehouses (DWH), data marts, data mesh, and services from AWS, Azure, GCP and others.
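Several of these foundations fit in one small sketch: read a CSV file, clean it, load it into a SQL table and query it. This is a toy extract-transform-load flow using only the Python standard library and an in-memory SQLite database. The table, column names and data are hypothetical, and a real pipeline would run under an orchestrator such as the ones listed above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; order 3 is missing its amount.
RAW_CSV = """order_id,customer,amount
1,acme,120.50
2,globex,80.00
3,acme,
4,globex,45.25
"""


def extract(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without an amount and convert fields to proper types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue
        clean.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Create the target table and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The decision taken in `transform` (dropping the incomplete order instead of guessing a value) is the kind of detail that should be agreed with the business, not left to the code.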

Conclusion:

All of the above takes time and is learned through study and practice. There are no shortcuts and there are no magic solutions. Well, sometimes there are, if the problem is simple.

Photo from an AI somewhere
