How Data Science is Changing in the Era of LLMs

David Furrer
Jan 23, 2024

Data science is a multidisciplinary field that combines statistical methods, machine learning, data analysis, and domain expertise to extract meaningful insights from data. This article explores how Large Language Models (LLMs) are transforming various subfields within data science. Many problems that once required a specialized model and a data scientist to build it can now be handled by a general-purpose LLM, which makes these tasks more accessible.

Prominent Subfields within Data Science

This article will focus on the impact of LLMs in the following subfields of data science:

Natural Language Processing (NLP)
Business Intelligence (BI) / Reporting
Ad-hoc Business Analysis
Time Series Analysis

Natural Language Processing (NLP)

NLP is one of the subfields that LLMs have changed the most. Traditionally, each NLP problem required its own specialized model and a data scientist to build it. LLMs like GPT-4 now serve as a single, versatile solution for a wide range of NLP tasks:

Text Classification: LLMs can perform text classification tasks with high accuracy.
Question Answering: They excel at answering questions based on textual data.
Topic Modeling: LLMs can discover abstract topics within documents.
Named Entity Recognition (NER): LLMs can extract names of people, locations, email addresses, and similar entities.
Sentiment Analysis: LLMs are proficient at determining the sentiment of text.
Text Summarization: Summarizing lengthy text documents is another task they can handle effectively.
Text Translation: LLMs can translate text from one language to another.

Moreover, LLMs open up new possibilities in NLP due to their impressive reasoning abilities and deep language understanding. For instance, NER can be extended to identify more abstract language categories, including "arguments," "ad-hominem attacks," "political statements," or "mentions of a certain topic." In effect, LLMs make it possible to handle a wide range of language-related operations and categories programmatically.

For practical examples of this, refer to Jason Liu's blog article published by Weights & Biases. In general, LLMs excel at extracting structured data from unstructured text.
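To make the idea of extracting structured data more concrete, here is a minimal sketch in Python. It is not taken from the article mentioned above. The label set, the `Span` class, and the `llm` callable are assumptions made for illustration; `fake_llm` is a stand-in that returns a canned answer, so the example runs without any API access.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical label set, based on the abstract categories mentioned above.
LABELS = ["argument", "ad-hominem attack", "political statement"]


@dataclass
class Span:
    text: str
    label: str


def build_prompt(document: str, labels: List[str]) -> str:
    """Ask the model for a JSON list of {"text": ..., "label": ...} objects."""
    return (
        "Extract every passage from the document that matches one of these "
        f"categories: {', '.join(labels)}.\n"
        'Answer only with a JSON list of objects with the keys "text" and "label".\n\n'
        f"Document:\n{document}"
    )


def extract_spans(document: str, llm: Callable[[str], str],
                  labels: List[str] = LABELS) -> List[Span]:
    """Call the model, parse its JSON answer, and drop anything invalid."""
    raw = llm(build_prompt(document, labels))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(items, list):
        return []
    spans = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        # Keep only known labels whose text really occurs in the document.
        if isinstance(text, str) and label in labels and text in document:
            spans.append(Span(text=text, label=label))
    return spans


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (an assumption for this sketch)."""
    return json.dumps([
        {"text": "You are clearly too lazy to read the report.",
         "label": "ad-hominem attack"},
        {"text": "This sentence is not in the document.", "label": "argument"},
    ])


doc = "The numbers are wrong. You are clearly too lazy to read the report."
spans = extract_spans(doc, fake_llm)
print(spans)
```

The validation step matters in practice: a model can return malformed JSON, unknown labels, or passages that do not appear in the input, so the sketch keeps only items it can verify against the document.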

Business Intelligence (BI) / Reporting

Many companies generate regular reports summarizing internal Key Performance Indicators (KPIs) using PDFs or dashboards from Business Intelligence (BI) tools. The report creation process involves defining essential KPIs, extracting relevant data (often via SQL queries), aggregating data into tables or charts, and providing written explanations of the findings.

With the help of LLMs combined with simple Python code, these report generation tasks can be automated as long as the report structure remains consistent. The main challenge lies in accessing the internal knowledge of a company, which often resides with individuals rather than in documentation. A full text-to-report system would require an "AutoGPT"-like system with access to all company knowledge and data. Current LLMs are not yet fully equipped for this, but they are getting closer.
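The fixed-structure case described above could look roughly like the following sketch. The `orders` table, its columns, the KPI query, and the `write_commentary` callable are all hypothetical; the callable stands in for an LLM call that turns the KPI table into a written explanation.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical KPI query; table and column names are made up for this sketch.
KPI_QUERY = """
    SELECT month, SUM(revenue) AS revenue, COUNT(*) AS orders
    FROM orders
    GROUP BY month
    ORDER BY month
"""


def load_demo_data(conn: sqlite3.Connection) -> None:
    """Create a tiny in-memory table so the example is self-contained."""
    conn.execute("CREATE TABLE orders (month TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("2023-11", 120.0), ("2023-11", 80.0),
         ("2023-12", 150.0), ("2023-12", 90.0), ("2023-12", 60.0)],
    )


def kpi_table(rows: List[Tuple[str, float, int]]) -> str:
    """Format the aggregated rows as a fixed-width text table."""
    lines = [f"{'Month':<8} {'Revenue':>10} {'Orders':>7}"]
    for month, revenue, orders in rows:
        lines.append(f"{month:<8} {revenue:>10.2f} {orders:>7d}")
    return "\n".join(lines)


def build_report(conn: sqlite3.Connection,
                 write_commentary: Callable[[str], str]) -> str:
    """Query the KPIs, format them, and ask the model for a short explanation."""
    rows = conn.execute(KPI_QUERY).fetchall()
    table = kpi_table(rows)
    prompt = "Explain the key changes in this KPI table in two sentences:\n" + table
    return f"Monthly KPI report\n\n{table}\n\n{write_commentary(prompt)}"


conn = sqlite3.connect(":memory:")
load_demo_data(conn)
# Placeholder instead of a real model call (an assumption for this sketch).
report = build_report(conn, lambda prompt: "(model-written commentary goes here)")
print(report)
```

Only the commentary step involves a model here. The query and the table layout stay deterministic, which is what makes a consistent report structure easy to automate.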

Ad-hoc Business Analysis

Ad-hoc business analyses, which typically involve quickly generating reports to answer specific questions using data, have become more manageable with LLMs like GPT-4. Users can upload CSV or Excel files and formulate questions. GPT-4 can then generate and execute Python code to obtain the relevant data and even create visualizations. However, this approach requires users to be skilled in prompting and to know which questions to ask. Additionally, some companies restrict uploading data to external services, but these capabilities are likely to become available on-prem as open-source models continue to improve.
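The generate-and-execute loop described above can be sketched in a few lines of Python. This is an illustration of the general pattern, not a description of how GPT-4's own tooling works. The `generate_code` callable is a hypothetical stand-in for the model, and `fake_generate_code` returns a canned snippet so the example runs offline.

```python
import csv
import io
from typing import Any, Callable, Dict, List


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def answer_question(csv_text: str, question: str,
                    generate_code: Callable[[str], str]) -> Any:
    """Ask the model for analysis code, run it, and return its `result`."""
    rows = load_rows(csv_text)
    columns = list(rows[0].keys()) if rows else []
    prompt = (
        f"The variable `rows` is a list of dicts with the keys {columns}. "
        "All values are strings. Write Python code that stores the answer to "
        f"the following question in a variable named `result`.\n\nQuestion: {question}"
    )
    code = generate_code(prompt)
    namespace: Dict[str, Any] = {"rows": rows}
    # WARNING: exec runs arbitrary code. A real system would need a sandbox.
    exec(code, namespace)
    return namespace.get("result")


def fake_generate_code(prompt: str) -> str:
    """Stand-in for the model (an assumption): returns a canned snippet."""
    return (
        "totals = {}\n"
        "for row in rows:\n"
        "    region = row['region']\n"
        "    totals[region] = totals.get(region, 0.0) + float(row['sales'])\n"
        "result = max(totals, key=totals.get)\n"
    )


data = "region,sales\nNorth,100\nSouth,250\nNorth,200\n"
best_region = answer_question(data, "Which region has the highest total sales?",
                              fake_generate_code)
print(best_region)
```

Passing only the column names to the model, rather than the full data, is one possible design choice for companies that restrict what may leave their systems. The generated code still runs locally against the complete file.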

Time Series Analysis

Time series analysis, a statistical technique for analyzing time-ordered data points, is essential for forecasting future values and understanding underlying patterns, seasonality, and trends. Traditional models like ARIMA, VAR, ETS, and Prophet have been widely used for this, but the Transformer architecture behind LLMs offers a different approach.
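For readers unfamiliar with the traditional models, the following sketch shows simple exponential smoothing, the most basic member of the exponential smoothing family that ETS belongs to. It is a plain-Python illustration rather than a production implementation; the function name and the default `alpha` are choices made for this example.

```python
from typing import List, Sequence


def simple_exponential_smoothing(series: Sequence[float], alpha: float = 0.5,
                                 horizon: int = 3) -> List[float]:
    """Forecast `horizon` steps ahead with simple exponential smoothing.

    The level is updated as: level = alpha * value + (1 - alpha) * level.
    The method models no trend or seasonality, so the forecast is flat.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialize the level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon


forecast = simple_exponential_smoothing([10, 12, 14], alpha=0.5, horizon=2)
print(forecast)  # [12.5, 12.5]
```

A model like this has to be fitted separately for each series, which is the step that pre-trained approaches aim to skip.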

Text can be viewed as a kind of time series: it is encoded as a sequence of tokens whose order matters, much like observations ordered in time. This makes the Transformer architecture behind LLMs a natural candidate for modeling other sequential data as well. In October 2023, Nixtla introduced "TimeGPT-1", a foundation model for time series forecasting. Pre-trained on 100 billion time series data points, it offers zero-shot inference on new, unseen data. The results in the paper indicate superior performance, especially on monthly and weekly time scales. The API is still in closed beta, but the results suggest that LLM-style models could change time series analysis much as LLMs changed NLP.

Graphic 1: Performance results of TimeGPT (screenshot taken from the TimeGPT-1 paper, p. 7)
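The parallel between text and time series can be illustrated with a short sketch. The same windowing function produces training pairs for next-token prediction and for next-value forecasting. This only illustrates the framing and makes no claim about how TimeGPT is implemented; the function name and the sample data are made up for the example.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_windows(sequence: Sequence[T],
                 context_length: int) -> List[Tuple[List[T], T]]:
    """Split a sequence into (context, next element) pairs."""
    pairs = []
    for end in range(context_length, len(sequence)):
        context = list(sequence[end - context_length:end])
        pairs.append((context, sequence[end]))
    return pairs


# Text: predict the next token from the previous three tokens.
tokens = ["LLMs", "are", "changing", "data", "science"]
token_pairs = make_windows(tokens, context_length=3)

# Time series: predict the next value from the previous three values.
weekly_sales = [5, 7, 6, 8, 9]
sales_pairs = make_windows(weekly_sales, context_length=3)

print(token_pairs)
print(sales_pairs)
```

In both cases the model sees an ordered context and has to predict what comes next, which is why an architecture built for token sequences can be applied to numeric sequences.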

Final Thoughts

In conclusion, LLMs are reshaping data science by simplifying complex tasks in NLP, business intelligence, ad-hoc analysis, and even time series analysis. By taking over work that previously required specialized models, they make data science more accessible and more efficient.