[AI] Table QA with LLMs - Huge table parser by pseudocode filtering

Jenny • AI team intern (24.01.02~) https://soyoung97.github.io/profile/


Hi, I’m Jenny (Soyoung Yoon), a research intern on the Channel Talk AI team 🙂

Currently, I’m working on advancing the potential of ALF. In particular, I’m focusing on helping LLMs better understand tabular data.

Background: Table QA with LLMs, and issues

The task “Table QA” refers to any question answering task that requires interpreting tables. For example, given a table of weather data (Figure 1), the task is to answer a question such as “What is the temperature in Busan?”. Commonly, in a RAG (Retrieval-Augmented Generation) pipeline, we linearize the table data into a textual format and feed it to the LLM together with the question.

Unlike plain text, a table has a 2-dimensional structure of rows and columns. Feeding plain text into LLMs is relatively intuitive: we just append it to the input, and the LLM reads the text sequentially from left to right. That is not the case for tables: we first need to convert the 2-dimensional tabular data to plain text, a process called linearization (or serialization).
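As a minimal sketch of linearization, assume a table is held as a list of column names plus a list of rows. The helper name `linearize_table` and the `col : ... row 1 : ...` layout are illustrative assumptions, not necessarily the exact format used in the experiments in this post, and the weather values are made up.

```python
from typing import List, Sequence


def linearize_table(columns: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Flatten a 2-D table into a single line of text, row by row.

    The "col : ... row i : ..." layout is an illustrative assumption.
    """
    parts = ["col : " + " | ".join(str(c) for c in columns)]
    for i, row in enumerate(rows, start=1):
        if len(row) != len(columns):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(columns)}")
        parts.append(f"row {i} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)


# Hypothetical weather table in the spirit of Figure 1 (values are made up).
columns = ["city", "temperature"]
rows = [["Seoul", 3], ["Busan", 7]]

# The linearized table and the question are concatenated into one prompt.
prompt = linearize_table(columns, rows) + "\nQuestion: What is the temperature in Busan?"
print(prompt)
```

Because the rows are simply written out one after another, the prompt grows linearly with the number of rows and columns, which is why large tables become a problem.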

Since the overall performance of LLMs may differ depending on the serialization method, many works study effective and efficient ways to linearize table data for LLMs [1][2][3][4]. Some works also point out that the performance of LLMs drops significantly if rows or columns are swapped [5][6], which is undesirable and remains an open problem. Lastly, in real-life scenarios, we often need to handle long, huge tables that do not fit into the context length of LLMs.

Thus, reducing the table size is crucial for Table QA and table understanding with LLMs.

[Figure 1: simple example of a Table QA task]

Overview

  • Real-life data analysts handle a tremendous amount of tabular data, often exceeding the context window size of LLMs.

  • Even if the table fits within the context window of the LLM, finding the information relevant to a question in a large table is hard; it is like finding a needle in a haystack. In this scenario, the lost-in-the-middle problem [7] often occurs, degrading performance.

  • To solve this problem, we propose a two-step RAG framework that (1) generates pseudocode to filter the table down to only the information needed to answer the question and (2) conducts table RAG with the shrunken table (Figure 2).

  • By applying code filtering, we improved both efficiency (shorter LLM input token length) and accuracy.

  • This method can be applied off-the-shelf with any type of LLM.

[Figure 2: Comparison of naive table RAG and table RAG with pseudocode filtering]
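The control flow of the two-step framework can be outlined as a small function. This is only a sketch under assumptions: the three callables (`write_filter`, `apply_filter`, `answer`) and the row-dict table type are hypothetical names introduced here, and the stand-in functions replace real LLM calls so the outline runs on its own.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, str]]  # one dict per row, keyed by column name


def two_step_table_qa(
    table: Table,
    question: str,
    write_filter: Callable[[Table, str], str],
    apply_filter: Callable[[Table, str], Table],
    answer: Callable[[Table, str], str],
) -> str:
    """Step 1 shrinks the table with generated pseudocode; step 2 answers on the result."""
    pseudocode = write_filter(table, question)  # first LLM call (hypothetical callable)
    smaller = apply_filter(table, pseudocode)   # deterministic parsing and filtering
    return answer(smaller, question)            # second LLM call (hypothetical callable)


# Stand-ins so the outline runs without an API key; a real setup would call an LLM.
def fake_write_filter(table: Table, question: str) -> str:
    return "beds == 6"


def fake_apply_filter(table: Table, pseudocode: str) -> Table:
    column, value = [part.strip() for part in pseudocode.split("==")]
    return [row for row in table if row[column] == value]


def fake_answer(table: Table, question: str) -> str:
    return table[0]["name"] if table else ""


toy_table = [{"name": "A", "beds": "6"}, {"name": "B", "beds": "25"}]
result = two_step_table_qa(
    toy_table, "Which hospital has 6 beds?", fake_write_filter, fake_apply_filter, fake_answer
)
print(result)
```

Keeping the filtering step deterministic means only the pseudocode generation and the final answer depend on the model, so either call can be swapped for a different LLM.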

Dataset

We used the WikiTableQuestions [8] dataset (webpage / data viewer). Specifically, we used the pre-processed and linearized version from the StructLM [9] repository. (You can download the full dataset here.) Out of the 11 benchmarks that StructLM was tested on, WikiTableQuestions (WikiTQ) had the longest average input length when tokenized by StructLM, meaning it has the largest tables, so we expected the effect of reducing the table size to be significant.

| Dataset | Average input token length | Average output (answer) length |
|---|---|---|
| FetaQA | 653.2 | 38.8 |
| HybridQA | 700.4 | 6.8 |
| WikiTQ | 831.8 | 5.8 |
| TabMWP | 207.8 | 4.5 |
| ToTTo | 251.8 | 31 |
| MMQA | 656.2 | 7.7 |
| WikiSQL | 689.2 | 7.1 |
| KVRet | 573.4 | 17.1 |
| TabFact | 660.1 | 4.6 |
| Feverous | 799.3 | 3.4 |
| Infotabs | 276.9 | 3.7 |

Example Demo: direct RAG vs. pseudocode filtering

One example of linearized data from WikiTableQuestions is illustrated below:

When visualized, the table contains the following information: (Figure 3)

[Figure 3: Visualized example of given table]

The question is: what is the only hospital to have 6 hospital beds?

From the table, we can find that ‘Vidant Bertie Hospital’ is the correct answer.

1. Direct RAG

Let’s feed this input directly to GPT3.5 and get an answer. (Full code in this colab repo)

After feeding the above input to GPT3.5, we get the following answer:

It fails to produce the correct answer, “vidant bertie hospital”.

2. Pseudocode filtering

Now, we’ll first apply pseudocode filtering to the table and then feed the shrunken table to GPT3.5.

First, we give the following input:

We give a summarized overview of the table: each column name with a list of its possible values (at most 5 distinct values are displayed). Then, we ask the LLM to write simple pseudocode with two properties, “Columns” and “Filters”. The output looks like the following:

Based on this output, we parse the pseudocode and apply the filtering to the table. Lastly, we run inference once more with the shrunken table, using the exact input below:

From this input, GPT3.5 outputs the following:

GPT3.5 correctly outputs the answer.
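The filtering steps of the demo can be sketched in a few functions: build the column overview (at most 5 distinct values per column), parse pseudocode with “Columns” and “Filters” properties, and apply it to the table. The concrete grammar below (`Columns: a, b` and `Filters: a == x and b != y`) is an assumption made for illustration, since the exact pseudocode format may differ, and all function names are hypothetical. Only the Vidant Bertie Hospital row comes from the example; the column names and the other rows are made up.

```python
import re
from typing import Dict, List, Tuple

Table = List[Dict[str, str]]   # one dict per row, keyed by column name
Filter = Tuple[str, str, str]  # (column, operator, value)

# Assumed clause grammar: "<column> == <value>" or "<column> != <value>".
CLAUSE_RE = re.compile(r"^\s*(.+?)\s*(==|!=)\s*(.+?)\s*$")


def summarize_table(table: Table, max_values: int = 5) -> str:
    """List each column with at most `max_values` distinct values (first-seen order)."""
    if not table:
        return ""
    lines = []
    for column in table[0]:
        distinct = list(dict.fromkeys(row[column] for row in table))[:max_values]
        lines.append(f"{column}: {', '.join(distinct)}")
    return "\n".join(lines)


def parse_pseudocode(text: str) -> Tuple[List[str], List[Filter]]:
    """Parse 'Columns: ...' and 'Filters: ...' lines; raise ValueError otherwise."""
    columns: List[str] = []
    filters: List[Filter] = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        key, sep, body = line.partition(":")
        name = key.strip().lower()
        if not sep or name not in ("columns", "filters"):
            raise ValueError(f"unparseable line: {line!r}")
        if name == "columns":
            columns = [c.strip() for c in body.split(",") if c.strip()]
            continue
        for clause in body.split(" and "):
            if not clause.strip():
                continue  # an empty "Filters:" line means no row filtering
            match = CLAUSE_RE.match(clause)
            if match is None:
                raise ValueError(f"unparseable filter: {clause!r}")
            column, op, value = match.groups()
            filters.append((column, op, value.strip("'\"")))
    return columns, filters


def apply_filter(table: Table, columns: List[str], filters: List[Filter]) -> Table:
    """Keep rows matching every filter (case-insensitive), then keep only `columns`."""
    kept: Table = []
    for row in table:
        matches = all(
            (row[col].strip().lower() == val.strip().lower()) == (op == "==")
            for col, op, val in filters
        )
        if matches:
            kept.append({c: row[c] for c in columns} if columns else dict(row))
    return kept


table = [
    {"name": "Vidant Bertie Hospital", "hospital beds": "6", "note": "from the example"},
    {"name": "Hospital B", "hospital beds": "25", "note": "made up"},
    {"name": "Hospital C", "hospital beds": "49", "note": "made up"},
]
overview = summarize_table(table)  # this text would go into the filtering prompt
pseudocode = "Columns: name, hospital beds\nFilters: hospital beds == 6"  # assumed LLM output
columns, filters = parse_pseudocode(pseudocode)
small_table = apply_filter(table, columns, filters)
print(overview)
print(small_table)
```

In this sketch an unknown column name raises `KeyError` and a malformed line raises `ValueError`, so the caller can decide what to do when the generated pseudocode does not fit the grammar.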

Quantitative Analysis

We tested pseudocode filtering on the full WikiTableQuestions test set of 4344 questions. We also applied it on top of StructLM-13B, a model fine-tuned from CodeLlama for general structured knowledge grounding tasks. Pseudocode filtering cuts the input length by more than half and improves accuracy for both GPT3.5 and StructLM-13B.

*The pseudocode filtering was done with GPT3.5, and the price is calculated from the pricing of gpt-3.5-turbo-0125 (0.5$ / 1M input tokens).

| Model | Method | Acc. | Average input token length | Average input token length for filtering prompt |
|---|---|---|---|---|
| GPT3.5 | [before] Direct RAG | 53.0% | 833.7 (0.00042$ / query) | - |
| GPT3.5 | [after] Pseudocode filtering | 55.4% (+2.4%) | 370.8 (-55.5%) (0.00018$ / query) | 509 (0.00025$ / query) |
| StructLM-13B | [before] Direct RAG | 53.1% | 833.7 | - |
| StructLM-13B | [after] Pseudocode filtering* | 53.9% (+0.8%) | 370.8 (-55.5%) | 509 |
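The per-query prices follow from simple arithmetic on the reported average token counts. The sketch below redoes that calculation, assuming only input tokens are billed at the quoted gpt-3.5-turbo-0125 rate (output tokens are ignored); the variable and function names are illustrative.

```python
# Input price for gpt-3.5-turbo-0125 quoted in the footnote: 0.5$ per 1M input tokens.
PRICE_PER_INPUT_TOKEN = 0.5 / 1_000_000


def input_cost(tokens: float) -> float:
    """Input-side cost in dollars for one query; output tokens are ignored."""
    return tokens * PRICE_PER_INPUT_TOKEN


direct_tokens = 833.7         # Direct RAG prompt
filtered_tokens = 370.8       # answer prompt after pseudocode filtering
filter_prompt_tokens = 509.0  # extra prompt used to generate the pseudocode

reduction = 1 - filtered_tokens / direct_tokens          # relative shrinkage of the answer prompt
total_after = filtered_tokens + filter_prompt_tokens     # both LLM calls together

print(f"direct RAG:       {input_cost(direct_tokens):.5f} $ / query")
print(f"answer prompt:    {input_cost(filtered_tokens):.5f} $ / query")
print(f"filtering prompt: {input_cost(filter_prompt_tokens):.5f} $ / query")
print(f"answer prompt is {reduction:.1%} shorter than direct RAG")
print(f"both calls together: {total_after:.1f} tokens")
```

The filtered answer prompt is about 44% of the original length, i.e. a reduction of roughly 55.5%. Counting the filtering prompt as well, the two calls together use 879.8 input tokens on average, slightly more than the 833.7 of direct RAG, which is the overhead discussed under Next Steps.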

 

Next Steps

1. Reducing the overhead

  • Even though pseudocode filtering cuts the input token length by more than half, we also have to take into account the additional inference needed for pseudocode generation, shown in the “Average input token length for filtering prompt” column of the results table. We could try to reduce this overhead and improve pseudocode generation by using smaller models specialized for code generation.

2. Pseudocode parsing failure

  • Although code parsing failures are rare, there can be cases where the generated code doesn’t conform to the grammar. In these cases, we apply a heuristic that reverts to the original table. A carefully designed prompt may help partially resolve this issue.

3. Improving recall

  • Although table QA with LLMs benefits overall from pseudocode filtering, there were also cases where information needed to answer the question was discarded by the filtering. A partial solution is to append the original table data before the shrunken table, but this increases the token length. Thus, improving recall is crucial for applying pseudocode filtering effectively to Table QA tasks.
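For the parsing-failure case in item 2, the revert-to-original heuristic can be sketched as a thin wrapper around whatever filtering function is in use. The names `filter_with_fallback` and `toy_apply` are hypothetical, and the toy grammar (`<column> == <value>`) is an assumption that stands in for a real pseudocode parser.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, str]]  # one dict per row, keyed by column name


def filter_with_fallback(
    table: Table,
    pseudocode: str,
    apply_pseudocode: Callable[[Table, str], Table],
) -> Table:
    """Return the filtered table, or the original table if the pseudocode cannot be applied."""
    try:
        return apply_pseudocode(table, pseudocode)
    except (ValueError, KeyError):
        # Malformed pseudocode or an unknown column: revert to the original table.
        return table


def toy_apply(table: Table, pseudocode: str) -> Table:
    """Toy stand-in for a real parser: accepts only '<column> == <value>'."""
    column, sep, value = pseudocode.partition("==")
    if not sep:
        raise ValueError("expected '<column> == <value>'")
    return [row for row in table if row[column.strip()] == value.strip()]


beds_table = [{"beds": "6"}, {"beds": "25"}]
ok = filter_with_fallback(beds_table, "beds == 6", toy_apply)
malformed = filter_with_fallback(beds_table, "keep the small ones", toy_apply)
unknown_column = filter_with_fallback(beds_table, "rooms == 6", toy_apply)
print(ok, malformed, unknown_column)
```

The fallback trades efficiency for safety: a failed filter costs the full table length for that query, but the model never answers from a table that was filtered incorrectly because of a parsing error.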

Conclusion

There are many works and papers on effectively applying LLMs to tabular data.

We’ve seen how tabular data can be fed to LLMs through table linearization, and how a Table QA task is formulated. Specifically, we’ve shown with a simple demo that pseudocode filtering can improve both the efficiency and the performance of Table QA. There is also a lot of work on reducing and decomposing input tables [10][11], integrating symbolic reasoning into LLMs [12][13], and applying complex, multi-step reasoning pathways to guide better generation [14][15]. Try it out with your own table data in this colab repo, and if you want to know more about handling table data with LLMs, consider joining our AI team! 🙂


References

[1] Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data

[2] Large Language Models are Complex Table Parsers

[3] Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios

[4] GPT4Table: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

[5] ROBUT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations

[6] Rethinking Tabular Data Understanding with Large Language Models

[7] Lost in the Middle: How Language Models Use Long Contexts

[8] Compositional Semantic Parsing on Semi-Structured Tables

[9] StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

[10] CABINET: Content Relevance based Noise Reduction for Table Question Answering

[11] Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning

[12] API-Assisted Code Generation for Question Answering on Varied Table Structures

[13] Binder: Binding Language Models in Symbolic Languages

[14] ReAct: Synergizing Reasoning and Acting in Language Models

[15] ReAcTable: Enhancing ReAct for Table Question Answering
