Jenny • AI team intern (24.01.02~) https://soyoung97.github.io/profile/
July 8
Hi, I’m Jenny (Soyoung Yoon), a research intern on the Channel Talk AI team 🙂
Currently, I’m working on advancing the potential of ALF. In particular, I’m focusing on helping LLMs better understand tabular data.
The task “Table QA” refers to any question answering task that requires interpreting tables. For example, given a table of weather data (Figure 1), the task is to answer a question such as “What is the temperature in Busan?”. Commonly, we use a RAG (Retrieval-Augmented Generation) pipeline: we linearize the table into a textual format and feed it to the LLM, together with the question, as input.
Unlike plain text, tabular data has a 2-dimensional structure: rows and columns. Feeding plain text into LLMs is relatively intuitive: we just append it as input, and the LLM reads the text sequentially from left to right. That is not the case for tables: we first need to convert the 2-dimensional tabular data into plain text, a process called linearization (serialization).
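As a minimal sketch of what linearization can look like (assuming pandas and a “col : … row 1 : …” style similar to common table-QA pipelines; the exact format varies across papers and is not necessarily the one used later in this post):

```python
import pandas as pd

def linearize_table(df: pd.DataFrame) -> str:
    """Flatten a 2-D table into one left-to-right string an LLM can read.

    The "col : ... row 1 : ..." style is illustrative, not an exact spec.
    """
    parts = ["col : " + " | ".join(map(str, df.columns))]
    for i, (_, row) in enumerate(df.iterrows(), start=1):
        parts.append(f"row {i} : " + " | ".join(map(str, row.values)))
    return " ".join(parts)

# Toy weather table in the spirit of Figure 1 (values made up for illustration)
weather = pd.DataFrame({"City": ["Seoul", "Busan"], "Temperature (°C)": [3, 8]})
print(linearize_table(weather))
# col : City | Temperature (°C) row 1 : Seoul | 3 row 2 : Busan | 8
```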
Since the overall performance of LLMs can differ depending on the serialization method, many works study effective (and efficient) ways to linearize table data for LLMs [1][2][3][4], and some works also point out that LLM performance drops significantly if we swap rows or columns [5][6], which is undesirable and still an open problem. Lastly, in real-life scenarios, there are many cases where we need to handle long, huge tables that do not fit into the context length of LLMs.
Thus, reducing the table size is crucial for Table QA and table understanding with LLMs.
[Figure 1: simple example of a Table QA task]
Real-life data analysts handle a tremendous amount of tabular data, often exceeding the context window size of LLMs.
Even if the table doesn’t exceed the context window size of the LLM, finding the information relevant to the question in a large table is hard: it’s like finding a needle in a haystack. In this scenario, the lost-in-the-middle problem [7] often occurs, hurting performance.
To solve this problem, we propose a two-step RAG framework that (1) generates pseudocode to filter the table so that it contains only the information needed to answer the question, and (2) conducts table RAG with the shrunken table. (Figure 2)
By applying code filtering, we improve efficiency (the input token length fed to the LLM is reduced) along with accuracy.
This method can be applied off the shelf with any type of LLM.
We used the WikiTableQuestions [8] dataset (webpage / data viewer). Specifically, we used the pre-processed and linearized version from the StructLM [9] repository. (You can download the full dataset here.) Out of the 11 table QA benchmarks that StructLM was tested on, WikiTableQuestions had the longest input length when tokenized by StructLM (i.e., the largest tables), so we expected the effect of reducing the table size to be significant.
| Benchmark | Average input token length | Average output (answer) length |
|---|---|---|
| FetaQA | 653.2 | 38.8 |
| HybridQA | 700.4 | 6.8 |
| WikiTQ | 831.8 | 5.8 |
| TabMWP | 207.8 | 4.5 |
| ToTTo | 251.8 | 31 |
| MMQA | 656.2 | 7.7 |
| WikiSQL | 689.2 | 7.1 |
| KVRet | 573.4 | 17.1 |
| TabFact | 660.1 | 4.6 |
| Feverous | 799.3 | 3.4 |
| Infotabs | 276.9 | 3.7 |
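For reference, here is a rough way to reproduce this kind of token-length statistic. This is a hedged sketch: it loads the wikitablequestions dataset from the HuggingFace Hub and uses a CodeLlama tokenizer as a stand-in for StructLM’s exact preprocessing, so the numbers will not match the table above exactly.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# WikiTableQuestions as hosted on the HuggingFace Hub; depending on your `datasets`
# version you may need trust_remote_code=True. Field names follow the Hub version
# (example["table"] has "header" and "rows"), not StructLM's preprocessed release.
wikitq = load_dataset("wikitablequestions", split="test")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-hf")  # CodeLlama family (StructLM's base)

def rough_input_length(example) -> int:
    # Naive linearization: header row followed by data rows, cells joined with " | "
    table = example["table"]
    text = " | ".join(table["header"])
    for row in table["rows"]:
        text += " " + " | ".join(row)
    return len(tokenizer(example["question"] + " " + text)["input_ids"])

sample = wikitq.select(range(100))  # small sample for speed
print(sum(rough_input_length(ex) for ex in sample) / len(sample))
```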
The full end-to-end code can be found in this Colab notebook.
One example of linearized data from WikiTableQuestions is illustrated below:
When visualized, the table contains the following information: (Figure 3)
[Figure 3: Visualized example of given table]
The question is: what is the only hospital to have 6 hospital beds?
From the table, we can find that ‘Vidant Bertie Hospital’ is the correct answer.
Let’s directly feed this input to GPT3.5 and get an answer. (Full code in this Colab notebook)
After feeding the above input to GPT3.5, we get the following answer:
It fails to produce the correct answer, “vidant bertie hospital”.
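For reference, the direct-RAG call boils down to something like the sketch below. The prompt wording and variable names are illustrative, not the exact ones used in the Colab notebook.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

linearized_table = "..."  # the linearized WikiTableQuestions table shown above (truncated here)
question = "what is the only hospital to have 6 hospital beds?"

prompt = (
    "Answer the question based on the table below. Reply with the answer only.\n\n"
    f"Table:\n{linearized_table}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```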
Now, we’ll first conduct pseudocode filtering on the table and then feed the shrunken table to GPT3.5.
First, we give the model the following input:
We give a summarized overview of the table, in the format of column name followed by a list of possible values (at most 5 distinct values are displayed per column). Then, we ask the LLM to write a simple pseudocode with two properties, “Columns” and “Filters”. The output looks like the following:
According to the output, we parse the pseudocode and apply the filters to the table (we sketch this step in code after the example below). Lastly, we conduct inference once again with the shrunken table, with the exact input below:
From this input, GPT3.5 outputs the following:
GPT3.5 correctly outputs the answer.
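Putting the filtering machinery together, here is a hedged sketch of the overview prompt and the pseudocode parsing/filtering step. The function names, prompt wording, and the expected “Columns: […]” / “Filters: […]” output format are illustrative; the Colab notebook contains the actual implementation.

```python
import re
import pandas as pd

def table_overview(df: pd.DataFrame, max_values: int = 5) -> str:
    """Summarize each column as 'name: [up to 5 distinct values]' for the filtering prompt."""
    lines = []
    for col in df.columns:
        values = df[col].astype(str).unique()[:max_values]
        lines.append(f"{col}: [{', '.join(values)}]")
    return "\n".join(lines)

# Illustrative prompt asking the LLM for pseudocode with "Columns" and "Filters" properties
FILTER_PROMPT = """Given a table summary and a question, write a simple pseudocode
that keeps only the information needed to answer the question, in this format:
Columns: [<columns to keep>]
Filters: [<column> == <value>, ...]

Table summary:
{overview}

Question: {question}"""

def apply_pseudocode(df: pd.DataFrame, pseudocode: str) -> pd.DataFrame:
    """Keep only the columns listed under 'Columns' and the rows matching 'Filters'."""
    cols_match = re.search(r"Columns:\s*\[(.*?)\]", pseudocode, re.S)
    filt_match = re.search(r"Filters:\s*\[(.*?)\]", pseudocode, re.S)
    filtered = df
    if filt_match and filt_match.group(1).strip():
        for cond in filt_match.group(1).split(","):
            if "==" in cond:
                col, val = (x.strip().strip("'\"") for x in cond.split("==", 1))
                if col in filtered.columns:
                    filtered = filtered[filtered[col].astype(str) == val]
    if cols_match and cols_match.group(1).strip():
        cols = [c.strip().strip("'\"") for c in cols_match.group(1).split(",")]
        keep = [c for c in cols if c in filtered.columns]
        if keep:
            filtered = filtered[keep]
    return filtered
```

The shrunken table is then linearized again (as in the first sketch) and sent to the model together with the question.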
We tested pseudocode filtering on the full WikiTableQuestions test set of 4,344 questions. We also applied it on top of StructLM-13B, a model fine-tuned from CodeLlama for general structured knowledge grounding tasks. We found that pseudocode filtering decreases input length by more than half and improves output accuracy for both GPT3.5 and StructLM-13B.
*The pseudocode filtering was done with GPT3.5, and the price is calculated from the pricing of gpt-3.5-turbo-0125 ($0.50 / 1M input tokens).
| Model | Setting | Acc. | Average input token length | Average input token length for filtering prompt |
|---|---|---|---|---|
| GPT3.5 | [before] Direct RAG | 53.0% | 833.7 ($0.00042 / query) | - |
| GPT3.5 | [after] Pseudocode filtering | 55.4% (+2.4%) | 370.8 (-44%) ($0.00018 / query) | 509 ($0.00025 / query) |
| StructLM-13B | [before] Direct RAG | 53.1% | 833.7 | - |
| StructLM-13B | [after] Pseudocode filtering* | 53.9% (+0.8%) | 370.8 (-44%) | 509 |
Even though pseudocode filtering more than halves the input token length, we also have to account for the additional inference required to generate the pseudocode, shown in the “Average input token length for filtering prompt” column of the results table. We could reduce this overhead and improve pseudocode generation by using smaller models specialized for code generation.
Although code parsing failures are rare, there can be cases where the generated code doesn’t fit the expected grammar. In such cases, we apply a heuristic that reverts to the original table. A carefully designed prompt may help partially resolve this issue.
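In code, that heuristic is just a guard around the filtering step (reusing the `apply_pseudocode` sketch from above; `filter_or_revert` is an illustrative name):

```python
import pandas as pd

def filter_or_revert(table: pd.DataFrame, pseudocode: str) -> pd.DataFrame:
    """Apply the generated pseudocode; revert to the original table if it fails."""
    try:
        small = apply_pseudocode(table, pseudocode)
        return table if small.empty else small   # nothing left -> keep the full table
    except Exception:                            # pseudocode that doesn't parse or execute
        return table
```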
Although Table QA with LLMs benefits overall from pseudocode filtering, there were also cases where information needed to answer the question was discarded by the filtering. A partial solution is to prepend the original table to the shrunken table, but this increases the token length again. Thus, improving recall is crucial for effectively applying pseudocode filtering to Table QA tasks.
There are many works and papers on effectively applying LLMs to tabular data.
We’ve seen how tabular data can be fed to LLMs via table linearization, and how a Table QA task is formulated. Specifically, we’ve shown with a simple demo that pseudocode filtering can improve both the efficiency and the performance of Table QA. In fact, there is a lot of work on reducing and decomposing input tables [10][11], integrating symbolic reasoning into LLMs [12][13], and applying complex, multi-step reasoning pathways to guide better generation [14][15]. Try playing with your own table data in this Colab notebook, and if you want to learn more about handling table data with LLMs, consider joining our AI team! 🙂
[2] Large Language Models are Complex Table Parsers
[5] ROBUT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations
[6] Rethinking Tabular Data Understanding with Large Language Models
[7] Lost in the Middle: How Language Models Use Long Contexts
[8] Compositional Semantic Parsing on Semi-Structured Tables
[9] StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
[10] CABINET: Content Relevance based Noise Reduction for Table Question Answering
[12] API-Assisted Code Generation for Question Answering on Varied Table Structures
[13] Binder: Binding Language Models in Symbolic Languages
[14] ReAct: Synergizing Reasoning and Acting in Language Models
[15] ReAcTable: Enhancing ReAct for Table Question Answering