[Image: a cartoon of a helpful robot scanning a beam of light over stacks of books (AI-generated with Pixel Art XL)]

Transparency is critical in a functioning democracy. At the heart of this transparency lie open records laws, also known as sunshine laws or Freedom of Information Act (FOIA) laws, which grant the public access to government documents and information. Despite the fundamental importance of these laws, navigating their intricacies has long presented a significant challenge for the general public.

New technologies, including Large Language Models (LLMs), embeddings, and vector databases, present novel ways of tackling this problem in a relatively inexpensive and scalable way. These are extremely powerful tools that enable computers to solve problems that were, until very recently, only solvable by humans. Newer releases of OpenAI’s GPT-4 and Anthropic’s Claude 3 are able to behave as AI “Agents,” solving multi-step problems using tools, advanced reasoning, and increasingly long “context windows.”

To that end, I have been working on a few experiments on streamlining research into open records laws and potentially answering user questions about the process of requesting records. You can see a static demo at the bottom of this article or on the HuggingFace website. To explain a bit of what went into it…

 

First, A Primer On Open Records Law in Kentucky

Don’t worry, we won’t wade too far into the weeds here.

Access to government records in our state is enshrined by the Kentucky Open Records Act (KORA), originally established in 1976 in the Kentucky Revised Statutes (KRS). The dozen statutes currently comprising the KORA define what a public record is in our state, who can request to view or copy them, what’s exempt, the request and appeals process, and agency requirements in applying the law and responding to requests.

Residents of Kentucky, broadly defined, are able to request to view or copy public records from any body meeting the definition of a public agency in Kentucky. If they don’t like the answer they get, they can appeal to either the Attorney General of Kentucky or the Circuit Court in the county where the offending agency is located. Circuit court appeals, like any other case, can escalate to the Court of Appeals or the Kentucky Supreme Court.

As part of our centuries-old system of common law, decisions reached in cases by the Kentucky Supreme Court and Court of Appeals, unless marked by the court as Unpublished, set state-wide precedent for how the law should be applied (at least, until the General Assembly gets upset at a decision and changes the law, like with this year’s HB 509).

Employees of the Legislative Research Commission (LRC) have, over the years, maintained a growing list of annotations of precedent-setting appellate cases and Attorney General decisions for all of the statutes in the KRS. Here’s an example annotation from a well-publicized case against the UofL Foundation:

Future donors to the University of Lousiville (sic) Foundation are on notice that their gifts are being made to a public institution and, therefore, are subject to disclosure under KRS 61.871 regardless of any requests for anonymity. Cape Pub'ns, Inc. v. Univ. of Louisville Found., Inc., 260 S.W.3d 818, 2008 Ky. LEXIS 176 (Ky. 2008).

Side note - I didn’t even think about this as I copied this annotation, but this is another case that eventually led to a legislative change, this one to shield donor identity!

 

Searching the Annotations

You might expect the annotations to be part of the “official” KRS published online by the LRC, but that is sadly not the case. For decades, although relied upon by lawyers and judges across the state in interpreting the law, these annotations were restricted to the expensive “certified” KRS editions published by the legal research duopoly of LexisNexis and Thomson Reuters.

We in Kentucky, along with many other states, owe an ongoing debt of gratitude to Carl Malamud and his non-profit Public.Resource.Org for fighting a case against the state of Georgia all the way to the Supreme Court of the United States, a case that declared these primary legal materials not subject to copyright. And, lest you think the publishers are now giving them away, allow me to disabuse you of that notion. Carl has earned the hard-won right to purchase copies of state law from Lexis for a growing number of states, including Kentucky, and to republish them as public domain records on archive.org.

I wish I could say that was the end of it and now everyone can use them easily, but alas, the text of the KRS is packed into 51 dense tomes, seemingly designed to be as accessible as the double helix strands of DNA. The KORA is one small part of Chapter 61, itself one part of Title VIII of the KRS. Each of the Titles uses a single rich text file (RTF, like notepad with fonts) to store the entirety of the statutes, annotations, resource material citations and other goodies that comprise it. There are 3.4 million characters of text in Title VIII alone.
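
To give a feel for the first hurdle, here’s a minimal sketch of just recovering plain text from one of those Title files, assuming the third-party striprtf package (the filename here is hypothetical):

```python
# Minimal sketch: pull the plain text out of a Title's RTF file
# (pip install striprtf). This recovers only the raw text; all of the
# structure still has to be parsed out of it afterward.
from striprtf.striprtf import rtf_to_text

with open("krs_title_viii.rtf", encoding="utf-8") as f:  # hypothetical filename
    text = rtf_to_text(f.read())

print(len(text))  # on the order of 3.4 million characters for Title VIII
```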

When I decided to try these experiments, I first considered just copying and pasting the sections of text I was interested in and using those to build a searchable database. There is a growing array of techniques for “chunking” large bodies of text so that the smaller chunks can be indexed and searched. The difficulty here is that a statute is often only correctly interpreted in its full context, or even with the context of neighboring statutes. Annotations, similarly, are only correctly understood in their entirety, isolated from neighboring annotations but still within the context of the statute they are associated with.
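
To make that concern concrete, here is one way to sketch context-preserving chunking in Python. The Statute container and field names are hypothetical, but the idea is that each annotation stays whole and carries a pointer back to the statute it interprets:

```python
# Sketch of context-aware chunking: one chunk per annotation, kept whole,
# tagged with its parent statute so hits can be re-expanded into context.
# The Statute class and its field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Statute:
    number: str            # e.g. "KRS 61.872"
    heading: str
    text: str
    annotations: list[str]

def chunk_statute(statute: Statute) -> list[dict]:
    chunks = [{
        "id": statute.number,
        "text": statute.text,  # the statute itself is one chunk, in full
        "metadata": {"kind": "statute", "statute": statute.number},
    }]
    for i, note in enumerate(statute.annotations):
        chunks.append({
            "id": f"{statute.number}-ann-{i}",
            "text": note,  # each annotation is kept whole, never split
            "metadata": {"kind": "annotation", "statute": statute.number},
        })
    return chunks
```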

For data scientists, this is considered “unstructured” data that we would like to turn into structured data. Given a bible, we want a queryable spreadsheet of individual verses.

This is a problem I’ve thought about off and on for a few years now. Since Carl began republishing the KRS, I’ve wanted to find a good way to make the annotations for the KORA and its sister Open Meetings Act available on this website, and generally more machine-readable to use in applications.

And again, I was able to turn to work done by the excellent Public.Resource.Org as a source of inspiration and starting momentum. Recognizing the accessibility issues inherent to the giant text files, the beautify-state-codes project converts the RTF files into HTML files viewable through a web browser, with embedded anchor links to scroll to relevant sections.

Using this project as a baseline, I rewrote the file parser to extract statutes and their associated annotation data into two spreadsheets for each Title. The parser does not have 100% coverage of all titles, currently failing on Titles 29 and 40, but does a reasonably good job of extracting the data into a more machine-accessible format.
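
The real parser handles many edge cases across the Titles, but a toy version of the statute-extraction step might look like this (the regex and CSV columns are illustrative, and the plain text is assumed to come from an RTF conversion like the one sketched earlier):

```python
# Toy sketch: split a Title's plain text into statutes and write one
# spreadsheet row per statute. The heading regex is illustrative only.
import csv
import re

STATUTE_HEADING = re.compile(r"^(\d{1,3}[A-Z]?\.\d{3})\s+(.+)$")

def extract_statutes(title_text: str, out_csv: str) -> None:
    statutes, current = [], None
    for line in title_text.splitlines():
        match = STATUTE_HEADING.match(line)
        if match:  # start of a new statute
            current = {"number": match.group(1), "heading": match.group(2), "body": []}
            statutes.append(current)
        elif current is not None:  # body text belongs to the last heading seen
            current["body"].append(line)
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["number", "heading", "text"])
        for s in statutes:
            writer.writerow([s["number"], s["heading"], "\n".join(s["body"]).strip()])

# Annotations get the same treatment into a second spreadsheet, keyed by
# statute number so each annotation keeps its statutory context.
```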

 

As Little As Can Be Said On Embeddings and Vector Databases

Two technologies that have spun out of LLMs are text embeddings and vector databases. This isn’t meant to be a technical deep dive, and each is an area you can spend a lot of time trying to understand fully, but both are important parts of what makes these new tools so powerful.

Vectors are groupings of numbers, plain and simple. Embeddings are a special kind of vector that certain LLMs have been built to convert text into, in a way that captures the “meaning” of the words. You can think of embeddings as something like a “language” that the LLM itself “speaks.” Unlike our language, where we generally want to be efficient in communication, embeddings capture the meaning of words by relating them to other words they are semantically linked with.

An embedded vector for a given text contains much more information than the original text itself, carrying a great deal of knowledge the LLM inferred from its training on massive sets of text. This makes embeddings useful for searching beyond keywords: the semantic relationships between words are captured, and vectors for similar topics will be similarly “close” in a way that can be calculated and searched against.
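
To make “close” concrete, here’s a toy illustration using the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned just below; the example sentences are made up:

```python
# Toy demonstration of semantic similarity (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Are a university foundation's donor records public?",
    "Gifts to the University of Louisville Foundation are subject to disclosure.",
    "The speed limit on interstate highways is 70 miles per hour.",
]
embeddings = model.encode(sentences)  # one 384-dimensional vector per sentence

# The two records-law sentences score much closer to each other than either
# does to the traffic sentence, despite sharing few keywords.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```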

And so, vector databases have exploded in the past few years as purpose-built tools for storing, manipulating, and querying these large objects. I’ve played around with a few available options, and am currently relying on the Chroma database for this project with the open source and locally running all-MiniLM-L6-v2 embeddings model.
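
For a feel of the workflow, here’s a minimal sketch with Chroma (the collection name and example data are illustrative); conveniently, Chroma’s default embedding function is that same all-MiniLM-L6-v2 model, so no separate embedding step is needed:

```python
# A minimal sketch of storing and querying annotations with Chroma
# (pip install chromadb). Collection name and data are illustrative.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data

collection = client.create_collection("kora_annotations")
collection.add(
    ids=["KRS-61.871-ann-0"],
    documents=[
        "Future donors to the University of Louisville Foundation are on notice "
        "that their gifts are subject to disclosure under KRS 61.871."
    ],
    metadatas=[{"statute": "KRS 61.871"}],
)

# The query text shares almost no keywords with the stored annotation,
# but the embeddings place them close together.
results = collection.query(
    query_texts=["Can donors to a public university stay anonymous?"],
    n_results=1,
)
print(results["documents"])
```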

 

Llama 2 and GPT

While OpenAI’s GPT models get most of the fanfare these days, there are a growing number of other players, and Meta’s Llama 2 has been hailed as the best open source model so far released by a large corporation. My first experiment with answering open records questions used this model, available to run for free on Kaggle.com, among other providers.

This initial experiment, using a technique referred to as Retrieval Augmented Generation, wasn’t without issues, including “hallucinations” and more bad statute citations than correct ones. But the ability of the chain as a whole to generally find relevant annotations and apply largely correct reasoning impressed both me and my co-director, Amye Bensenhaver, as something to look further into.
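
The retrieval half of that chain is simple enough to sketch. Here, `collection` is a vector store like the Chroma collection shown earlier, and `generate()` is a placeholder for whichever model completes the prompt (Llama 2 locally, or an API model); the prompt wording is illustrative:

```python
# A minimal Retrieval Augmented Generation loop: fetch the most relevant
# annotations, pack them into the prompt as grounding context, and ask the
# model to answer only from that context.

def generate(prompt: str) -> str:
    """Placeholder for the actual model call (Llama 2, GPT, etc.)."""
    raise NotImplementedError

def answer(collection, question: str, n_context: int = 5) -> str:
    hits = collection.query(query_texts=[question], n_results=n_context)
    context = "\n\n".join(hits["documents"][0])  # top matches for the question
    prompt = (
        "Answer the question using only the Kentucky open records annotations "
        "below. If they are not sufficient, say so.\n\n"
        f"Annotations:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```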

I ultimately decided to experiment with the more advanced models available through OpenAI’s paid API. Beginning with releases last November, OpenAI has been advertising new releases of its GPT-3.5 and GPT-4 model lines as having received additional “fine-tuning” training on tool use. These models can be given a list of described tools along with a contextual prompt and user input, and can choose when it is appropriate to use a given tool for a specific task. Research in this area has shown that LLMs can choose appropriate tool use about as accurately as humans can.

This kind of behavior can be referred to as a simple form of “Agent” behavior, where the AI chooses how many cycles of processing it needs (given some guardrails) to answer a question. During this time it can use tools to get more context, have intermediate reasoning, or ask the user for information, depending on its programming.

For the demo below, I defined four different "tools" that the AI agent can use to gather more context (a sketch of what these definitions look like in code follows the list):

  1. Look up the full text of a statute from the Kentucky Open Records Act.
  2. Look up the full text or summary of an exception from the Act, or one incorporated from state or federal law (e.g., FERPA or HIPAA).
  3. Perform a semantic search query against case law annotations.
  4. Search against Attorney General opinion summaries.
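
Here’s a hedged sketch of what one such tool definition looks like with OpenAI’s Python client; the function name, schema, and model choice are illustrative stand-ins, not the demo’s exact definitions:

```python
# A sketch of tool-calling with the OpenAI Python client (openai>=1.0).
# The tool name and schema below are hypothetical stand-ins for the
# demo's first tool, not its exact definitions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_statute_text",  # hypothetical name for tool #1
        "description": "Return the full text of a statute from the Kentucky Open Records Act.",
        "parameters": {
            "type": "object",
            "properties": {
                "statute": {"type": "string", "description": "Statute number, e.g. '61.872'"},
            },
            "required": ["statute"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Can I request a mayor's emails?"}],
    tools=tools,
)

# If the model decided it needs a lookup, the reply carries tool_calls instead
# of a final answer; the agent loop executes them, appends the results as
# "tool" messages, and calls the model again until it produces an answer.
print(response.choices[0].message.tool_calls)
```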

Overall, the results were much more impressive and accurate than in the earlier test. The newest GPT-3.5 model invoked tools and provided reasonably good answers with a fair amount of hedging, while GPT-4 performed thorough research and provided detailed, accurate responses to the questions provided. My own testing with GPT-4 was fairly limited, though, because the costs incurred by the long context used in these queries quickly added up on my bill.

 

A Note on Risk

There are notable risks and barriers to deploying tools of this kind for wider public use:

Cost. The best results are currently produced by the most expensive LLMs, although, as with any engineering task, there are opportunities for cost reductions over time. Providing this as a service would mean costs that grow with usage.

Hallucinations. LLMs are known to "hallucinate" or invent details when asked a question they are unsure about. Several techniques I have used here are designed to reduce hallucinations, specifically the grounding context and the ability of the Agent to query for more context. Hallucinations in a legal context can be dangerous, and extra care would be required before making such a tool widely available.

Annotation accuracy. During testing, the broad language used in certain annotations of the KRS led to questionable, possibly novel legal arguments in response to specific user questions. One annotation about public emails on private servers could be taken to imply that public servants would be required to turn over private devices for search, but the text of the Attorney General opinion being annotated is more nuanced. Such annotations lead to responses from the tool that are consistent with its inputs, but not necessarily with the views held by experts or the courts.

General accuracy. While the outputs have been impressive and largely accurate, there are major risks in deploying such a tool for unsupervised public use. This demo draws only on statute text and annotations of cases, rather than the full text of court decisions and Attorney General opinions. It in no way comes close to replicating the process of legal research.

Infrastructure. This demo was rapidly put together using tools intended for building proof-of-concept demos, and would require reengineering to deploy for public use.

 

Let’s Get to the Demo!

Clicking on an example question from the list will populate the final response in the middle panel of the screen and the tools the Agent used in generating the answer in the right panel.

The embedded demo here only really works on desktop. If you're on a phone, you can visit it on HuggingFace, but the list of examples is awkwardly shoved near the bottom of the page, and clicking an example updates the sections above the list.

 

A Call To Dream

LLMs and the other AI tools being developed represent a paradigm shift in the way we will approach computers in the years to come. To say they are a disruptive technology is likely a major understatement as the rollout of AI-powered tools progresses. The integration of AI technologies into the realm of open records law presents a significant opportunity for promoting democracy and transparency. By harnessing the power of AI, we can empower citizens to access government information more efficiently and effectively.

It is critical for the public to seize these tools as an opportunity to better face the challenges of tomorrow. It is easy to see the dangers around the corner that computational reasoning could bring, and we should watch for them and face them when the time comes. But it is equally, if not more, important that we dream of better ways to use the tools available to us, both to enrich our lives and to protect what is important.

If you’ve made it this far with me, I hope you’ll agree that making our government more transparent is a worthy goal and will spend time finding ways to help others in their own pursuit of knowledge.
