AI (SD) - Illusion Diffusion HQ, "Digital files"

The coalition is excited to announce the publication of several bulk datasets for research and public access to primary legal resources from the Kentucky Attorney General. This is a first step in a project to make these resources available on our website.

The published datasets include:

  1. AG Opinions & Decisions, Lexis 1977-2022
  2. Scraped AG Website Opinions & Decisions with Attachments, 1992-present
  3. Supplementary AG Records, 1976-present (mostly TIFF scans of older records)


AG Opinions & Decisions, Lexis

The first dataset is the most comprehensive and accessible. These records are sourced from Lexis and republished on archive.org by the nonprofit organization Public.Resource.Org. A multistep process turned these raw publications into a spreadsheet of decisions that can be opened in Excel or Google Sheets.

The Lexis records are very large files, each combining multiple years' worth of opinions in a single RTF/ODF file. Our first step in making this data easier to access was to split each large file into one file per opinion/decision. For anyone interested in replicating or following along, that process is documented in this Python notebook.
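As a rough illustration of the splitting step, here is a minimal sketch. It is not the code from the notebook. It assumes the RTF/ODF export has already been converted to plain text and that every opinion starts with a line holding only its AG citation (for example "OAG 20-04" or "21-ORD-022"). Both are assumptions, and the real Lexis files may mark document boundaries differently. The names `split_opinions` and `write_opinions` are hypothetical.

```python
import re
from pathlib import Path

# ASSUMPTION: an opinion begins with a line that contains only its AG citation.
# The actual Lexis exports may use a different boundary marker.
BOUNDARY = re.compile(
    r"^(?:OAG \d{2}-\d{1,3}|\d{2}-(?:ORD|OMD)-\d{1,3})[ \t]*$",
    re.MULTILINE,
)


def split_opinions(text: str) -> list[tuple[str, str]]:
    """Return (citation, body) pairs, one per opinion, in file order."""
    starts = list(BOUNDARY.finditer(text))
    opinions = []
    for i, match in enumerate(starts):
        # Each opinion runs until the next boundary line (or end of file).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        opinions.append((match.group().strip(), text[match.start():end].strip()))
    return opinions


def write_opinions(opinions: list[tuple[str, str]], out_dir: Path) -> None:
    """Write one text file per opinion; the index keeps repeated citations apart."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, (citation, body) in enumerate(opinions, start=1):
        name = f"{index:05d}_{citation.replace(' ', '_')}.txt"
        (out_dir / name).write_text(body, encoding="utf-8")
```

Returning a list rather than a dictionary is deliberate: the same citation can appear more than once in the source, and a dictionary keyed on citation would silently drop the earlier copies.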

After getting the records into individual files, I wrote another notebook to extract as much useful information from the text files as possible before turning to AI or other more advanced or expensive techniques. The fields we extracted from the raw text include:

  1. The AG Citation, e.g. OAG 20-04 or 21-ORD-022
  2. Lexis Citation(s) associated with the opinion.
    Note: I found in this process that there are some opinions that were republished by Lexis under multiple citation numbers. This was unexpected and something I’m still looking into.
  3. Publication date
  4. Full HTML and Decision HTML
  5. Full Text (stripped of HTML) and Decision Text
  6. Decision HTML Array - When the same AG citation was found more than once, all variants of the decision are preserved
  7. Citations for other AG records from the body text, for relation tracking
  8. Citations for court decisions cited, parsed using the eyecite library.
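Two of the fields above, the AG citations found in the body text and the array of decision variants, can be sketched with the standard library alone. This is an illustrative sketch, not the notebook's code. The regular expression is generalized from the two example formats given above ("OAG 20-04" and "21-ORD-022") plus the OMD record type, and it may miss older or irregular citation styles. The function names are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: AG citations look like "OAG 20-04", "21-ORD-022", or "21-OMD-022".
AG_CITATION = re.compile(r"\b(?:OAG\s+\d{2}-\d{1,3}|\d{2}-(?:ORD|OMD)-\d{1,3})\b")


def find_ag_citations(text: str, own_citation: str = "") -> list[str]:
    """Return unique AG citations in order of first appearance.

    The record's own citation is skipped so it is not listed as citing itself.
    """
    found = []
    for match in AG_CITATION.finditer(text):
        cite = re.sub(r"\s+", " ", match.group())  # collapse line-wrapped spaces
        if cite != own_citation and cite not in found:
            found.append(cite)
    return found


def group_variants(records):
    """Group (citation, decision_html) pairs, keeping every distinct variant."""
    grouped = defaultdict(list)
    for citation, html in records:
        if html not in grouped[citation]:
            grouped[citation].append(html)
    return dict(grouped)
```

For example, `find_ag_citations("See 19-ORD-101 and OAG 20-04.", own_citation="21-ORD-022")` returns `["19-ORD-101", "OAG 20-04"]`.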

 

Scraped and Supplemental Records

While the primary dataset is the most comprehensive, it does not currently include 2023/24 AG records or some metadata that is available on the AG website; OAGs have summary information available and ORDs/OMDs have party metadata that Lexis does not consistently republish.

I requested these records in bulk from the Attorney General’s office, along with the scans of older opinions they maintain. The AG’s office was very helpful in providing all of the older scans but explained that decades of administration changes, as new AGs were elected, have resulted in a fractured internal landscape of records. I never got a clear picture of it, but there seems to be an internal pipeline that, after an opinion is finalized, formats the record, makes the web-accessible copy, and publishes it online. The AG staff seem to be involved only up to a point, after which the website and its contents are controlled more by technologists than by assistant attorneys general.

Could someone complain about this and insist on more? Maybe. But I am grateful to the AG staff for the considerable effort they put into providing records and explaining how they understand the web versions (as not totally official), and I understand the challenges that all organizations face in historical record keeping. Technology has changed so much in the last 50 years, and all organizations understandably prioritize the work in front of them over making sure their 30-year-old files are sorted properly.

So, with the implicit approval of the AG’s records custodian, I wrote a scraper to download all of the tabular data and attachments from the AG website covering all OAGs, ORDs, and OMDs it stores. This is published in a separate dataset both to distinguish its source and because there are some subtle differences between the collections. All records that the AG’s office provided in response to my original request are published as a supplemental dataset for historical purposes.
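The parsing half of such a scraper can be sketched with the standard library. This is only an illustration under stated assumptions, not the scraper described above: the structure of the AG website is not documented here, so the sketch assumes listings are plain HTML tables and that attachments are links ending in a few common file suffixes. The class name, function name, and suffix list are all hypothetical, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# ASSUMPTION: attachments are linked files with one of these suffixes.
ATTACHMENT_SUFFIXES = (".pdf", ".doc", ".docx")


class ListingParser(HTMLParser):
    """Collect table rows (as lists of cell text) and attachment links."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.links = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(ATTACHMENT_SUFFIXES):
                self.links.append(href)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None


def parse_listing(html: str, base_url: str):
    """Return (rows, absolute attachment URLs) for one listing page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.rows, [urljoin(base_url, link) for link in parser.links]
```

Keeping the parser separate from the download step makes it easy to test against saved pages and to rerun without hitting the website again.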

 

Moving Forward

The coalition’s website will soon include tools to search through all available Attorney General Opinions and Decisions along with a curated selection of other resources relevant to government transparency in Kentucky.

Apart from just getting the records online, I have conducted a few exciting experiments using large language models like GPT-4 and Mistral to enrich these data in several ways:

Extracting party data - The AG’s website lists party information in its tables of open records and open meetings decisions, but this data is not available for opinions from before 1993.

Summarizing Opinions - LLMs excel at summarizing; however, legal texts can lead summarization models into some known traps, like paying too much attention to footnotes or lengthy segues.

Extracting Shepardization Data - An important part of any form of legal research is determining whether a record you want to cite has been overturned on appeal, a process attorneys refer to as “Shepardization.” There is no available resource comparable to “Shepard’s Citations” for KYAG decisions on open records and meetings, and I am very interested in using LLMs to extract this kind of information for building a list of precedent in the AG office that has not been overruled internally. 

This is the area where I have spent the most time experimenting, with mixed results from small, open-source models but fairly exceptional results from GPT-3.5 and above. I published a page demonstrating all 2020 open records decisions that were summarized and had citation treatment described by GPT-3.5-turbo from OpenAI. The cost to generate this data was $0.95.
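Once citation treatment has been extracted, turning it into a list of precedent that has not been overruled is a small filtering step. The sketch below shows one possible shape for that data. The `Treatment` record, its field names, and the set of labels counted as negative are all hypothetical, and the citations in the usage note are invented placeholders rather than real decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    citing: str     # the later decision that discusses an earlier one
    cited: str      # the earlier decision being discussed
    treatment: str  # label produced by the model, e.g. "followed" or "overruled"


# ASSUMPTION: these labels mean the cited decision is no longer good precedent.
NEGATIVE_LABELS = {"overruled", "withdrawn"}


def still_good_precedent(all_citations, treatments):
    """Return the sorted citations that no later decision has treated negatively."""
    overruled = {
        t.cited for t in treatments if t.treatment.lower() in NEGATIVE_LABELS
    }
    return sorted(c for c in set(all_citations) if c not in overruled)
```

With placeholder data, `still_good_precedent(["A", "B"], [Treatment("B", "A", "overruled")])` returns `["B"]`. Because the labels come from a language model, a list produced this way would still need human review before anyone relies on it.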

We aim to provide the best possible resources for understanding our state’s open records and open meetings laws as a free, public resource. If you have any suggestions for improvement, please get in touch!
