Decoding the Past: The Challenges of Predictive Coding in eDiscovery

Electronic discovery (“eDiscovery”) is the process of identifying, gathering, and analyzing electronically stored information (ESI) for use in legal proceedings. As the volume of ESI grows, so does the need for sophisticated methods to navigate and interpret it. Predictive coding and Continuous Active Learning have been heralded as frontrunners in addressing this challenge, but both come with constraints. Meanwhile, newer technology offers a promising alternative that could reshape the eDiscovery paradigm.

Predictive Coding in eDiscovery

Predictive coding is a technology-assisted review (TAR) method in which reviewers code a subset of documents, and a machine learning model trained on that subset then predicts the relevance of the remaining documents.
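To make the workflow concrete, here is a minimal sketch in plain Python. It is not how any particular review platform works: the smoothed log-odds scoring, the function names, and the sample documents are all assumptions chosen for illustration. The shape is the same as in practice, though: code a seed set, fit a model, and rank the unreviewed documents by predicted relevance.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real systems do far more normalization.
    return text.lower().split()

def train(labeled_docs):
    """Fit per-term log-odds weights from (text, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    for text, is_relevant in labeled_docs:
        counts[is_relevant].update(tokenize(text))
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    weights = {}
    for term in vocab:
        # Add-one smoothing keeps unseen terms from producing log(0).
        p_rel = (counts[True][term] + 1) / (totals[True] + len(vocab))
        p_irr = (counts[False][term] + 1) / (totals[False] + len(vocab))
        weights[term] = math.log(p_rel / p_irr)
    return weights

def score(weights, text):
    # Terms the model never saw contribute nothing to the score.
    return sum(weights.get(tok, 0.0) for tok in tokenize(text))

# Hypothetical seed set coded by human reviewers.
seed_set = [
    ("pricing agreement with competitor", True),
    ("discuss competitor pricing before the bid", True),
    ("lunch menu for friday", False),
    ("office parking reminder", False),
]
unreviewed = [
    "friday parking schedule",
    "competitor bid pricing call",
]

weights = train(seed_set)
ranked = sorted(unreviewed, key=lambda doc: score(weights, doc), reverse=True)
for doc in ranked:
    print(f"{score(weights, doc):+.2f}  {doc}")
```

The document that shares vocabulary with the relevant seed documents rises to the top of the ranking. Everything the model knows comes from the four coded examples, which is why the quality of the seed set matters so much in the sections that follow.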

Overfitting in Predictive Coding

Overfitting is a common problem in machine learning and predictive modeling. It occurs when a model captures the noise or random fluctuations in the training data rather than the underlying distribution. In predictive coding for eDiscovery, overfitting can have serious consequences for the document review process.

If the initial set of documents used to train the predictive coding model is too small or not representative of the entire dataset, the model might learn patterns specific to that subset, and those patterns will not generalize well. Highly complex models with many parameters can closely fit the training data, capturing even its noise. While these models might achieve high accuracy on the training set, they often perform poorly on new, unseen data.
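The gap between training accuracy and accuracy on unseen documents can be shown with a deliberately naive model. The sketch below is a toy, not a real predictive coding algorithm: the "memorizer" and the sample documents are assumptions made for illustration. It treats every term seen only in relevant training documents as a relevance signal, including noise such as an email sign-off.

```python
def tokenize(text):
    return set(text.lower().split())

def fit_memorizer(docs):
    """Keep every term that appears only in relevant training documents."""
    relevant_terms, irrelevant_terms = set(), set()
    for text, is_relevant in docs:
        target = relevant_terms if is_relevant else irrelevant_terms
        target.update(tokenize(text))
    return relevant_terms - irrelevant_terms

def predict(model, text):
    # Flag a document as relevant if it contains any memorized term.
    return bool(tokenize(text) & model)

def accuracy(model, docs):
    correct = sum(predict(model, text) == label for text, label in docs)
    return correct / len(docs)

# Tiny, unrepresentative training set: both relevant emails share a sign-off.
train_docs = [
    ("rebate terms attached thanks bob", True),
    ("revised rebate schedule thanks bob", True),
    ("team lunch on friday", False),
    ("printer is broken again", False),
]
held_out_docs = [
    ("rebate proposal for q3", True),
    ("parking passes available thanks bob", False),  # sign-off noise: false positive
    ("confidential discount side letter", True),     # unseen vocabulary: false negative
    ("friday lunch moved to noon", False),
]

model = fit_memorizer(train_docs)
print("training accuracy:", accuracy(model, train_docs))
print("held-out accuracy:", accuracy(model, held_out_docs))
```

The model scores perfectly on the documents it was trained on and gets only half of the held-out documents right. Comparing the two numbers on a held-out sample is a common way to spot overfitting before the model is applied to the full collection.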

Implications of Overfitting in the Document Review Process

Overfitted models can produce a high number of false positives or false negatives when classifying new documents. This can lead to relevant documents being overlooked or irrelevant documents being mistakenly flagged for review.

If a model flags too many irrelevant documents as relevant, manual review costs increase. Conversely, if relevant documents are missed, the result may be missed evidence or compliance issues, each with costs and other negative consequences of its own.
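These two kinds of error are usually summarized as precision (how many flagged documents are truly relevant) and recall (how many truly relevant documents were flagged). The sketch below computes both from a validation sample; the function name and the sample labels are hypothetical.

```python
def review_metrics(predicted, actual):
    """Compare model predictions with reviewer decisions on the same documents."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)  # flagged but irrelevant
    fn = sum(1 for p, a in pairs if not p and a)  # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# Hypothetical validation sample of six documents.
predicted = [True, True, True, False, False, False]
actual    = [True, False, False, True, False, False]

metrics = review_metrics(predicted, actual)
print(metrics)
```

Low precision means reviewers spend time on irrelevant documents; low recall means relevant documents never reach a reviewer. Which one matters more depends on the matter, so both are worth tracking.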

Overfitting can lead to stakeholders losing confidence in the predictive coding process. If legal teams cannot trust the tool to provide accurate predictions, they might revert to manual review processes or be hesitant to use predictive coding in the future.

Loss of Context in Predictive Coding

Feature Representation

Most machine learning models, including support vector machines (SVMs) and logistic regression, require numerical input features. For text documents, common representations include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings.

These representations often fail to capture the order of words, phrases, or the broader context within which terms appear. For instance, BoW treats a document as an unordered collection of word counts, entirely losing the sequence in which words appear.
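A short example shows how much a Bag-of-Words representation discards. The two sentences below are made up for illustration; they describe opposite events, yet their word counts are identical.

```python
from collections import Counter

def bag_of_words(text):
    # Count each lowercase token; word order is not recorded anywhere.
    return Counter(text.lower().split())

sentence_a = "the supplier sued the distributor"
sentence_b = "the distributor sued the supplier"

same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(bag_of_words(sentence_a))
print("identical representations:", same_bag)
```

A classifier that sees only these counts has no way to tell who sued whom.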

Ambiguity and Polysemy

Many words have multiple meanings depending on their context. It’s easy to misinterpret a word’s intent without considering the surrounding text. For instance, “file” can mean a document in one context or a tool to smooth surfaces in another. The algorithm might misclassify documents based on such words if the broader context is not considered.

Negations and Complex Sentences

Simple feature representations might not effectively capture negations or complex sentence structures. For example, the algorithm might treat the sentences “The document is not relevant to the case” and “The document is relevant to the case” similarly because their terms overlap almost entirely, even though their meanings are opposite.
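The overlap can be quantified. The sketch below computes the cosine similarity between Bag-of-Words vectors for the two example sentences; the helper names are assumptions, and real systems typically apply TF-IDF weighting as well, which changes the exact number but not the overall picture.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # A Counter returns 0 for terms it does not contain.
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

negated = bag_of_words("The document is not relevant to the case")
plain = bag_of_words("The document is relevant to the case")

similarity = cosine_similarity(negated, plain)
print(f"cosine similarity: {similarity:.3f}")
```

The similarity comes out at roughly 0.95 on a scale where 1.0 means identical. The single word that reverses the meaning barely moves the vector.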

Loss of Inter-document Context

Sometimes, the relevance or meaning of a document can be inferred from its relationship with other documents. SVM and Logistic Regression typically treat each document as an independent entity, missing out on the potential context provided by related documents.

Biased Training Data

Biased training data is one of the most significant challenges in machine learning, and its implications are particularly pronounced in document review for eDiscovery. The review’s accuracy, fairness, and impartiality are paramount in this process. Any explicit or implicit biases can have profound legal and ethical implications.

Selection Bias

Selection bias arises when the subset of documents used to train the machine learning model does not represent the entire set of documents. If only certain types of documents (e.g., emails but not memos) or documents from specific time periods or custodians are selected, the model might be trained to recognize only specific patterns and miss others. A model trained with selection bias may overlook critical evidence or disproportionately flag irrelevant documents, leading to inefficient reviews.
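One common way to reduce selection bias is to draw the training sample from every document type, custodian, or time period in proportion to its size. The sketch below is a minimal illustration of that idea (stratified sampling); the function name, the `doc_type` field, and the collection itself are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(docs, key, fraction, seed=0):
    """Sample the same fraction from every group so no group is left out.

    Assumes 0 < fraction <= 1. Each group contributes at least one document.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc)
    sample = []
    for members in groups.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical collection: 90 emails and 10 memos.
collection = (
    [{"id": i, "doc_type": "email"} for i in range(90)]
    + [{"id": 90 + i, "doc_type": "memo"} for i in range(10)]
)

training_sample = stratified_sample(collection, key=lambda d: d["doc_type"], fraction=0.1)
sample_counts = Counter(doc["doc_type"] for doc in training_sample)
print(sample_counts)
```

A simple random draw of ten documents from this collection could easily contain no memos at all. Stratifying guarantees that each group is represented, although it only helps for the attributes you remember to stratify on.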

Label Bias

Label bias occurs when the annotations (labels) given to documents during the training phase are influenced by the reviewers’ preconceptions, beliefs, or other external factors. If a reviewer consistently labels documents from a particular source or author as relevant (or irrelevant) because of such a belief or influence, the model will inherit that bias.

This can lead to skewed results, where the model might over-represent or under-represent the importance of specific documents.
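A simple audit can surface patterns worth a second look. The sketch below compares how often each reviewer marks documents as relevant; the field names and the labels are made up for illustration, and a skewed rate is a prompt for follow-up rather than proof of bias, since reviewers may simply have been assigned different material.

```python
from collections import defaultdict

def relevance_rate_by(labels, field):
    """Share of documents marked relevant, grouped by the given field."""
    totals = defaultdict(int)
    relevant = defaultdict(int)
    for row in labels:
        totals[row[field]] += 1
        relevant[row[field]] += int(row["relevant"])
    return {group: relevant[group] / totals[group] for group in totals}

# Hypothetical coding decisions on documents from the same author.
labels = [
    {"reviewer": "reviewer_1", "author": "j.smith", "relevant": True},
    {"reviewer": "reviewer_1", "author": "j.smith", "relevant": True},
    {"reviewer": "reviewer_1", "author": "j.smith", "relevant": True},
    {"reviewer": "reviewer_1", "author": "j.smith", "relevant": True},
    {"reviewer": "reviewer_2", "author": "j.smith", "relevant": True},
    {"reviewer": "reviewer_2", "author": "j.smith", "relevant": False},
    {"reviewer": "reviewer_2", "author": "j.smith", "relevant": False},
    {"reviewer": "reviewer_2", "author": "j.smith", "relevant": False},
]

rates = relevance_rate_by(labels, "reviewer")
print(rates)
```

Here one reviewer marks every document from this author as relevant while the other marks a quarter of them. Having a second reviewer code an overlapping sample is one way to find out whether the difference reflects the documents or the reviewer.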

Pre-Processing Bias

Bias can be introduced during the pre-processing stage, especially when deciding what data to include or exclude and how to transform it. If specific keywords, phrases, or sections of documents are consistently removed or given undue weight during pre-processing, the model may develop a skewed understanding of the data. This might result in the model overlooking key evidentiary documents or over-prioritizing less relevant ones.
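Stop-word removal is a familiar example of a pre-processing choice with side effects. The stop list below is hypothetical, but many general-purpose English stop lists do include “not”. Once it is removed, two sentences with opposite meanings become indistinguishable.

```python
# Hypothetical stop list for illustration; note that it includes "not".
STOP_WORDS = {"the", "is", "to", "not"}

def preprocess(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

negated = preprocess("The document is not relevant to the case")
plain = preprocess("The document is relevant to the case")

print(negated)
print(plain)
print("identical after pre-processing:", negated == plain)
```

The model never sees the negation, so no amount of training can recover it. Reviewing pre-processing rules against a handful of known-important documents is a cheap way to catch this kind of loss early.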

As we conclude the first part of our exploration into eDiscovery, we’ve seen how predictive coding, despite its advancements, encounters significant hurdles such as overfitting, loss of context, and training data biases. These challenges affect the accuracy and efficiency of the review process and pose critical questions about the future of eDiscovery technology. In the next part, we turn to Continuous Active Learning (CAL), a promising evolution in the field that aims to address some of these limitations while presenting unique challenges of its own.

About the Author
VASUDEVA MAHAVISHNU

Vasudeva is the CTO at Altumatim. Vasu brings his natural curiosity and passion for using technology to improve access to justice and our quality of life to the Altumatim team as he architects and builds out the future of discovery. Vasu blends computer science and data science expertise from computational genomics with published work ranging from gene mapping to developing probabilistic models for protein interactions in humans.
