Electronic discovery (“eDiscovery”) is the process of identifying, gathering, and analyzing electronically stored information (ESI) for use in legal proceedings. As the volume of ESI grows, so does the need for sophisticated methods to navigate and interpret this vast information landscape. While Predictive Coding and Continuous Active Learning have been heralded as frontrunners in addressing these challenges, they come with constraints. Meanwhile, newer technology offers a promising alternative, potentially reshaping the eDiscovery paradigm.
Predictive Coding in eDiscovery
Overfitting in Predictive Coding
Overfitting is a common problem in machine learning and predictive modeling. It occurs when a model captures the noise or random fluctuations in the training data rather than the underlying distribution. When applied to predictive coding in eDiscovery, overfitting can have serious consequences, especially in the context of the document review process.
If the initial set of documents used to train the predictive coding model is too small or unrepresentative of the entire dataset, the model may learn patterns specific to that limited subset that do not generalize. Highly complex models with many parameters can fit the training data so closely that they capture even its noise. While such models might achieve high accuracy on the training set, they often perform poorly on new, unseen data.
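To make this concrete, here is a toy sketch in plain Python (the documents and labels are invented for illustration). The “model” simply memorizes the exact word sets of the relevant seed documents, which is the extreme case of fitting the training data: it scores perfectly on the seed set yet misses an obviously relevant document that is merely phrased differently.

```python
def features(text):
    # Represent a document as its set of lowercase words.
    return frozenset(text.lower().split())

# Hypothetical seed set: a handful of labeled documents.
train = [
    ("quarterly earnings restated after audit", True),
    ("please review the attached invoice", False),
    ("restated earnings figures for the quarter", True),
    ("team offsite next thursday", False),
]

# Deliberately overfit "model": memorize the exact word sets of the
# relevant training documents instead of learning a general pattern.
memorized = {features(text) for text, label in train if label}

def predict(text):
    return features(text) in memorized

# Perfect accuracy on the training set...
train_acc = sum(predict(text) == label for text, label in train) / len(train)

# ...but a clearly relevant document phrased differently is missed.
unseen = "audit forced the company to restate its earnings"
print(train_acc, predict(unseen))
```

A real predictive coding model overfits far less crudely than this, but the failure mode is the same: high accuracy on the training set paired with poor performance on unseen documents.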
Implications of Overfitting in the Document Review Process
Overfitted models can produce a high number of false positives or false negatives when classifying new documents. This can lead to relevant documents being overlooked or irrelevant documents being mistakenly flagged for review.
If a model falsely identifies too many documents as relevant, manual review costs rise. Conversely, if relevant documents are missed, the result can be overlooked evidence or compliance failures, along with the costs that accompany them.
Overfitting can lead to stakeholders losing confidence in the predictive coding process. If legal teams cannot trust the tool to provide accurate predictions, they might revert to manual review processes or be hesitant to use predictive coding in the future.
Loss of Context in Predictive Coding
Most machine learning models, including Support Vector Machines (SVMs) and Logistic Regression, require numerical input features. For text documents, common representations include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings.
These representations often fail to capture the order of words, phrases, or the broader context within which terms appear. For instance, BoW treats a document as an unordered set of words, entirely losing the sequence in which words appear.
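A minimal illustration of that loss of ordering, using a hand-rolled bag-of-words (the sentences are invented): two statements with opposite meanings produce identical representations.

```python
from collections import Counter

def bag_of_words(text):
    # A minimal bag-of-words: word counts, with all ordering discarded.
    return Counter(text.lower().split())

a = "plaintiff sued defendant"
b = "defendant sued plaintiff"

# Opposite meanings, identical representation.
print(bag_of_words(a) == bag_of_words(b))  # True
```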
Ambiguity and Polysemy
Many words have multiple meanings depending on their context, and it is easy to misinterpret a word’s meaning without considering the surrounding text. For instance, “file” can mean a document in one context or a tool for smoothing surfaces in another. If the broader context is not considered, the algorithm may misclassify documents based on such words.
Negations and Complex Sentences
Simple feature representations might not effectively capture negations or complex sentence structures. For example, the algorithm might treat the sentence “The document is not relevant to the case” and “The document is relevant to the case” similarly due to the significant overlap in terms, even though their meanings are opposite.
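The overlap is easy to quantify. This sketch builds bag-of-words count vectors for the two example sentences and computes their cosine similarity (a standard measure of vector closeness): despite having opposite meanings, the sentences come out roughly 95% similar.

```python
import math
from collections import Counter

def norm(counts):
    # Euclidean length of a bag-of-words count vector.
    return math.sqrt(sum(n * n for n in counts.values()))

def cosine(u, v):
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(u[w] * v[w] for w in u)
    return dot / (norm(u) * norm(v))

a = Counter("the document is not relevant to the case".split())
b = Counter("the document is relevant to the case".split())

print(round(cosine(a, b), 2))  # 0.95
```

Under this representation, a classifier would place the two sentences almost on top of each other, even though only one of them asserts relevance.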
Loss of Inter-document Context
Sometimes, the relevance or meaning of a document can be inferred from its relationship with other documents. SVM and Logistic Regression typically treat each document as an independent entity, missing out on the potential context provided by related documents.
Biased Training Data
Biased training data is one of the most significant challenges in machine learning, and its implications are particularly pronounced in document review for eDiscovery. The review’s accuracy, fairness, and impartiality are paramount, and any explicit or implicit bias can have profound legal and ethical implications.
Selection bias arises when the subset of documents used to train the machine learning model does not represent the entire collection. If only certain types of documents (e.g., emails but not memos), or documents from specific time periods or custodians, are selected, the model learns to recognize only the patterns present in that slice and misses others. A model trained with selection bias may overlook critical evidence or disproportionately flag irrelevant documents, leading to inefficient reviews.
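The sketch below illustrates selection bias with an invented corpus: the same transaction is discussed informally in emails and formally in memos, but the training sample contains only emails. A deliberately simple model, one that treats words seen only in relevant training documents as relevance signals, then misses every relevant memo.

```python
def tokens(text):
    return text.lower().split()

# Hypothetical corpus: emails use informal vocabulary, memos formal.
train = [  # email-only training sample (selection bias)
    ("re: deal update, wire the funds today", True),
    ("lunch on friday? the cafe is open", False),
    ("deal docs attached, sign and return", True),
    ("the office is closed monday and tuesday", False),
]
memos = [
    ("memorandum regarding the proposed acquisition and escrow of consideration", True),
    ("memorandum on parking policy", False),
]

# Toy "model": a word is a relevance signal if it appears only in
# relevant training documents, never in irrelevant ones.
relevant_vocab, irrelevant_vocab = set(), set()
for text, label in train:
    (relevant_vocab if label else irrelevant_vocab).update(tokens(text))
signals = relevant_vocab - irrelevant_vocab

def predict(text):
    return len(signals & set(tokens(text))) > 0

# The email-trained signals ("deal", "wire", ...) never occur in memos,
# so every relevant memo becomes a false negative.
missed = [text for text, label in memos if label and not predict(text)]
print(missed)
```

A production classifier uses richer features than bare word overlap, but the underlying problem carries over: a model can only recognize patterns that its training sample actually contains.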
Label bias occurs when the annotations (labels) given to documents during the training phase are influenced by the reviewers’ preconceptions, beliefs, or other external factors. If a reviewer consistently labels documents from a particular source or author as relevant (or irrelevant) because of an inherent belief or external influence, the model will inherit that bias. This skews results, causing the model to over-represent or under-represent the importance of specific documents.
As we conclude the first part of our exploration into eDiscovery, we’ve seen how Predictive Coding—despite its advancements—encounters significant hurdles such as overfitting, loss of context, and training data biases. These challenges impact the accuracy and efficiency of the review process and pose critical questions about the future of eDiscovery technology. Moving forward, we focus on Continuous Active Learning (CAL), a promising evolution in the field that aims to address some of these limitations while presenting challenges of its own.