Textual data

Raw & pure textual data

Textual can help you with the following tasks.

Obtaining data

We can extract data from the following sources (a short sketch follows this list):

  • a URL (http or https)
  • a path to a plain text file (.txt, not .doc/.docx)
  • an interactive widget in the notebook that asks you to upload a file
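
A minimal sketch of the three entry points described above (the URL and file path here are placeholders):

textual = Textual.from_url("https://example.com/corpus.txt")
textual = Textual.from_path("./corpus.txt")
uploaded = Textual.from_upload()  # renders an upload widget in the notebook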

Preparing for model training

We can prepare for model training in the following ways:

  • Create PyTorch datasets and a data collator for the dataloader
  • Create a Hugging Face Trainer

Textual.create_datasets[source]

Textual.create_datasets(tokenizer, valid_ratio:float=0.2, train_path:str='./train_text.txt', val_path:str='./val_text.txt', block_size:int=128, mlm:bool=False)

Create PyTorch datasets and a collating function; a usage sketch follows the parameter list.

  • tokenizer: a Hugging Face tokenizer
  • valid_ratio: the fraction of the entire dataset held out for validation
  • train_path: the path where the training text file is saved
  • val_path: the path where the validation text file is saved
  • block_size: the maximum sequence length
  • mlm: if True, return a masked language modeling data collator (default: False)
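
A minimal usage sketch, assuming the text has already been loaded into a Textual object named textual and that the transformers library is installed; the tokenizer checkpoint and the unpacking of the return value are illustrative assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
# assumed return order: train dataset, validation dataset, collating function
train_ds, val_ds, collate_fn = textual.create_datasets(
    tokenizer, valid_ratio=0.2, block_size=128, mlm=False)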

Textual.create_train_val[source]

Textual.create_train_val(valid_ratio=0.2, train_path='./train_text.txt', val_path='./val_text.txt')

Create two files (example call below):

  • ./train_text.txt
  • ./val_text.txt
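
For example, assuming textual already holds the loaded text, the call below would split it 80/20 and write both files:

textual.create_train_val(valid_ratio=0.2,
                         train_path='./train_text.txt',
                         val_path='./val_text.txt')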

Textual.from_path[source]

Textual.from_path(path:Union[Path, str])

Load a Textual object from a system path
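
For example (the path is a placeholder):

textual = Textual.from_path("./my_corpus.txt")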

Textual.from_upload[source]

Textual.from_upload(path:Path=Path('uploaded_file.txt'))

Load a Textual object with an interactive upload button

Textual.from_url[source]

Textual.from_url(url:str)

Create a Textual object from a URL
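
For example (the URL is a placeholder):

textual = Textual.from_url("https://example.com/some_text.txt")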

Textual.get_hf_trainer[source]

Textual.get_hf_trainer(model, tokenizer, arguments=None, valid_ratio:float=0.2, train_path:str='./train_text.txt', val_path:str='./val_text.txt', block_size:int=128, mlm:bool=False)

Create a Hugging Face Trainer
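
A minimal sketch of wiring up a trainer, assuming textual already holds the loaded text; the model/tokenizer checkpoint is illustrative, and calling train() follows the standard transformers.Trainer API:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
trainer = textual.get_hf_trainer(model, tokenizer, block_size=128, mlm=False)
trainer.train()  # standard Hugging Face Trainer loop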

Textual.show_batch[source]

Textual.show_batch(tokenizer, bs:int=4)
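
Presumably this previews a small batch of tokenized data; a minimal sketch of the call, assuming a Hugging Face tokenizer is already loaded:

textual.show_batch(tokenizer, bs=4)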

Examples

Create an interactive upload widget

uploaded = Textual.from_upload()
textual = uploaded()
textual

Preview the textual data page by page

textual()

HFTextBlock

A TransformBlock designed for Hugging Face tokenizers

HFTextBlock[source]

HFTextBlock(tokenizer, **tk_kwargs)

Create a Hugging Face specialized fastai Block.

  • tokenizer: a PreTrainedTokenizer from Hugging Face
  • tk_kwargs: keyword arguments for the tokenizer's __call__
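
A sketch of constructing the block; the keyword arguments are ordinary tokenizer __call__ options chosen for illustration:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block = HFTextBlock(tokenizer, max_length=128, truncation=True, padding="max_length")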

Tests

# Test cases (imports may be duplicated here because the tests will live in a dedicated file)
from pathlib import Path
from typing import List

import pytest
from test_common.utils_4_tests import DATA_DIR
from test_utils import GITHUB_TEST_DATA_URL, check_connection_github
LOCAL_TEST_TXT = DATA_DIR / "to_download.txt"
GITHUB_TEST_TXT = f"{GITHUB_TEST_DATA_URL}/to_download.txt"

LOCAL_TEST_TXT_UTF8 = DATA_DIR / "Deep learning.txt"
GITHUB_TEST_TXT_UTF8 = f"{GITHUB_TEST_DATA_URL}/Deep%20learning.txt"


@pytest.fixture(scope="session")
def test_txt_content():
    return LOCAL_TEST_TXT.read_text()


@pytest.fixture(scope="session")
def local_textual(test_txt_content):
    return Textual(test_txt_content)


class Test_Textual:
    def test_init(self, local_textual, test_txt_content):
        """Test initialization of Textual from text"""
        expected_txt = test_txt_content.replace("\n", " ").replace("\r", "")
        assert local_textual.text == expected_txt

    def test_init_encoding(self):
        """Test initialization of Textual from text"""
        content = LOCAL_TEST_TXT_UTF8.read_text(encoding="utf-8")
        textual = Textual(content)
        assert textual.text == content.replace("\n", " ").replace("\r", "")

    def test_from_path(self, local_textual):
        """Test create Textual from path (existing)"""
        textual = Textual.from_path(LOCAL_TEST_TXT)
        assert textual.text == local_textual.text

    def test_from_path_error(self):
        """Test extract Textual of file that does not exist"""
        with pytest.raises(FileNotFoundError):
            Textual.from_path("does_not_exist.txt")

    @pytest.mark.github
    def test_from_url(self, check_connection_github, local_textual):
        """Test extract Textual from URL"""
        textual = Textual.from_url(GITHUB_TEST_TXT)
        assert textual.text == local_textual.text, f"URL text: {textual.text}"

    @pytest.mark.github
    def test_from_url_non_ascii(self, check_connection_github):
        """Test extract Textual from URL with non-ascii characters"""
        textual = Textual.from_url(GITHUB_TEST_TXT_UTF8)
        content = LOCAL_TEST_TXT_UTF8.read_text(encoding="utf-8")
        content = content.replace("\n", " ").replace("\r", "")
        assert textual.text == content, f"URL text: {textual.text}"