Uploading data to a NER Project
Accessing the upload screen
Once a Project has been created and opened, Upload can be accessed using the sidebar.
Currently supported data formats
- Plain text
- JSONL
- IOB-style data
  - IOB(1)
  - IOB2
  - IOBES
  - CONLL2003
  - BILOU
- TSV
Uploading data
To upload data, either click the "drag and drop files here" section or drag and drop files into that UI element.
If uploading records in plain text, the UI supports adding the plain text content directly as a record. Paste the content into the section highlighted below and the content will be added as a record.
Duplicate file uploads are checked to prevent inconsistencies.
Form fields
Data format
The format of the uploaded data. See the currently supported data formats listed above.
Upload name
The name of the current upload.
An upload name is generated if one is not provided.
Tags
Tags that can be used to identify the upload.
Tagging each set of uploaded records appropriately helps identify which data sources cause changes in results when comparing training runs.
Is the data in Acharya format?
Select this option if the data being uploaded is in the default Acharya JSONL format (see the default JSON map configuration).
Text (txt)
This is a plain text upload. All uploaded content is treated as record data that is not associated with a particular data format.
IOB style data
Formats such as IOB, IOB2, IOBES, CONLL2003 and BILOU are actively supported
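To make the relationship between tag-per-token formats and labelled spans concrete, the following sketch converts IOB2 tags into `[start, end, label]` triples of the kind used for EntityLabels in the JSONL format. It is an illustration only: `iob2_to_entities` is a hypothetical helper, tokens are assumed to be joined by single spaces, and this is not a description of how Acharya itself parses IOB-style uploads.

```python
def iob2_to_entities(tokens, tags):
    """Join tokens with single spaces and return (text, entities).

    Each entity is [start, end, label] using zero-based character
    offsets with an exclusive end index.
    """
    entities = []
    offset = 0      # character offset of the current token in the joined text
    current = None  # entity currently being built, or None

    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current:
                entities.append(current)
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            # An I- tag with the same label extends the open entity.
            current[1] = end
        else:
            # "O" (or a stray I- tag, which is invalid IOB2) closes any open entity.
            if current:
                entities.append(current)
            current = None
        offset = end + 1  # skip the joining space

    if current:
        entities.append(current)
    return " ".join(tokens), entities


tokens = ["Acharya", "is", "a", "data", "centric", "MLOps", "tool"]
tags = ["B-Name", "O", "O", "O", "O", "B-Operation", "O"]
text, entities = iob2_to_entities(tokens, tags)
print(text)
print(entities)
```

The other IOB-style variants differ mainly in which prefixes mark the start, inside and end of an entity, so a similar loop with different prefix rules would apply.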
JSONL
JSON Lines is supported: each line contains one JSON object whose properties map to the fields below.
Each JSON Key identifies a field; the JSON map names the property in the record that holds it.
(Fields marked * are required.)
JSON Key | Type | Description |
---|---|---|
Data * | string | The actual training data |
EntityLabels | [][number, number, string] | List of entity labels, each with a start index, end index and label |
Key | string | The record key |
Completed | number | Record status: pending = 0, train = 1, test = 2 |
Prev | string | Key of the previous record |
Next | string | Key of the next record |
For EntityLabels, the end index is exclusive.
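An exclusive end index matches Python slice semantics, so the text covered by a label can be recovered with a plain slice. The sketch below assumes zero-based character offsets; `labelled_spans` is a hypothetical helper, not part of Acharya.

```python
def labelled_spans(text, entity_labels):
    """Return (label, substring) pairs for [start, end, label] triples.

    Assumes zero-based character offsets with an exclusive end index.
    """
    return [(label, text[start:end]) for start, end, label in entity_labels]


text = "Acharya is a data centric MLOps tool"
entity_labels = [[0, 7, "Name"], [26, 31, "Operation"]]

for label, span in labelled_spans(text, entity_labels):
    print(f"{label}: {span!r}")
```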
For example, consider the following JSONL records:
{"details":"Welcome to Acharya","entities":[[11,18,"Name"]]}
{"details":"Acharya is a data centric MLOps tool","entities":[[0,7,"Name"],[26,31,"Operation"]]}
The corresponding JSON map is:
{
"Data": "details",
"EntityLabels": "entities"
}
Fields that are not provided in the JSON map fall back to the default values.
Default JSON map configuration
{
"Data": "data",
"EntityLabels": "meta_data",
"Key": "key",
"Completed": "completed",
"Prev": "prev",
"Next": "next"
}
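One way to picture how a partial JSON map combines with the defaults is a dictionary merge in which user-supplied entries take precedence. This is only an illustrative sketch: `resolve_json_map` and `read_record` are hypothetical names, and the actual merge logic inside Acharya may differ.

```python
import json

# Default JSON map, copied from the configuration shown above.
DEFAULT_JSON_MAP = {
    "Data": "data",
    "EntityLabels": "meta_data",
    "Key": "key",
    "Completed": "completed",
    "Prev": "prev",
    "Next": "next",
}


def resolve_json_map(user_map):
    """Fill in any field missing from user_map with its default property name."""
    return {**DEFAULT_JSON_MAP, **user_map}


def read_record(line, json_map):
    """Parse one JSONL line and return the fields it contains, keyed by JSON Key."""
    raw = json.loads(line)
    return {field: raw[prop] for field, prop in json_map.items() if prop in raw}


json_map = resolve_json_map({"Data": "details", "EntityLabels": "entities"})
line = (
    '{"details": "Acharya is a data centric MLOps tool", '
    '"entities": [[0, 7, "Name"], [26, 31, "Operation"]]}'
)
record = read_record(line, json_map)
print(json_map)
print(record)
```

Here only `Data` and `EntityLabels` are remapped; `Key`, `Completed`, `Prev` and `Next` keep their default property names.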
Mark all the records in this upload
There are three options:
- As Pending
- For Test/Evaluation
- For Training
As Pending
As Pending marks all records in the upload as pending (i.e., awaiting action). Records marked as pending are not part of any training or evaluation.
For Test/Evaluation
For Test/Evaluation marks all records in the upload for testing or evaluation only; they are not used for training.
For Training
For Training marks all records in the upload for use in training.
It is recommended to test your files before uploading.
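A local check for JSONL files could look like the following sketch. It is not an Acharya tool: `validate_jsonl_lines` is a hypothetical function, its default property names are taken from the default JSON map, and the rules it enforces (required string data, zero-based spans that lie inside the text, a status of 0, 1 or 2) are assumptions drawn from the field descriptions above.

```python
import json


def validate_jsonl_lines(lines, data_key="data", labels_key="meta_data",
                         completed_key="completed"):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue

        # The data field is the only required one.
        text = record.get(data_key)
        if not isinstance(text, str):
            problems.append(f"line {number}: missing or non-string '{data_key}'")
            continue

        labels = record.get(labels_key, [])
        if not isinstance(labels, list):
            problems.append(f"line {number}: '{labels_key}' is not a list")
            labels = []
        for label in labels:
            # Assumed rule: [start, end, label] with 0 <= start < end <= len(text).
            valid = (
                isinstance(label, list) and len(label) == 3
                and isinstance(label[0], int) and isinstance(label[1], int)
                and isinstance(label[2], str)
                and 0 <= label[0] < label[1] <= len(text)
            )
            if not valid:
                problems.append(f"line {number}: bad entity label {label!r}")

        if record.get(completed_key, 0) not in (0, 1, 2):
            problems.append(f"line {number}: '{completed_key}' must be 0, 1 or 2")
    return problems


sample = [
    '{"data": "Welcome to Acharya", "meta_data": [[11, 18, "Name"]], "completed": 1}',
    '{"data": "Welcome to Acharya", "meta_data": [[11, 40, "Name"]]}',
    '{"data": ',
]
for problem in validate_jsonl_lines(sample):
    print(problem)
```

When a custom JSON map is used, the three `*_key` arguments would be set to the property names from that map instead of the defaults.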