
Supported file formats

Graphext allows you to upload datasets in various file formats. For Graphext to know which importer to use, the filename's extension must match the kind of data the file contains. The following formats are supported; see below for further details on what we expect in each case.

  • CSV (.csv, .tsv)
    Comma-separated values. Tries to automatically detect the actual separator (comma, semi-colon, tab, etc.) as well as the file encoding (we recommend UTF-8).
  • JSON (.json)
    Automatically detects 3 supported JSON formats (JSON Lines, list-of-records, object-of-columns).
  • Excel (.xlsx)
    Data needs to be on the first sheet and must represent a simple table (no initial headers etc.).
  • SPSS (.sav)
    Data will be imported with column names, types, labels etc. as specified in the file.
  • GML (.gml)
    The Graph Modelling Language format lets you import data already representing a graph/network.
  • Arrow (.arr)
    Binary Apache Arrow files written in streaming mode.
  • ZIP (.zip)
    Upload and concatenate multiple CSV files at once. You'll probably want all files in the archive to have the same structure (the same columns). If this is not the case, parts of the resulting table where a file didn't have a column that is present in other files will have missing values.
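
To illustrate what happens when the files in an archive do not share the same columns, here is a minimal Python sketch of such a concatenation. It is not Graphext's actual importer; the function name `concat_csv_archive`, the file names and the use of `None` as the missing value are assumptions made for this example.

```python
import csv
import io
import zipfile


def concat_csv_archive(zip_bytes: bytes) -> list:
    """Concatenate all CSV files in a ZIP archive into one list of rows.

    Hypothetical helper: a column absent from a file is filled with None
    (a missing value) for all rows coming from that file.
    """
    rows, columns = [], []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".csv"):
                continue
            text = archive.read(name).decode("utf-8")
            reader = csv.DictReader(io.StringIO(text, newline=""))
            # Collect the union of all column names, in order of appearance.
            for col in reader.fieldnames or []:
                if col not in columns:
                    columns.append(col)
            rows.extend(reader)
    # Give every row every column, filling the gaps with None.
    return [{col: row.get(col) for col in columns} for row in rows]


# Build a small in-memory archive with two files of differing structure.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "field_1,field_2\naaa,bbb\n")
    archive.writestr("b.csv", "field_1,field_3\nzzz,xxx\n")

table = concat_csv_archive(buffer.getvalue())
for row in table:
    print(row)
```

The row from `a.csv` has no value for `field_3`, and the row from `b.csv` has none for `field_2`, which is why archives with identically structured files are usually preferable.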

Note that instead of uploading a file manually you can also use our integrations to fetch data directly from Google Drive, Sheets or a database you have access to. For further details see the corresponding section in our help center.

For most formats, Graphext inspects the raw data to infer the correct type for each column (categorical, numeric, date etc.). The exceptions are SPSS (.sav) and Arrow (.arr) files, which already come with reliable type information.

CSV

While there is no "official" CSV standard, most implementations follow some common rules. We recommend adhering to the following guidelines, adapted from the Internet Engineering Task Force's RFC 4180.

  1. The first line in the file is a header line with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example:

    field_1,field_2,field_3
    aaa,bbb,ccc
    zzz,yyy,xxx
    
  2. Each actual data record is located on a separate line, delimited by a line break.

  3. The last record in the file may or may not have an ending line break.

  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and will not be ignored. The last field in the record must not be followed by a comma. For example:

    Good:

    field_1,field_2,field_3
    aaa,bbb,ccc
    zzz,yyy,xxx
    

    Bad:

    field_1,field_2,field_3
    aaa, bbb,ccc,
    zzz,yyy, xxx,
    
  5. Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

    "aaa","bbb","ccc"
    zzz,yyy,xxx
    
  6. Fields containing line breaks, double quotes, or commas must be enclosed in double quotes. For example:

    "aaa","b
    bb","ccc"
    zzz,yyy,xxx
    
  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","He said ""Hi!""","ccc"
    
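
If you generate CSV files programmatically, a standard CSV library usually takes care of the quoting and escaping rules above. The following is a small sketch using Python's built-in `csv` module; the sample rows are made up for illustration.

```python
import csv
import io

rows = [
    ["field_1", "field_2", "field_3"],
    ["aaa", 'He said "Hi!"', "ccc"],        # contains double quotes
    ["zzz", "line one\nline two", "x,y"],   # contains a line break and a comma
]

# newline="" keeps the writer's own line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original rows.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed == rows)
```

With `QUOTE_MINIMAL`, only fields that need it are enclosed in double quotes, and embedded double quotes are doubled as described in rule 7.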

JSON

We support three different JSON (JavaScript Object Notation) formats, which will be detected automatically by inspecting the beginning of a .json file.

JSON Lines

In the JSON Lines format, each line in the file is a JSON object representing one dataset row. Each object contains the field names as keys and the corresponding field values as values. For example:

{"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"}
{"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"}

For further details see the official JSON Lines documentation.
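
As a rough sketch, a file in this format can be produced in Python by serializing one object per line with the built-in `json` module. The variable names below are illustrative only.

```python
import json

records = [
    {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
    {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"},
]

# One JSON object per line, each terminated by a line break.
jsonl_text = "\n".join(json.dumps(record) for record in records) + "\n"
print(jsonl_text, end="")

# Parsing works line by line as well.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Note that `json.dumps` never emits raw line breaks inside an object by default, so each record is guaranteed to stay on a single line.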

List of records

In this format the file contains a JSON list of objects, where each object contains field names and values as key-value pairs. For example:

[
    {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
    {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"}
]

Notice how the first level represents a list, and that objects within this list are separated by a comma. Line breaks and spaces between fields are not required, so the following is an equivalent but more compact format that is equally valid:

[{"field_1":"aaa","field_2":"bbb","field_3":"ccc"},{"field_1":"zzz","field_2":"yyy","field_3":"xxx"}]

Object of columns

The last supported JSON format is column-oriented. In this format the file contains a JSON object at the highest level. This object has key-value pairs where each key is the name of a field/column, and each value is itself a JSON object mapping row indices to values ({index: value}). Since JSON object keys must be strings, the indices are written in double quotes. For example:

{
    "field_1": {"0": "aaa", "1": "zzz"},
    "field_2": {"0": "bbb", "1": "yyy"},
    "field_3": {"0": "ccc", "1": "xxx"}
}

In this format, line breaks and spaces between fields are also ignored, and so the following is equivalent:

{"field_1":{"0":"aaa","1":"zzz"},"field_2":{"0":"bbb","1":"yyy"},"field_3":{"0":"ccc","1":"xxx"}}

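To show how the row-oriented and column-oriented layouts relate, here is a minimal Python sketch that converts a list of records into an object of columns. The helper name `records_to_columns` is hypothetical. Because JSON requires object keys to be strings, the row indices are written as "0", "1" and so on.

```python
import json


def records_to_columns(records):
    """Turn a list of row objects into {column: {row_index: value}}."""
    columns = {}
    for index, record in enumerate(records):
        for name, value in record.items():
            # Row indices become string keys, as JSON requires.
            columns.setdefault(name, {})[str(index)] = value
    return columns


records = [
    {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
    {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"},
]

columns = records_to_columns(records)
compact = json.dumps(columns, separators=(",", ":"))
print(compact)
```
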
Automatic detection

As can be seen in the examples, each JSON format is easily identified by inspecting the first few lines of the file. We use the following heuristic:

  1. If the file starts with "[": assume the list-of-records format.

  2. If the file contains more than 1 line, and each of the first 2 lines starts with "{" and ends with "}", assume the JSON Lines format.

  3. In all other cases assume the object-of-columns format.
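
The heuristic above can be sketched in a few lines of Python. This is an illustration of the three rules, not Graphext's actual implementation; the function name `detect_json_format`, the returned labels, and the decision to ignore leading whitespace and blank lines are assumptions.

```python
def detect_json_format(text: str) -> str:
    """Guess the JSON layout from the beginning of a file's contents."""
    stripped = text.lstrip()  # assumption: leading whitespace is ignored

    # Rule 1: a top-level list means list-of-records.
    if stripped.startswith("["):
        return "list-of-records"

    # Rule 2: more than one line, and the first two each look like an object.
    lines = [line.strip() for line in stripped.splitlines() if line.strip()]
    if len(lines) > 1 and all(
        line.startswith("{") and line.endswith("}") for line in lines[:2]
    ):
        return "json-lines"

    # Rule 3: everything else is treated as object-of-columns.
    return "object-of-columns"


print(detect_json_format('[{"field_1": "aaa"}, {"field_1": "zzz"}]'))
print(detect_json_format('{"field_1": "aaa"}\n{"field_1": "zzz"}\n'))
print(detect_json_format('{\n    "field_1": {"0": "aaa", "1": "zzz"}\n}'))
```

One consequence of rule 2 is that a JSON Lines file with only a single row falls through to rule 3, so such files are treated as object-of-columns.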