# Advanced data selection

Rediscover data exploration interactively

## Exploring with cross filters

Cross filters are one of the most powerful tools Graphext offers. They are natural to use, as they show the distribution of your variables, but enable exploration on different combinations of values, making the whole interface reactive.

For example, in this dataset of transactions from an e-commerce store, we can filter the transactions made between 2020 and 2022, which leaves us with 72% of the data: 1.1M rows out of the 1.6M we have in total.

Notice the relative scale on the right, spanning from 0 to 12% (really it’s more like ~13%). It now tells us how much of our data falls into each of the bars (called bins).

This (or any) selection affects every other cross filter. This is what makes them so powerful: they all behave like a single system that keeps you informed about the distributions of your variables.
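To make the idea concrete, here is a minimal Python sketch of how a selection on one variable changes the distribution shown for another. The rows and the `category_shares` helper are hypothetical and only illustrate the behavior described above; they are not how Graphext computes it internally.

```python
from collections import Counter

# Hypothetical toy rows: (year, category). Not taken from the dataset above.
rows = [
    (2019, "GIFT_CARD"),
    (2020, "GIFT_CARD"),
    (2020, "TOY"),
    (2021, "TOY"),
    (2021, "TOY"),
    (2022, "GIFT_CARD"),
    (2023, "TOY"),
    (2023, "BOOK"),
]


def category_shares(selected_rows):
    """Share of the given rows that falls into each category."""
    counts = Counter(category for _, category in selected_rows)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Selecting a range on the year filter...
selection = [row for row in rows if 2020 <= row[0] <= 2022]

# ...changes the distribution that the category filter would show.
everything = category_shares(rows)
selected = category_shares(selection)

print(len(selection) / len(rows))  # fraction of rows kept by the selection
print(everything)
print(selected)
```

In this toy data the `BOOK` category disappears from the selected distribution because none of its rows fall inside the chosen years.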

When selecting the category GIFT_CARD, we see a very prominent decrease in sales from the end of 2021 onwards.

It is worth noting that using cross filters affects the whole state of the application, meaning that Graph and Plot also react to whatever you are selecting.

## Sorting and filtering

Cross filters can also be sorted and searched, making surgically precise questions a breeze to answer.

### Sorting

You can sort categorical and text variables in several ways. The default is “by everything”, which sorts values by their frequency in descending order, so the most common items appear first.

You also have these other methods available:

- **Selection**: the same as “by everything” but just taking into account the current active selection.
- **Uplift**: the difference in frequency between the selection and the whole dataset. Bigger differences will appear first.
- **TF-IDF**: measures the importance of a term (or category) with respect to the whole dataset.
- **Ordinal**: if you have provided ordinal information to your variable, you can sort it this way.
- **Alphabetically**: sort the categories alphabetically in descending order.
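As a rough illustration of the frequency-based orderings, the following Python sketch sorts a handful of hypothetical category values by whole-dataset frequency, by selection frequency, and by uplift. Uplift is taken here as the selection's relative frequency minus the whole dataset's relative frequency, which is an assumption about the exact formula.

```python
from collections import Counter

# Hypothetical category values; the selection is a subset of the whole dataset.
everything = ["TOY"] * 50 + ["BOOK"] * 30 + ["GIFT_CARD"] * 20
selection = ["GIFT_CARD"] * 12 + ["TOY"] * 6 + ["BOOK"] * 2


def relative_frequencies(values):
    """Map each distinct value to its share of the given list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


background = relative_frequencies(everything)  # whole dataset
foreground = relative_frequencies(selection)   # current selection

# "By everything": most common values in the whole dataset first.
by_everything = sorted(background, key=background.get, reverse=True)

# "Selection": most common values within the current selection first.
by_selection = sorted(foreground, key=foreground.get, reverse=True)

# "Uplift" (assumed definition): selection share minus whole-dataset share.
uplift = {v: foreground.get(v, 0.0) - background[v] for v in background}
by_uplift = sorted(uplift, key=uplift.get, reverse=True)

print(by_everything)
print(by_selection)
print(by_uplift[0])
```

`GIFT_CARD` is the least common value overall, yet it ranks first by uplift because it is strongly over-represented in the selection.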

### Searching

Clicking the little magnifying glass in a cross filter will allow you to search through the different values it holds:

This allows you to pin any values you like, so they’re always visible and available to be selected.

## Custom Query Selection

Cross filters also allow you to make arbitrary selections, using a specific but simple syntax that expresses precisely what you need.

These are the operators you can use:

- logical: `AND, OR, NOT`
- numerical: `<, >, <=, >=`
- text: `REGEX, SUBSTR, FUZZY`
- frequency selection: `TOP, FREQ`
- sorting methods (to be used with `TOP`): `BACKGROUND, FOREGROUND, UPLIFT, TFIDF, ORDINAL`
- null values: `NULL`
- statistics: `MIN, MAX, MEAN, P25, MEDIAN, P75`

The query you build and the syntax available will depend on the type of the variable you are filtering:

**All** column types accept the **logical** operators to build complex queries using other operators, as well as the `NULL` operator, which exclusively returns null rows.

### Categorical & Text

**Categorical** and **Text** variables accept the `TOP`, `FREQ`, `FUZZY`, `SUBSTR` and `REGEX` operators. They work similarly since both are based on text: categories are just very short expressions, whereas texts tend to be longer.

- On any text-based variable, you can simply ask for `ball`, which will return all rows that contain the word `ball` by itself.
- `FUZZY(ball)` will return all rows that contain `ball`, whether it is part of other words or appears by itself. `FUZZY` is case insensitive and normalizes all input (a.k.a. ASCII folding) before searching.
- `SUBSTR(ball)` will return only rows that contain `ball` as part of a word, but not by itself.
- `REGEX()` will accept a regular expression string to match more complex patterns.

Selecting FUZZY(ball) yields more results than just ball
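The differences between the text operators can be sketched with Python's `re` module. This is only one reading of the behavior described above, not Graphext's implementation: the function names are hypothetical, `SUBSTR` is interpreted as "the term occurs inside a longer word", and case handling for the plain term and `SUBSTR` is left case sensitive because it is not specified here.

```python
import re
import unicodedata


def fold(text):
    """Lowercase and strip accents (a simple form of ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def matches_word(text, term):
    """Plain term: the term appears as a word by itself."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def matches_fuzzy(text, term):
    """FUZZY-like: the folded term appears anywhere in the folded text."""
    return fold(term) in fold(text)


def matches_substr(text, term):
    """SUBSTR-like (assumed reading): the term occurs inside a longer word."""
    escaped = re.escape(term)
    return re.search(rf"\w{escaped}|{escaped}\w", text) is not None


def matches_regex(text, pattern):
    """REGEX-like: any regular expression match."""
    return re.search(pattern, text) is not None


# Hypothetical rows of a text column.
rows = ["red ball", "Football boots", "balloon set", "Bállpoint pen", "tennis racket"]

for label, matcher in [
    ("ball", matches_word),
    ("FUZZY(ball)", matches_fuzzy),
    ("SUBSTR(ball)", matches_substr),
]:
    print(label, [row for row in rows if matcher(row, "ball")])
```

On these rows the fuzzy match is the broadest of the three, since it also picks up the accented `Bállpoint` after folding.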

- `FREQ` selects those terms whose frequency is greater than or equal to `n`: `FREQ(10000)` selects those values whose frequency is greater than or equal to 10K.
- `TOP` selects the top `n` terms by frequency: `TOP(10)` selects the top 10 terms by frequency.
  - We can, however, modify `TOP`’s behavior by saying `TOP(10, FOREGROUND)`, which would select the top 10 **out of the current selection we have made**. `TOP(10, UPLIFT)` selects the top 10 after sorting them by how different the frequency is between the selection and the whole dataset. These sorting methods are the same as the ones mentioned in the sorting section.

Keep in mind that the sorting methods will not visually sort the elements in the cross filter, but they will just return the relevant elements once they are filtered and sorted. To sort the elements visually, you can head to the sorting section.

Since ELECTRONIC_CABLE has ~17K occurrences, it was left out of the greater-than-20K selection.
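A small Python sketch of what `FREQ` and `TOP` select, using hypothetical `freq` and `top` helpers and toy data. Passing the selection instead of the whole dataset to `top` mimics the `FOREGROUND` variant; this illustrates the described behavior and is not Graphext's own code.

```python
from collections import Counter


def freq(values, n):
    """Values whose frequency is greater than or equal to n."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count >= n}


def top(values, n):
    """The n most frequent values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(n)]


# Hypothetical category column and a current selection drawn from it.
everything = ["TOY"] * 5 + ["BOOK"] * 3 + ["GIFT_CARD"] * 2 + ["CABLE"]
selection = ["GIFT_CARD"] * 2 + ["CABLE"] + ["TOY"]

print(freq(everything, 3))  # like FREQ(3)
print(top(everything, 2))   # like TOP(2), ranked over the whole dataset
print(top(selection, 1))    # like TOP(1, FOREGROUND), ranked over the selection
```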

### Numerical & Dates

**Numerical** variables will naturally accept the numerical operators, as well as the statistical operators. Dates work similarly because, internally, they’re just a really long number.

Simply selecting a range in the little plot will create a query string with the greater/less than and AND operators.

The statistical operators provide a quick shortcut to the most relevant sections of a distribution:

- MIN: selects all rows that have the same value as the minimum value found
- MAX: selects all rows that have the same value as the maximum value found
- MEDIAN: selects all rows that have the same value as the median
- P25: selects all rows that have the same value as the first quartile
- P75: selects all rows that have the same value as the third quartile

To build a query spanning from the first quartile to the third, you can simply say `>= P25 AND <= P75`, which returns all rows that are both greater than or equal to P25 and less than or equal to P75. This effectively returns the interquartile range.
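The same selection can be sketched in Python with the standard `statistics` module. The values are hypothetical, and the quartile method used here (`"inclusive"`) is an assumption: different tools compute quartiles slightly differently, so the exact cut points may not match what the application reports.

```python
import statistics

# Hypothetical numerical column.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartile cut points; the "inclusive" method is an assumption for this sketch.
p25, median, p75 = statistics.quantiles(values, n=4, method="inclusive")

# Equivalent of the query `>= P25 AND <= P75`.
selected = [v for v in values if p25 <= v <= p75]

print(p25, median, p75)
print(selected)
```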

**Dates** behave in much the same way as numbers, except that they use the ISO 8601 representation for the date itself.

We can say things like `>= 2018-06-05T11:33:48.554Z AND <= 2021-06-26T05:56:18.172Z`, which means anything *after* June 5th, 2018 at 11:33:48 **and** *before* June 26th, 2021 at 05:56:18, effectively returning dates within that time interval. As with numbers, simply creating a range in the cross filter will generate this query for you to adjust, in case more precision is needed.
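To show why dates can be compared like numbers, here is a minimal Python sketch that parses the two ISO 8601 timestamps from the query above and filters a few hypothetical timestamps against them. The `parse_iso` helper and the sample rows are illustrative assumptions.

```python
from datetime import datetime


def parse_iso(timestamp):
    """Parse an ISO 8601 timestamp with a trailing "Z" (UTC)."""
    # Older Python versions do not accept "Z", so map it to an explicit offset.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


start = parse_iso("2018-06-05T11:33:48.554Z")
end = parse_iso("2021-06-26T05:56:18.172Z")

# Hypothetical date column.
dates = [
    "2018-01-01T00:00:00.000Z",
    "2019-03-10T08:00:00.000Z",
    "2021-06-26T05:56:18.172Z",
    "2022-01-01T00:00:00.000Z",
]

# Equivalent of `>= start AND <= end`; both bounds are inclusive.
selected = [d for d in dates if start <= parse_iso(d) <= end]

# Each date maps to a single number (seconds since the Unix epoch).
print(start.timestamp(), end.timestamp())
print(selected)
```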

## Re-ordering cross filters

Just in case you missed it, you can group, pin and rearrange variables, so the most important information is always where you want it to be.
