parse
Returns a DataFrame in which the given String and Char columns are parsed into other types.
This is a special case of the convert operation.
This parsing operation is sometimes executed implicitly, for example, when reading from CSV or type converting from String/Char columns. You can recognize this by the locale or parserOptions arguments in these functions.
Related operations: Update / convert values
When no columns are specified, all String and Char columns are parsed, even those inside column groups and inside frame columns.
To parse only particular columns, use a column selector:
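A minimal sketch of both forms, assuming a small hypothetical frame `people` built with `dataFrameOf`. The column is named by string here to keep the snippet self-contained; a column-selector lambda can be used in the same position.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: every column starts out as String.
val people = dataFrameOf(
    "name" to listOf("Alice", "Bob"),
    "age" to listOf("15", "20"),
    "weight" to listOf("54.5", "61.0"),
)

// No columns specified: every String/Char column is considered for parsing.
fun parseAllColumns() = people.parse()

// Only "age" is parsed; "name" and "weight" stay String columns.
fun parseOnlyAge() = people.parse("age")
```

Columns whose values match no parser, such as `name` here, are expected to stay as they are.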
Parsing Order
parse tries to parse every String/Char column into one of the supported types in the following order:
| Type | Notes |
|---|---|
| | |
| | |
| | (K) - Requires |
| | (K) - Requires |
| Custom Kotlin date-time types ( | (K) - Where a |
| Custom Java date-time types ( | (J) - Where a |
| Custom global Kotlin date-time types ( | (K, P) - Where a |
| Custom global Java date-time types ( | (J, P) - Where a |
| Default Kotlin date-time ISO types ( | (K, P) |
| Default Java date-time ISO types ( | (J, P) |
| | (K) |
| | (J) |
| | (K) |
| | (K) |
| | |
| | with "C.UTF-8" locale, used as fallback for |
| | "true" / "false", "t" / "f", "yes" / "no" with any capitalization. |
| | Requires |
| | |
| | |
| | Requires the |
| | |
| | |
You can get this list by accessing availableParserTypes on the Global Parser Options as well.
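To make the Boolean row of the table concrete, here is a standalone sketch of the accepted spellings. `parseBooleanOrNull` is a hypothetical helper written for illustration; it is not DataFrame's implementation.

```kotlin
// Hypothetical helper: accepts "true"/"false", "t"/"f", "yes"/"no" in any capitalization
// and returns null for everything else, so that another parser could be tried next.
fun parseBooleanOrNull(text: String): Boolean? =
    when (text.lowercase()) {
        "true", "t", "yes" -> true
        "false", "f", "no" -> false
        else -> null
    }
```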
When .parse() is called on a single column and the output type is the same as the input (String/Char) type, i.e., the column cannot be parsed any further, an IllegalStateException is thrown. To avoid this, use col.tryParse() instead.
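A short sketch of the difference, assuming a hypothetical column `words` whose values match none of the parsers:

```kotlin
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical column: none of these values can be parsed into another type.
val words = columnOf("apple", "banana")

// tryParse() hands back a String column instead of throwing.
fun parseLeniently(): DataColumn<*> = words.tryParse()

// parse() is expected to throw IllegalStateException for this column.
fun parseStrictly(): DataColumn<*> = words.parse()
```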
Parser Options
DataFrame supports multiple parser options that can be used to customize the parsing behavior. These can be supplied to the parse function (or any other function that can implicitly parse Strings) as an argument.
For each option you don't supply (or supply null) DataFrame will take the value from the Global Parser Options.
Available parser options:
locale: Locale is used to parse doubles (and Java date-time types).
Global default locale is Locale.getDefault().
dateTime: DateTimeParserOptions can be used to force parsing to Kotlin or Java date-time classes, and to override default and custom global date-time formats. By default, it's null, meaning we try Kotlin types first, and if that fails, we try Java types. See Parsing Date-time Strings. This argument was added in DataFrame 1.0.0-Beta5.
nullStrings: List<String> is used to treat particular strings as null values.
Global default null strings are ["null", "NULL", "NA", "N/A"].
When reading from CSV, these are expanded to ["", "NA", "N/A", "null", "NULL", "None", "none", "NIL", "nil"]. See the KDocs there for the exact details.
skipTypes: Set<KType> lists the types that should be skipped during parsing.
Empty set by global default; parsing can result in any supported type.
useFastDoubleParser: Boolean is used to enable or disable the new fast double parser.
Enabled by global default.
parseExperimentalUuid: Boolean is used to enable or disable parsing to the experimental kotlin.uuid.Uuid class.
Disabled by global default.
parseExperimentalInstant: Boolean is used to enable or disable parsing to the kotlin.time.Instant class, available from Kotlin 2.1+. Will parse to kotlinx.datetime.Instant if false.
Disabled by global default, enabled in DataFrame 1.0.0-Beta5.
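As a sketch of supplying options for a single call, the snippet below passes only `locale` and leaves every other option to the global defaults. The frame `prices` is hypothetical sample data.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import java.util.Locale

// Hypothetical sample data using ',' as the decimal separator.
val prices = dataFrameOf("price" to listOf("1,5", "2,25"))

// Only the locale is overridden for this call; unspecified options fall back to the global ones.
fun parsePricesWithGermanLocale() =
    prices.parse(options = ParserOptions(locale = Locale.GERMANY))
```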
Global Parser Options
As mentioned before, you can change the default global parser options that will be used by read, convert, and other parse operations. Whenever you don't explicitly provide parser options to a function call or leave any of its arguments null, DataFrame will use these global options instead.
For example, to change the locale to French and add a custom Java date-time pattern for all following DataFrame calls, do:
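A hedged sketch of such a configuration. `addJavaDateTimePattern<T>(pattern)` is used as it is named on this page; the exact signature, the pattern string, and the wrapper function `configureGlobalParser` are assumptions made for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import java.time.LocalDateTime
import java.util.Locale

// Hypothetical wrapper; the two statements inside are the actual configuration.
fun configureGlobalParser() {
    // Doubles (and Java date-time types) are now parsed with the French locale.
    DataFrame.parser.locale = Locale.FRANCE
    // Assumed signature: registers a pattern that yields java.time.LocalDateTime values.
    DataFrame.parser.addJavaDateTimePattern<LocalDateTime>("dd.MM.uuuu HH:mm:ss")
}
```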
For locale, this means that the one being used by the parser is defined as:
↪ The locale given as function argument directly, or in parserOptions, if it is not null, else
↪ The locale set by DataFrame.parser.locale = ..., if it is not null, else
↪ Locale.getDefault(), which is the system's default locale that can be changed with Locale.setDefault().
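The lookup order above amounts to a chain of null checks. The following standalone sketch mirrors it; `resolveLocale` and its parameters are hypothetical names, not part of the DataFrame API.

```kotlin
import java.util.Locale

// Hypothetical helper mirroring the documented order:
// function argument first, then the global parser option, then the system default.
fun resolveLocale(argumentLocale: Locale?, globalLocale: Locale?): Locale =
    argumentLocale ?: globalLocale ?: Locale.getDefault()
```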
Global Parser Options can also be adjusted to change whether some parsers are included or excluded in a parsing call:
parseExperimentalUuid, parseExperimentalInstant
skipTypes
dateTimeLibrary (JAVA, KOTLIN, or null)
These settings, however, will only affect functions that call parse(). They will not affect the behavior of convert operations (with useFastDoubleParser being the exception).
In other words:
Global parser options can always be reset to default by calling:
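A minimal sketch of the reset, assuming the `resetToDefault()` function on `DataFrame.parser`; the wrapper function is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical wrapper: drops custom locale, formats, null strings, and skipped types
// that were set on the global parser options earlier.
fun resetGlobalParserOptions() {
    DataFrame.parser.resetToDefault()
}
```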
Parsing Doubles
DataFrame has a new fast and powerful double parser enabled by default. It is based on the FastDoubleParser library for its high performance and configurability (in the future, we might expand this support to Float, BigDecimal, and BigInteger as well).
The parser is locale-aware; it will use the locale set by the (global) parser options to parse the doubles. It also has a fallback mechanism built in, meaning it can recognize characters from all other locales (and some from Wikipedia) and parse them correctly as long as they don't conflict with the current locale.
For example, if your locale uses ',' as decimal separator, it will not recognize ',' as thousands separator, but it will recognize ''', ' ', '٬', '_', ' ', etc. as such. The same holds for characters like "e", "inf", "×10^", "NaN", etc. (ignoring case).
This means you can safely parse "123'456 789,012.345×10^6" with a US locale but not "1.234,5".
Aside from this, DataFrame also explicitly recognizes "∞", "inf", "infinity", and "infty" as Double.POSITIVE_INFINITY (as well as their negative counterparts), "nan", "na", and "n/a" as Double.NaN, and all forms of whitespace are treated equally.
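The special values listed above can be pictured as a small lookup. This is a standalone illustration, not DataFrame's code: `parseSpecialDouble` is a hypothetical helper, and matching case-insensitively and accepting a leading sign are assumptions.

```kotlin
// Hypothetical helper: recognizes the infinity and NaN spellings named above
// and returns null for anything else (which a numeric parser would then handle).
fun parseSpecialDouble(text: String): Double? {
    val normalized = text.trim().lowercase()
    val negative = normalized.startsWith("-")
    val body = normalized.removePrefix("-").removePrefix("+")
    return when (body) {
        "∞", "inf", "infinity", "infty" ->
            if (negative) Double.NEGATIVE_INFINITY else Double.POSITIVE_INFINITY
        "nan", "na", "n/a" -> Double.NaN
        else -> null
    }
}
```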
If FastDoubleParser fails to parse a String as Double, DataFrame will try to parse it using the standard NumberFormat.parse() function as a last resort.
If you experience any issues with the new parser, you can turn it off by setting useFastDoubleParser = false, which will use the old NumberFormat.parse() function instead.
Please report any issues you encounter.
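For reference, this is roughly what locale-aware parsing with the standard `NumberFormat` looks like on its own. The helper `parseWithNumberFormat` is hypothetical, and rejecting input that is only partially consumed is an assumption of this sketch rather than documented DataFrame behavior.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: parses text with the locale's separators and
// returns null unless the whole string was consumed.
fun parseWithNumberFormat(text: String, locale: Locale): Double? {
    val position = ParsePosition(0)
    val number = NumberFormat.getInstance(locale).parse(text, position)
    return if (number != null && position.index == text.length) number.toDouble() else null
}
```

With `Locale.GERMANY`, "1.234,5" is read as 1234.5 because '.' is the grouping separator and ',' the decimal separator in that locale.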
Parsing Date-time Strings
By default, DataFrame tries parsing date-time strings using:
Custom global Kotlin and Java date-time formats, if provided;
Default Kotlin and Java ISO date-time formats.
You can customize this behavior from the Global Parser Options by:
Providing custom date-time formats/formatters and/or custom date-time patterns:
For Kotlin date-time types: addDateTimeFormat<T>(format), addDateTimeUnicodePattern<T>(pattern);
For Java date-time types: addJavaDateTimeFormatter<T>(formatter), addJavaDateTimePattern<T>(pattern);
Forcing one or the other date-time format type by changing dateTimeLibrary to KOTLIN or JAVA (by default, null; both can be parsed to, but Kotlin has priority);
Resetting to default formats.
Example, parsing a date-time string by adding a custom format to global parser options:
Example, parsing a date-time string by adding a custom pattern to global parser options:
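A hedged sketch of the pattern variant. `addJavaDateTimePattern<T>(pattern)` is taken from the list of functions above; its exact signature, the pattern, and the sample data are assumptions. The global options are reset afterwards so that the custom pattern does not leak into later calls.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import java.time.LocalDateTime

fun parseWithGlobalPattern(): DataFrame<*> {
    // Assumed signature: registers a global pattern for java.time.LocalDateTime.
    DataFrame.parser.addJavaDateTimePattern<LocalDateTime>("dd.MM.uuuu HH:mm:ss")
    try {
        // Hypothetical sample data in a non-ISO layout.
        val events = dataFrameOf(
            "timestamp" to listOf("31.12.2023 23:59:59", "01.01.2024 00:00:00"),
        )
        return events.parse()
    } finally {
        // Restore the defaults regardless of whether parsing succeeded.
        DataFrame.parser.resetToDefault()
    }
}
```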
It is only possible to supply patterns or formats for a supported date-time type. For Kotlin, these are LocalDateTime, LocalDate, LocalTime, YearMonth, UtcOffset, and DateTimeComponents (i.e., all kotlinx-datetime types that have a .Format {} builder).
For Java, these are LocalDateTime, LocalDate, LocalTime, and Instant. We might expand these in the future. Let us know if you need any other types.
ParserOptions.dateTime: DateTimeParserOptions:
If a parsing function is provided with ParserOptions and ParserOptions.dateTime is not null, the global dateTimeLibrary parser option will be overridden.
Concretely, ParserOptions(dateTime = DateTimeParserOptions.Java) is equivalent to having DataFrame.parser.dateTimeLibrary = ParseDateTimeLibrary.JAVA for that particular function call. In addition, if that DateTimeParserOptions has any custom formats or patterns, the custom- and default global formats will be ignored, allowing you to essentially override them.
The two DateTimeParserOptions can be created from a set of type-format(ter) pairs, or using a builder-like pattern:
Example, parsing a date-time string by adding a custom format to parser options:
Example, parsing a date-time string by adding a custom pattern to parser options:
Some functions, like convertToLocalDate(), take a DateTimeFormat or Unicode date-time pattern directly. This is a shortcut for ParserOptions+DateTimeParserOptions that behaves exactly the same as the builder-like pattern above.
Java Locale argument:
DateTimeParserOptions.Java has a locale argument. This can adjust the locale used for parsing date-time strings and can have a different value than the locale in Parser Options. If null, ParserOptions.locale will be used instead. If that is null, we default to Global Parser Options, and finally to the default system locale.
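Why the locale matters for Java date-time parsing can be seen with plain `java.time`, independent of DataFrame. The helper `parseLocalizedDate` and the pattern are hypothetical choices for this sketch.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical helper: the same pattern accepts different month names depending on the locale.
fun parseLocalizedDate(text: String, locale: Locale): LocalDate =
    LocalDate.parse(text, DateTimeFormatter.ofPattern("d MMMM uuuu", locale))
```

For instance, "3 janvier 2024" parses with `Locale.FRANCE` and "3 January 2024" with `Locale.US`; both give the same `LocalDate`.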
Kotlin DateTimeComponents fallback mechanism:
When DataFrame.convert or DataColumn.convertTo fails to parse a String into a kotlinx-datetime type such as LocalDate, the DateTimeComponents fallback mechanism kicks in. It is often possible to parse the date-time string into the more flexible DateTimeComponents first and then convert that to LocalDate, with a potential small loss of information. This means we can successfully call:
even though
would produce a DateTimeComponents column.
Take this mechanism into account when providing custom DateTimeFormats to the (global) ParserOptions.
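The idea behind the fallback can be reproduced with kotlinx-datetime alone. This sketch is not DataFrame's implementation; `parseDateViaComponents` is a hypothetical helper, and the choice of `ISO_DATE_TIME_OFFSET` as the flexible format is an assumption.

```kotlin
import kotlinx.datetime.LocalDate
import kotlinx.datetime.format.DateTimeComponents

// Hypothetical helper: parse into the flexible DateTimeComponents first,
// then keep only the date part. The time and the UTC offset are discarded.
fun parseDateViaComponents(text: String): LocalDate =
    DateTimeComponents.Formats.ISO_DATE_TIME_OFFSET.parse(text).toLocalDate()
```

A string such as "2023-01-02T03:04:05+01:00" is rejected by `LocalDate.parse`, but it goes through the helper above and yields the date 2023-01-02.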