
Commit c4da3c9

chore: improve cast documentation to add support per eval mode (apache#3056)
1 parent 009bf47 commit c4da3c9

2 files changed, 164 additions and 136 deletions

docs/source/user-guide/latest/compatibility.md

Lines changed: 95 additions & 99 deletions
@@ -73,122 +73,118 @@ should not be used in production. The feature will be enabled in a future releas
 
 Cast operations in Comet fall into three levels of support:
 
-- **Compatible**: The results match Apache Spark
-- **Incompatible**: The results may match Apache Spark for some inputs, but there are known issues where some inputs
+- **C (Compatible)**: The results match Apache Spark
+- **I (Incompatible)**: The results may match Apache Spark for some inputs, but there are known issues where some inputs
   will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting
   `spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not
   recommended for production use.
-- **Unsupported**: Comet does not provide a native version of this cast expression and the query stage will fall back to
+- **U (Unsupported)**: Comet does not provide a native version of this cast expression and the query stage will fall back to
   Spark.
+- **N/A**: Spark does not support this cast.
 
-### Compatible Casts
+### Legacy Mode
 
-The following cast operations are generally compatible with Spark except for the differences noted here.
+<!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
+
+<!--BEGIN:CAST_LEGACY_TABLE-->
+<!-- prettier-ignore-start -->
+| | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
+| boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
+| byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
+| date | N/A | U | U | - | U | U | U | U | U | U | C | U |
+| decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
+| double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
+| float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
+| integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
+| long | U | C | C | N/A | C | C | C | C | - | C | C | U |
+| short | U | C | C | N/A | C | C | C | C | C | - | C | U |
+| string | C | C | C | C | I | C | C | C | C | C | - | I |
+| timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
+<!-- prettier-ignore-end -->
+
+**Notes:**
+
+- **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
+- **double -> decimal**: There can be rounding differences
+- **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **float -> decimal**: There can be rounding differences
+- **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **string -> date**: Only supports years between 262143 BC and 262142 AD
+- **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
+or strings containing null bytes (e.g \\u0000)
+- **string -> timestamp**: Not all valid formats are supported
+<!--END:CAST_LEGACY_TABLE-->
+
+### Try Mode
 
 <!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
 
-<!--BEGIN:COMPAT_CAST_TABLE-->
+<!--BEGIN:CAST_TRY_TABLE-->
 <!-- prettier-ignore-start -->
-| From Type | To Type | Notes |
-|-|-|-|
-| boolean | byte | |
-| boolean | short | |
-| boolean | integer | |
-| boolean | long | |
-| boolean | float | |
-| boolean | double | |
-| boolean | string | |
-| byte | boolean | |
-| byte | short | |
-| byte | integer | |
-| byte | long | |
-| byte | float | |
-| byte | double | |
-| byte | decimal | |
-| byte | string | |
-| short | boolean | |
-| short | byte | |
-| short | integer | |
-| short | long | |
-| short | float | |
-| short | double | |
-| short | decimal | |
-| short | string | |
-| integer | boolean | |
-| integer | byte | |
-| integer | short | |
-| integer | long | |
-| integer | float | |
-| integer | double | |
-| integer | decimal | |
-| integer | string | |
-| long | boolean | |
-| long | byte | |
-| long | short | |
-| long | integer | |
-| long | float | |
-| long | double | |
-| long | decimal | |
-| long | string | |
-| float | boolean | |
-| float | byte | |
-| float | short | |
-| float | integer | |
-| float | long | |
-| float | double | |
-| float | string | There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45 |
-| double | boolean | |
-| double | byte | |
-| double | short | |
-| double | integer | |
-| double | long | |
-| double | float | |
-| double | string | There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45 |
-| decimal | boolean | |
-| decimal | byte | |
-| decimal | short | |
-| decimal | integer | |
-| decimal | long | |
-| decimal | float | |
-| decimal | double | |
-| decimal | decimal | |
-| decimal | string | There can be formatting differences in some case due to Spark using scientific notation where Comet does not |
-| string | boolean | |
-| string | byte | |
-| string | short | |
-| string | integer | |
-| string | long | |
-| string | float | |
-| string | double | |
-| string | binary | |
-| string | date | Only supports years between 262143 BC and 262142 AD |
-| binary | string | |
-| date | string | |
-| timestamp | long | |
-| timestamp | string | |
-| timestamp | date | |
+| | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
+| boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
+| byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
+| date | N/A | U | U | - | U | U | U | U | U | U | C | U |
+| decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
+| double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
+| float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
+| integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
+| long | U | C | C | N/A | C | C | C | C | - | C | C | U |
+| short | U | C | C | N/A | C | C | C | C | C | - | C | U |
+| string | C | C | C | C | I | C | C | C | C | C | - | I |
+| timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
 <!-- prettier-ignore-end -->
-<!--END:COMPAT_CAST_TABLE-->
 
-### Incompatible Casts
+**Notes:**
 
-The following cast operations are not compatible with Spark for all inputs and are disabled by default.
+- **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
+- **double -> decimal**: There can be rounding differences
+- **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **float -> decimal**: There can be rounding differences
+- **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **string -> date**: Only supports years between 262143 BC and 262142 AD
+- **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
+or strings containing null bytes (e.g \\u0000)
+- **string -> timestamp**: Not all valid formats are supported
+<!--END:CAST_TRY_TABLE-->
+
+### ANSI Mode
 
 <!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
 
-<!--BEGIN:INCOMPAT_CAST_TABLE-->
+<!--BEGIN:CAST_ANSI_TABLE-->
 <!-- prettier-ignore-start -->
-| From Type | To Type | Notes |
-|-|-|-|
-| float | decimal | There can be rounding differences |
-| double | decimal | There can be rounding differences |
-| string | decimal | Does not support fullwidth unicode digits (e.g \\uFF10)
-or strings containing null bytes (e.g \\u0000) |
-| string | timestamp | Not all valid formats are supported |
+| | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
+| boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
+| byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
+| date | N/A | U | U | - | U | U | U | U | U | U | C | U |
+| decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
+| double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
+| float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
+| integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
+| long | U | C | C | N/A | C | C | C | C | - | C | C | U |
+| short | U | C | C | N/A | C | C | C | C | C | - | C | U |
+| string | C | C | C | C | I | C | C | C | C | C | - | I |
+| timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
 <!-- prettier-ignore-end -->
-<!--END:INCOMPAT_CAST_TABLE-->
 
-### Unsupported Casts
+**Notes:**
+
+- **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
+- **double -> decimal**: There can be rounding differences
+- **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **float -> decimal**: There can be rounding differences
+- **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
+- **string -> date**: Only supports years between 262143 BC and 262142 AD
+- **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
+or strings containing null bytes (e.g \\u0000)
+- **string -> timestamp**: ANSI mode not supported
+<!--END:CAST_ANSI_TABLE-->
 
-Any cast not listed in the previous tables is currently unsupported. We are working on adding more. See the
-[tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.
+See the [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.
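
Reviewer sketch (not part of the commit): the legend at the top of this file's diff implies a simple rule for whether a cast runs natively or the query stage falls back to Spark. The snippet below models that rule in isolation. `CastCell`, its case objects, and `runsNatively` are hypothetical names introduced here for illustration, not Comet APIs, and treating **C** as "always runs natively" is an assumption read from the legend rather than something the diff states outright.

```scala
// Hypothetical stand-ins for the four cell values used in the matrices above.
// These are NOT Comet's actual types.
sealed trait CastCell
case object CompatibleCell extends CastCell    // "C"
case object IncompatibleCell extends CastCell  // "I"
case object UnsupportedCell extends CastCell   // "U"
case object NotApplicableCell extends CastCell // "N/A"

object CastFallbackSketch {

  /**
   * Returns true if the cast would run natively under the rule described in the legend,
   * and false if the query stage falls back to Spark (or, for N/A, if Spark itself
   * does not support the cast).
   *
   * `allowIncompatible` models the effect of setting
   * spark.comet.expression.Cast.allowIncompatible=true.
   */
  def runsNatively(cell: CastCell, allowIncompatible: Boolean = false): Boolean =
    cell match {
      case CompatibleCell    => true              // assumed: compatible casts run natively
      case IncompatibleCell  => allowIncompatible // falls back unless explicitly allowed
      case UnsupportedCell   => false             // no native version; always falls back
      case NotApplicableCell => false             // Spark does not support this cast
    }
}
```

For example, under this reading a `double -> decimal` cast (marked **I** in all three matrices) would fall back to Spark by default and would only run natively once the `allowIncompatible` setting is enabled.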

spark/src/main/scala/org/apache/comet/GenerateDocs.scala

Lines changed: 69 additions & 37 deletions
@@ -21,13 +21,14 @@ package org.apache.comet
 
 import java.io.{BufferedOutputStream, BufferedReader, FileOutputStream, FileReader}
 
+import scala.collection.mutable
 import scala.collection.mutable.ListBuffer
 
 import org.apache.spark.sql.catalyst.expressions.Cast
 
 import org.apache.comet.CometConf.COMET_ONHEAP_MEMORY_OVERHEAD
 import org.apache.comet.expressions.{CometCast, CometEvalMode}
-import org.apache.comet.serde.{Compatible, Incompatible, QueryPlanSerde}
+import org.apache.comet.serde.{Compatible, Incompatible, QueryPlanSerde, Unsupported}
 
 /**
  * Utility for generating markdown documentation from the configs.
@@ -109,48 +110,79 @@ object GenerateDocs {
     val w = new BufferedOutputStream(new FileOutputStream(filename))
     for (line <- lines) {
       w.write(s"${line.stripTrailing()}\n".getBytes)
-      if (line.trim == "<!--BEGIN:COMPAT_CAST_TABLE-->") {
-        w.write("<!-- prettier-ignore-start -->\n".getBytes)
-        w.write("| From Type | To Type | Notes |\n".getBytes)
-        w.write("|-|-|-|\n".getBytes)
-        for (fromType <- CometCast.supportedTypes) {
-          for (toType <- CometCast.supportedTypes) {
-            if (Cast.canCast(fromType, toType) && (fromType != toType || fromType.typeName
-                .contains("decimal"))) {
-              val fromTypeName = fromType.typeName.replace("(10,2)", "")
-              val toTypeName = toType.typeName.replace("(10,2)", "")
-              CometCast.isSupported(fromType, toType, None, CometEvalMode.LEGACY) match {
-                case Compatible(notes) =>
-                  val notesStr = notes.getOrElse("").trim
-                  w.write(s"| $fromTypeName | $toTypeName | $notesStr |\n".getBytes)
-                case _ =>
+      if (line.trim == "<!--BEGIN:CAST_LEGACY_TABLE-->") {
+        writeCastMatrixForMode(w, CometEvalMode.LEGACY)
+      } else if (line.trim == "<!--BEGIN:CAST_TRY_TABLE-->") {
+        writeCastMatrixForMode(w, CometEvalMode.TRY)
+      } else if (line.trim == "<!--BEGIN:CAST_ANSI_TABLE-->") {
+        writeCastMatrixForMode(w, CometEvalMode.ANSI)
+      }
+    }
+    w.close()
+  }
+
+  private def writeCastMatrixForMode(w: BufferedOutputStream, mode: CometEvalMode.Value): Unit = {
+    val sortedTypes = CometCast.supportedTypes.sortBy(_.typeName)
+    val typeNames = sortedTypes.map(_.typeName.replace("(10,2)", ""))
+
+    // Collect annotations for meaningful notes
+    val annotations = mutable.ListBuffer[(String, String, String)]()
+
+    w.write("<!-- prettier-ignore-start -->\n".getBytes)
+
+    // Write header row
+    w.write("| |".getBytes)
+    for (toTypeName <- typeNames) {
+      w.write(s" $toTypeName |".getBytes)
+    }
+    w.write("\n".getBytes)
+
+    // Write separator row
+    w.write("|---|".getBytes)
+    for (_ <- typeNames) {
+      w.write("---|".getBytes)
+    }
+    w.write("\n".getBytes)
+
+    // Write data rows
+    for ((fromType, fromTypeName) <- sortedTypes.zip(typeNames)) {
+      w.write(s"| $fromTypeName |".getBytes)
+      for ((toType, toTypeName) <- sortedTypes.zip(typeNames)) {
+        val cell = if (fromType == toType) {
+          "-"
+        } else if (!Cast.canCast(fromType, toType)) {
+          "N/A"
+        } else {
+          val supportLevel = CometCast.isSupported(fromType, toType, None, mode)
+          supportLevel match {
+            case Compatible(notes) =>
+              notes.filter(_.trim.nonEmpty).foreach { note =>
+                annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
               }
-            }
-          }
-        }
-        w.write("<!-- prettier-ignore-end -->\n".getBytes)
-      } else if (line.trim == "<!--BEGIN:INCOMPAT_CAST_TABLE-->") {
-        w.write("<!-- prettier-ignore-start -->\n".getBytes)
-        w.write("| From Type | To Type | Notes |\n".getBytes)
-        w.write("|-|-|-|\n".getBytes)
-        for (fromType <- CometCast.supportedTypes) {
-          for (toType <- CometCast.supportedTypes) {
-            if (Cast.canCast(fromType, toType) && fromType != toType) {
-              val fromTypeName = fromType.typeName.replace("(10,2)", "")
-              val toTypeName = toType.typeName.replace("(10,2)", "")
-              CometCast.isSupported(fromType, toType, None, CometEvalMode.LEGACY) match {
-                case Incompatible(notes) =>
-                  val notesStr = notes.getOrElse("").trim
-                  w.write(s"| $fromTypeName | $toTypeName | $notesStr |\n".getBytes)
-                case _ =>
+              "C"
+            case Incompatible(notes) =>
+              notes.filter(_.trim.nonEmpty).foreach { note =>
+                annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
               }
-            }
+              "I"
+            case Unsupported(_) =>
+              "U"
           }
         }
-        w.write("<!-- prettier-ignore-end -->\n".getBytes)
+        w.write(s" $cell |".getBytes)
+      }
+      w.write("\n".getBytes)
+    }
+
+    w.write("<!-- prettier-ignore-end -->\n".getBytes)
+
+    // Write annotations if any
+    if (annotations.nonEmpty) {
+      w.write("\n**Notes:**\n".getBytes)
+      for ((from, to, note) <- annotations.distinct) {
+        w.write(s"- **$from -> $to**: $note\n".getBytes)
       }
     }
-    w.close()
   }
 
   /** Read file into memory */
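
Reviewer sketch (not part of the commit): the matrix-rendering logic added in `writeCastMatrixForMode` can be tried in isolation, without Spark or Comet on the classpath. The version below is a rough, self-contained approximation. Types are plain strings, the `SupportLevel` hierarchy is a hypothetical stand-in for the classes imported from `org.apache.comet.serde`, and `canCast` / `isSupported` are caller-supplied functions standing in for `Cast.canCast` and `CometCast.isSupported`. It builds a `String` instead of writing to a `BufferedOutputStream` and omits the `prettier-ignore` markers.

```scala
import scala.collection.mutable

object CastMatrixSketch {

  // Hypothetical stand-ins that mirror the shape of the support levels used in the diff.
  sealed trait SupportLevel
  final case class Compatible(notes: Option[String] = None) extends SupportLevel
  final case class Incompatible(notes: Option[String] = None) extends SupportLevel
  final case class Unsupported(notes: Option[String] = None) extends SupportLevel

  /** Render a from/to support matrix as a markdown table, followed by any collected notes. */
  def render(
      types: Seq[String],
      canCast: (String, String) => Boolean,
      isSupported: (String, String) => SupportLevel): String = {
    val sorted = types.sorted
    val sb = new StringBuilder
    val annotations = mutable.ListBuffer[(String, String, String)]()

    // Record a note for the Notes section, skipping empty or whitespace-only notes
    def collect(from: String, to: String, notes: Option[String]): Unit =
      notes.map(_.trim).filter(_.nonEmpty).foreach(n => annotations += ((from, to, n)))

    // Header row and separator row
    sb.append("| |").append(sorted.map(t => s" $t |").mkString).append("\n")
    sb.append("|---|").append(sorted.map(_ => "---|").mkString).append("\n")

    // One data row per source type, one cell per target type
    for (from <- sorted) {
      sb.append(s"| $from |")
      for (to <- sorted) {
        val cell =
          if (from == to) "-"
          else if (!canCast(from, to)) "N/A"
          else
            isSupported(from, to) match {
              case Compatible(notes)   => collect(from, to, notes); "C"
              case Incompatible(notes) => collect(from, to, notes); "I"
              case Unsupported(_)      => "U"
            }
        sb.append(s" $cell |")
      }
      sb.append("\n")
    }

    // Notes section, written only when at least one note was collected
    if (annotations.nonEmpty) {
      sb.append("\n**Notes:**\n")
      for ((from, to, note) <- annotations.distinct) {
        sb.append(s"- **$from -> $to**: $note\n")
      }
    }
    sb.toString
  }
}
```

As in the commit, notes are gathered while the cells are emitted and written after the table, so each cell holds a single letter and the explanatory text lives in a separate list.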
