Class SemanticChunkingConfiguration
- All Implemented Interfaces:
Serializable, SdkPojo, ToCopyableBuilder<SemanticChunkingConfiguration.Builder, SemanticChunkingConfiguration>
Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.
With semantic chunking, each sentence is compared to the next to determine how similar they are. You specify a threshold as a percentile: adjacent sentences that are less similar than that percentage of sentence pairs are divided into separate chunks. For example, if you set the threshold to 90, the 10 percent of sentence pairs that are least similar are split. So if you have 101 sentences, 100 sentence pairs are compared, and the 10 least similar pairs are split, creating 11 chunks. These chunks are split further if they exceed the maximum token size.
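The percentile arithmetic in the example above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the `breakpoints` helper, and the made-up similarity scores are hypothetical, and the service may compute the percentile cut-off differently from the simple "take the least similar N percent" rule used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PercentileSplitSketch {

    // Returns the sorted indices i of the sentence pairs (i, i + 1) that are split.
    // pairSimilarity[i] is a hypothetical similarity score between sentence i and i + 1.
    static List<Integer> breakpoints(double[] pairSimilarity, int percentileThreshold) {
        int pairs = pairSimilarity.length;
        // Assumption: a threshold of 90 splits the 10 percent of pairs that are least similar.
        int splits = pairs * (100 - percentileThreshold) / 100;

        Integer[] order = new Integer[pairs];
        for (int i = 0; i < pairs; i++) {
            order[i] = i;
        }
        // Order pair indices so that the least similar pairs come first.
        Arrays.sort(order, (a, b) -> Double.compare(pairSimilarity[a], pairSimilarity[b]));

        List<Integer> result = new ArrayList<>(Arrays.asList(order).subList(0, splits));
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // 101 sentences give 100 adjacent pairs; the scores are made-up, distinct values.
        double[] similarity = new double[100];
        for (int i = 0; i < similarity.length; i++) {
            similarity[i] = ((i * 37) % 100) / 100.0;
        }
        List<Integer> splits = breakpoints(similarity, 90);
        System.out.println("splits = " + splits.size() + ", chunks = " + (splits.size() + 1));
    }
}
```

With 100 pairs and a threshold of 90, the sketch selects 10 breakpoints, which corresponds to the 11 chunks described in the text.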
You must also specify a buffer size, which determines whether sentences are compared in isolation or within a moving context window that includes the previous and following sentence. For example, if you set the buffer size to 1, the embedding for sentence 10 is derived from sentences 9, 10, and 11 combined.
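The moving context window can be illustrated with a small sketch. The `window` helper and the placeholder sentences `S1` to `S12` are hypothetical; joining sentences with a space and clamping the window at the start and end of the document are assumptions, not documented service behavior.

```java
import java.util.Arrays;
import java.util.List;

class BufferWindowSketch {

    // Joins the sentence at `index` (0-based) with up to `bufferSize` neighbours on each side.
    // Assumption: the window is clamped at the document boundaries.
    static String window(List<String> sentences, int index, int bufferSize) {
        int from = Math.max(0, index - bufferSize);
        int to = Math.min(sentences.size(), index + bufferSize + 1);
        return String.join(" ", sentences.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "S1", "S2", "S3", "S4", "S5", "S6",
                "S7", "S8", "S9", "S10", "S11", "S12");

        // Sentence 10 has 0-based index 9.
        System.out.println(window(sentences, 9, 1)); // sentences 9, 10, and 11 combined
        System.out.println(window(sentences, 9, 0)); // sentence 10 in isolation
    }
}
```

The text that `window` returns is what would be passed to the embedding model for that sentence position in this sketch.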
Nested Class Summary
Nested Classes: SemanticChunkingConfiguration.Builder
Method Summary
Modifier and Type, Method, Description:
- final Integer breakpointPercentileThreshold(): The dissimilarity threshold for splitting chunks.
- final Integer bufferSize(): The buffer size.
- builder()
- final boolean equals(Object obj)
- final boolean equalsBySdkFields(Object obj): Indicates whether some other object is "equal to" this one by SDK fields.
- final <T> Optional<T> getValueForField(String fieldName, Class<T> clazz)
- final int hashCode()
- final Integer maxTokens(): The maximum number of tokens that a chunk can contain.
- static Class<? extends SemanticChunkingConfiguration.Builder> serializableBuilderClass()
- toBuilder(): Take this object and create a builder that contains all of the current property values of this object.
- final String toString(): Returns a string representation of this object.
Methods inherited from interface software.amazon.awssdk.utils.builder.ToCopyableBuilder: copy
-
Method Details
-
breakpointPercentileThreshold
The dissimilarity threshold for splitting chunks.
- Returns:
- The dissimilarity threshold for splitting chunks.
-
bufferSize
The buffer size.
- Returns:
- The buffer size.
-
maxTokens
The maximum number of tokens that a chunk can contain.
- Returns:
- The maximum number of tokens that a chunk can contain.
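To illustrate how a chunk that exceeds the token limit might be split further, here is a minimal sketch. It assumes whitespace-delimited words stand in for tokens, which is a simplification: real token counts depend on the embedding model's tokenizer. The `splitByMaxTokens` helper is hypothetical and is not part of the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MaxTokensSketch {

    // Splits a chunk into pieces of at most maxTokens whitespace-delimited words.
    // Assumption: one word counts as one token.
    static List<String> splitByMaxTokens(String chunk, int maxTokens) {
        if (maxTokens <= 0) {
            throw new IllegalArgumentException("maxTokens must be positive");
        }
        List<String> pieces = new ArrayList<>();
        String trimmed = chunk.trim();
        if (trimmed.isEmpty()) {
            return pieces;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start += maxTokens) {
            int end = Math.min(tokens.length, start + maxTokens);
            pieces.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // A five-word chunk with a limit of two tokens yields three pieces.
        System.out.println(splitByMaxTokens("a b c d e", 2));
    }
}
```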
-
toBuilder
Description copied from interface: ToCopyableBuilder
Take this object and create a builder that contains all of the current property values of this object.
- Specified by:
- toBuilder in interface ToCopyableBuilder<SemanticChunkingConfiguration.Builder, SemanticChunkingConfiguration>
- Returns:
- a builder for type T
-
builder
-
serializableBuilderClass
-
hashCode
-
equals
-
equalsBySdkFields
Description copied from interface: SdkPojo
Indicates whether some other object is "equal to" this one by SDK fields. An SDK field is a modeled, non-inherited field in an SdkPojo class, and is generated based on a service model.
If an SdkPojo class does not have any inherited fields, equalsBySdkFields and equals are essentially the same.
- Specified by:
- equalsBySdkFields in interface SdkPojo
- Parameters:
- obj - the object to be compared with
- Returns:
- true if the other object is equal to this object by SDK fields, false otherwise.
-
toString
-
getValueForField
-
sdkFields
-
sdkFieldNameToField
- Specified by:
- sdkFieldNameToField in interface SdkPojo
- Returns:
- The mapping between the field name and its corresponding field.
-