Interim results specification

Live results are provided at a fixed rate. To avoid ambiguity, each result and each output carries its own start (st) and end (et) time, given in seconds. The time step is currently set to 3 seconds. The features currently provided are alerts, intensity, vocal variety, speaking rate, hesitation, strength, positivity, politeness, engagement, and emotion. There is also an id property that maps to the sequence and/or the process id, as well as an is_final property that flags the end of the stream. More specifically, the outputs are the following:

  • Intensity indicates the degree of the signal energy and has a value between 0 and 10 per speaker.
  • Vocal variety indicates the degree of sound variations within speech and has a value between 0 and 10 per speaker. Low vocal variety means that a speaker is monotonous.
  • Speaking rate describes how fast a speaker talks and takes one of the values "very slow", "slow", "normal", "fast", or "very fast".
  • Hesitation indicates whether a speaker sounds hesitant and takes the value "yes" or "no".
  • Strength describes the strength of a speaker's speech and takes one of the values "neutral", "weak", or "strong".
  • Positivity describes the positivity in a speaker's speech and takes one of the values "neutral", "negative", or "positive".
  • Politeness describes the politeness in a speaker's speech and takes one of the values "neutral", "rude", or "polite".
  • Engagement describes the engagement in a speaker's speech and takes one of the values "neutral", "withdrawn", or "engaged".
  • Emotion describes the emotion in a speaker's speech and takes one of the values "happy", "neutral", "angry", "sad", or "frustrated".

Speaking rate, hesitation, strength, positivity, politeness, engagement, and emotion each also return a confidence output with one of the discrete values "low", "medium", or "high".
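
As an illustration of how the per-speaker values and confidence levels might be consumed, the following Python sketch keeps only the labels whose confidence meets a threshold. The function name confident_labels, the ordering table, and the sample result (including the format of its id) are hypothetical; only the field names are taken from the schema in this document. Skipping labels that carry no confidence is an assumption of the sketch, not documented behaviour.

```python
# Ordering of the discrete confidence levels, lowest to highest.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

# Features that carry a {"value": ..., "confidence": ...} object per speaker.
CATEGORICAL_FEATURES = (
    "speaking_rate", "hesitation", "strength",
    "positivity", "politeness", "engagement", "emotion",
)


def confident_labels(result, min_confidence="medium"):
    """Return {(feature, speaker): value} for labels at or above min_confidence.

    `result` is one interim result already parsed into a dict. Labels without
    a confidence field are skipped (an assumption of this sketch).
    """
    threshold = CONFIDENCE_RANK[min_confidence]
    kept = {}
    for feature in CATEGORICAL_FEATURES:
        block = result.get(feature)
        if not isinstance(block, dict):
            continue  # feature absent from this result
        for speaker in ("speaker1", "speaker2"):
            label = block.get(speaker)
            if not isinstance(label, dict):
                continue  # no output for this speaker
            confidence = label.get("confidence")
            if CONFIDENCE_RANK.get(confidence, -1) >= threshold:
                kept[(feature, speaker)] = label["value"]
    return kept


# Hypothetical interim result, for illustration only.
sample = {
    "id": "1234-5",
    "speaking_rate": {
        "st": 3, "et": 6,
        "speaker1": {"value": "fast", "confidence": "medium"},
    },
    "emotion": {
        "st": 3, "et": 6,
        "speaker1": {"value": "happy", "confidence": "high"},
        "speaker2": {"value": "sad", "confidence": "low"},
    },
}

labels = confident_labels(sample)
```

With the default threshold of "medium", the low-confidence "sad" label for speaker2 is dropped while the other two labels are kept.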

Finally, alerts contains a list of alerts for the agent, triggered by various events in the call. Each alert is represented by its unique name, a start ("st") and end ("et") time in seconds, and a description ("desc"). Currently implemented alerts are:

  • extended_silence: The agent has been silent for too long.
  • extended_overlap: The agent is speaking over the customer for too long.
  • slow_to_respond: The agent is slow to respond to something the customer has said.
  • speaking_slowly: The agent is speaking very slowly.
  • continuous_speaking: The agent is speaking continuously for too long.
  • energy_cue: The agent appears to sound too weak.
  • empathy_cue: The customer sounded negative and the agent has an opportunity to show empathy.
  • agent_energy_cue: The agent appears to be agitated.
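
A client will typically walk the alerts list of each result and act on the individual entries. The Python sketch below shows one way to do that, turning each alert into a small tuple that includes its duration. The function name summarize_alerts and the sample result are hypothetical; the keys "name", "st", "et", and "desc" are those of the schema in this document, where "desc" is optional.

```python
def summarize_alerts(result):
    """Return a list of (name, start, end, duration, description) tuples.

    `result` is one interim result already parsed into a dict. A result
    without an "alerts" key yields an empty list.
    """
    summaries = []
    for alert in result.get("alerts", []):
        start = alert["st"]
        end = alert["et"]
        # "desc" is not required by the schema, so fall back to "".
        description = alert.get("desc", "")
        summaries.append((alert["name"], start, end, end - start, description))
    return summaries


# Hypothetical interim result, for illustration only.
sample_with_alerts = {
    "id": "1234-7",
    "alerts": [
        {"name": "extended_silence", "st": 12, "et": 18,
         "desc": "The agent has been silent for too long"},
        {"name": "empathy_cue", "st": 20, "et": 21},
    ],
}

for name, start, end, duration, description in summarize_alerts(sample_with_alerts):
    print(f"{name}: {start}s to {end}s ({duration}s) {description}".rstrip())
```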

Interim results are delivered as websocket text messages. The streaming client is expected to read each message and parse its JSON contents. The JSON schema of the interim results follows.

{
    "$schema": "http://json-schema.org/draft-06/schema#",
    "title": "Oliver API interim live result",
    "description": "Oliver API interim live response JSON schema v2.1",
    "type": "object",
    "properties": {
        "id": {
            "description": "The sequence number of the live result (a combination of the PID and the chunk number)",
            "type": "string"
        },
        "is_final": {
            "description": "Marks the last response of the live result. The last segment, which includes 'is_final', is a dummy segment without added value",
            "type": "object",
            "properties": {
                "ts": {
                    "description": "Timestamp in seconds",
                    "type": "number"
                }
            }
        },
        "alerts": {
            "type": "array",
            "description": "A list of alert events",
            "items": {
                "type": "object",
                "description": "An alert event",
                "properties": {
                    "name": {
                        "description": "The alert type",
                        "type": "string",
                        "enum": [
                            "extended_silence",
                            "extended_overlap",
                            "slow_to_respond",
                            "speaking_slowly",
                            "continuous_speaking",
                            "energy_cue",
                            "empathy_cue",
                            "agent_energy_cue"
                        ]
                    },
                    "st": {
                        "description": "The start time of the alert in seconds",
                        "type": "number",
                        "minimum": 0
                    },
                    "et": {
                        "description": "The end time of the alert in seconds",
                        "type": "number"
                    },
                    "desc": {
                        "description": "The alert description",
                        "type": "string"
                    }
                },
                "required": ["name", "st", "et"]
            }
        },
        "intensity": {
            "type": "object",
            "description": "Indicates the degree of the signal energy",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "Intensity degree for the first speaker, where the minimum and maximum values represent a very low and high intensity respectively",
                    "type": "number",
                    "minimum": 0,
                    "maximum": 10
                },
                "speaker2": {
                    "description": "Intensity degree for the second speaker, where the minimum and maximum values represent a very low and high intensity respectively",
                    "type": "number",
                    "minimum": 0,
                    "maximum": 10
                }
            },
            "required": ["st", "et"]
        },
        "vocal_variety": {
            "type": "object",
            "description": "Indicates the degree of sound variations within speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "Degree of vocal variety for the first speaker, where the minimum and maximum values represent a very low and high vocal variety respectively",
                    "type": "number",
                    "minimum": 0,
                    "maximum": 10
                },
                "speaker2": {
                    "description": "Degree of vocal variety for the second speaker, where the minimum and maximum values represent a very low and high vocal variety respectively",
                    "type": "number",
                    "minimum": 0,
                    "maximum": 10
                }
            },
            "required": ["st", "et"]
        },
        "speaking_rate": {
            "description": "Describes the speaking rate of speech",
            "type": "object",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete speaking rate for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["very slow", "slow", "normal", "fast", "very fast"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete speaking rate for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["very slow", "slow", "normal", "fast", "very fast"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "hesitation": {
            "description": "Describes the hesitation in speech",
            "type": "object",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete hesitation for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["no", "yes"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete hesitation for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["no", "yes"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "strength": {
            "type": "object",
            "description": "Describes the strength of the speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete strength for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "weak", "strong"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete strength for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "weak", "strong"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "positivity": {
            "type": "object",
            "description": "Describes the positivity in speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete positivity for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "negative", "positive"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete positivity for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "negative", "positive"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "politeness": {
            "type": "object",
            "description": "Describes the politeness in speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete politeness for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "rude", "polite"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete politeness for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "rude", "polite"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "engagement": {
            "type": "object",
            "description": "Describes the engagement in speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete engagement for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "withdrawn", "engaged"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete engagement for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["neutral", "withdrawn", "engaged"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        },
        "emotion": {
            "type": "object",
            "description": "Describes emotions in speech",
            "properties": {
                "st": {
                    "description": "The start time in seconds",
                    "type": "number",
                    "minimum": 0
                },
                "et": {
                    "description": "The end time in seconds",
                    "type": "number"
                },
                "speaker1": {
                    "description": "The discrete emotion for the first speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["happy", "neutral", "angry", "sad", "frustrated"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                },
                "speaker2": {
                    "description": "The discrete emotion for the second speaker",
                    "type": "object",
                    "properties": {
                        "value" : {
                            "type": "string",
                            "enum": ["happy", "neutral", "angry", "sad", "frustrated"]
                        },
                        "confidence": {
                            "type": "string",
                            "enum": ["low", "medium", "high"]
                        }
                    },
                    "required" : ["value"]
                }
            },
            "required": ["st", "et"]
        }
    },
    "required": ["id"]
}
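
To make the message flow concrete, here is a minimal Python sketch of the parsing loop a streaming client might run. It assumes the text messages have already been received and are available as an iterable of strings; the websocket connection itself is left out. The function name read_interim_results and the sample messages (including the id format) are hypothetical. The sketch stops at the message that carries is_final, since that segment is described above as a dummy segment.

```python
import json


def read_interim_results(messages):
    """Yield parsed interim results until the final message is seen.

    `messages` is any iterable of JSON text messages, in arrival order.
    """
    for text in messages:
        result = json.loads(text)
        if "is_final" in result:
            # The final segment is a dummy: stop without yielding it.
            return
        yield result


# Hypothetical messages, for illustration only.
messages = [
    '{"id": "1234-1", "intensity": {"st": 0, "et": 3, "speaker1": 4.2, "speaker2": 6.0}}',
    '{"id": "1234-2", "intensity": {"st": 3, "et": 6, "speaker1": 5.1}}',
    '{"id": "1234-3", "is_final": {"ts": 6}}',
]

for result in read_interim_results(messages):
    intensity = result.get("intensity", {})
    print(result["id"], intensity.get("st"), intensity.get("et"),
          intensity.get("speaker1"))
```

Because only "id" is required at the top level, and only "st" and "et" inside each feature, the loop reads every other field with get rather than assuming it is present.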