The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a1,...,aσ} is defined as the vector of multiplicities of the characters, p(s) = (p1,...,pσ), where pi =|{j|sj =ai}|.Parikhvectorqoccursinsifshasasubstringtwithp(t)=q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n^2) time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size σ ≥ 2 and has sub-linear expected time complexity. More pre- cisely, we present two variants of the algorithm, both using an O(n) size data structure, each of which can be constructed in O(n) time. The first solution is very simple and easy to implement and leads to an expected query time of O(n(σ/log σ )^(1/2)(log m/√m)), where m = SUM_i qi is the length of a string with Parikh vector q. The second uses wavelet trees and improves the expected runtime to O(n(σ/log σ)^(1/2)(1/√m )), i.e., by a factor of log m.

Algorithms for Jumbled Pattern Matching in Strings

CICALESE, Ferdinando;
2012-01-01

Abstract

The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a1,...,aσ} is defined as the vector of multiplicities of the characters, p(s) = (p1,...,pσ), where pi =|{j|sj =ai}|.Parikhvectorqoccursinsifshasasubstringtwithp(t)=q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n^2) time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size σ ≥ 2 and has sub-linear expected time complexity. More pre- cisely, we present two variants of the algorithm, both using an O(n) size data structure, each of which can be constructed in O(n) time. The first solution is very simple and easy to implement and leads to an expected query time of O(n(σ/log σ )^(1/2)(log m/√m)), where m = SUM_i qi is the length of a string with Parikh vector q. The second uses wavelet trees and improves the expected runtime to O(n(σ/log σ)^(1/2)(1/√m )), i.e., by a factor of log m.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11386/3013671
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 55
  • ???jsp.display-item.citation.isi??? 39
social impact