<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1013">
<Title>Partially Distribution-Free Learning of Regular Languages from Positive Samples</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle> 5 Proof </SectionTitle>
<Paragraph position="0"> We can now prove that this algorithm has the properties we claim. We use one technical lemma that we prove in the appendix.</Paragraph>
<Paragraph position="1"> Lemma 1 Given a distribution $D$ over $\Sigma^*$, for any $0 < \delta < 1/2$, when we independently draw
Let $H_0, H_1, \ldots, H_k$ be the sequence of finite automata, with states labelled with multisets, generated by the algorithm when samples are generated by a target PDFA $A$.</Paragraph>
<Paragraph position="2"> We will say that a hypothesis automaton $H_i$ is good if there is a bijective function $\phi$ from a subset of the states of $A$ including $q_0$ to all the states of $H_i$, such that $\phi(q_0)$ is the root node of $H_i$, and such that if there is an edge $\delta(u, \sigma) = v$ in $H_i$ then $\delta(\phi^{-1}(u), \sigma) = \phi^{-1}(v)$ in $A$; i.e., $H_i$ is isomorphic to a subgraph of the target that includes the root. If $\phi(q) = u$ then we say that $u$ represents $q$. In this case the language generated by $H_i$ is a subset of the target language. Additionally we require that for every state $v$ in the hypothesis, the corresponding multiset satisfies $L_\infty(\hat{S}_v, P_{\phi^{-1}(v)}) < \mu/4$. When a multiset satisfies this we will say it is good.</Paragraph>
<Paragraph position="3"> We will extend the function $\phi$ to candidate nodes in the obvious way, and also the definition of goodness.</Paragraph>
<Paragraph position="4"> Definition 1 (Good sample) We say that a sample of size $N$ is good given a good hypothesis DFA $H$ and a target $A$ if all the candidate nodes with multisets larger than the threshold $m_0$ are good, and if $P_A(L(A) \setminus L(H)) > \epsilon$ then the number of strings that exit the hypothesis automaton is more than $n|\Sigma|m_0$.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Approximately Correct </SectionTitle>
<Paragraph position="0"> We will now show that if all the samples are good, then for all $i \in \{0, 1, \ldots, k\}$ the hypothesis $H_i$ will be good, and that when the algorithm terminates the final hypothesis will have low error.</Paragraph>
<Paragraph position="1"> We will do this by induction on the index $i$ of the hypothesis $H_i$. Clearly $H_0$ is good. Suppose $H_{i-1}$ is good, and we draw a good sample.</Paragraph>
<Paragraph position="2"> Consider a candidate node $(u, \sigma)$ with multiset larger than $m_0$.</Paragraph>
<Paragraph position="3"> Since the previous hypothesis was good, this candidate node will represent a state $q$, and thus the multiset will be a sequence of independent draws from the suffix distribution $P_q$ of that state. Thus $L_\infty(\hat{S}_{u,\sigma}, P_q) < \mu/4$ by the goodness of the sample. We compare it to a state $v$ in the hypothesis. If $v$ represents the same state $q$ in the target, then $L_\infty(\hat{S}_v, P_q) < \mu/4$ (by the goodness of the multisets), and the triangle inequality shows that $L_\infty(\hat{S}_{u,\sigma}, \hat{S}_v) \le L_\infty(\hat{S}_{u,\sigma}, P_q) + L_\infty(\hat{S}_v, P_q) < \mu/2$, so the comparison will return true. On the other hand, let us suppose that $v$ represents a different state $q_v$. We know that $L_\infty(\hat{S}_{u,\sigma}, P_q) < \mu/4$ and $L_\infty(\hat{S}_v, P_{q_v}) < \mu/4$ (by the goodness of the multisets), and $L_\infty(P_q, P_{q_v}) \ge \mu$ (by the $\mu$-distinguishability of the target). By the triangle inequality, $L_\infty(P_q, P_{q_v}) \le L_\infty(\hat{S}_{u,\sigma}, P_q) + L_\infty(\hat{S}_{u,\sigma}, \hat{S}_v) + L_\infty(\hat{S}_v, P_{q_v})$, so $L_\infty(\hat{S}_{u,\sigma}, \hat{S}_v) > \mu - \mu/4 - \mu/4 = \mu/2$, and the comparison will return false. In these cases $H_i$ will be good.</Paragraph>
<Paragraph position="4"> Alternatively there is no candidate node above threshold, in which case the algorithm terminates and $i = k$.</Paragraph>
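The comparison step analysed above reduces to testing whether two empirical suffix distributions are within $\mu/2$ of each other in the $L_\infty$ norm. Below is a minimal sketch of that test, assuming multisets are represented as non-empty lists of suffix strings; the names l_inf_distance and similar are ours, not the paper's.

```python
from collections import Counter

def l_inf_distance(multiset_a, multiset_b):
    """L-infinity distance between the empirical suffix distributions
    defined by two multisets (assumed non-empty, as both are above the
    size threshold m0 when compared)."""
    freq_a, freq_b = Counter(multiset_a), Counter(multiset_b)
    n_a, n_b = len(multiset_a), len(multiset_b)
    support = freq_a.keys() | freq_b.keys()
    # Counter returns 0 for strings absent from one multiset.
    return max(abs(freq_a[s] / n_a - freq_b[s] / n_b) for s in support)

def similar(candidate_multiset, state_multiset, mu):
    """Merge test analysed in Section 5.1: with good multisets and a
    mu-distinguishable target, the distance falls below mu/2 exactly
    when both nodes represent the same target state."""
    return l_inf_distance(candidate_multiset, state_multiset) < mu / 2
```

Using a hash-based Counter makes each comparison linear in the multiset sizes, in line with the efficiency remark in Section 5.3.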
<Paragraph position="5"> The total number of strings that exit the hypothesis must then be less than $n|\Sigma|m_0$, since there are at most $n|\Sigma|$ candidate nodes, each of which has a multiset of size less than $m_0$. By the definition of $N$ and the goodness of the sample, $P_A(L(A) \setminus L(H)) < \epsilon$. Since the hypothesis is good, and thus defines a subset of the target language, it is a suitably close hypothesis.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Probably Correct </SectionTitle>
<Paragraph position="0"> We must now show that by setting $m_0$ sufficiently large we can be sure that, with probability greater than $1 - \delta$, all of the samples will be good. We need to show that with high probability a sample of size $N$ will be good for a given hypothesis $G$. We can assume that the hypothesis is good at each step. Each step of the algorithm increases the number of transitions in the active set by at least 1. There are at most $n|\Sigma|$ transitions in the target, so there are at most $n|\Sigma| + 2$ steps in the algorithm, since we need an initial step to get the multiset for the root node and another at the end when we terminate. So we want to show that a particular sample will be good with probability at least $1 - \delta/(n|\Sigma| + 2)$.</Paragraph>
<Paragraph position="1"> There are two sorts of errors that can make the sample bad: first, one of the multisets could be bad; secondly, too few strings might exit the graph. There are at most $n|\Sigma|$ candidate nodes, so we will make the probability of getting a bad multiset less than $\delta/(2n|\Sigma|(n|\Sigma| + 2))$, and we will make the probability of the second sort of error less than $\delta/(2(n|\Sigma| + 2))$.</Paragraph>
<Paragraph position="2"> First we bound the probability of getting a bad multiset of size $m_0$. This will be satisfied if we set $\epsilon_0 = \mu/4$ and $\delta_0 = \delta/(2n|\Sigma|(n|\Sigma| + 2))$ and use Lemma 1.</Paragraph>
<Paragraph position="3"> We next need to show that at each step the number of strings that exit the graph will not be too far from its expectation when $P_A(L(A) \setminus L(H)) > \epsilon$. We can use Chernoff bounds to show that the probability that too few strings exit the graph is less than $\delta/(2(n|\Sigma| + 2))$; this is satisfied by the value of $N$ defined earlier, as can easily be verified.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.3 Polynomial complexity </SectionTitle>
<Paragraph position="0"> Since we need to draw at most $n|\Sigma| + 2$ samples of size $N$, the overall sample complexity is $(n|\Sigma| + 2)N$, which, ignoring log factors, gives a sample complexity of $O(n^2|\Sigma|^2 \mu^{-2} \epsilon^{-1})$; this is quite benign. It is easy to see that the computational complexity is polynomial. Producing an exact bound is difficult since it depends on the lengths of the strings. The precise complexity also depends on the relative magnitudes of $\epsilon$, $|\Sigma|$ and so on. The complexity is dominated by the cost of the comparisons. We can limit each multiset comparison to at most $m_0$ strings, which can be compared naively with $m_0^2$ string comparisons, or much more efficiently using hashing or sorting. The number of nodes in the hypothesis is at most $n$, and the number of candidate nodes is at most $n|\Sigma|$, so the number of comparisons at each step is bounded by $n^2|\Sigma|$, and thus the total number of multiset comparisons by $n^2|\Sigma|(n|\Sigma| + 2)$. Construction of the multisets can be performed in time linear in the sample size. These observations suffice to show that the computation is polynomially bounded.</Paragraph>
</Section>
</Section>
</Paper>
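To make the accounting in Sections 5.2 and 5.3 concrete, here is a hypothetical sketch of the parameter settings. The constant c in the Lemma 1 concentration bound and the Chernoff constant in N are placeholders, since this section does not restate them; only the shapes ($m_0$ on the order of $(1/\epsilon_0^2)\log(1/\delta_0)$, and $N$ on the order of $n|\Sigma|m_0/\epsilon$) follow the proof's bookkeeping.

```python
from math import ceil, log

def parameter_settings(n, sigma, eps, delta, mu, c=2.0):
    """Hypothetical instantiation of the proof's parameter accounting.
    n: states in the target PDFA; sigma: alphabet size |Sigma|;
    eps, delta: PAC parameters; mu: distinguishability.
    The constants c and the factor 2 in N are placeholders."""
    steps = n * sigma + 2                     # max steps of the algorithm
    eps0 = mu / 4                             # accuracy needed per multiset
    delta0 = delta / (2 * n * sigma * steps)  # failure prob. per multiset
    # Lemma 1-style concentration bound (exact constant not given here):
    m0 = ceil((c / eps0 ** 2) * log(2 / delta0))
    # Chernoff step: when P_A(L(A) \ L(H)) > eps, we need more than
    # n*sigma*m0 strings to exit the hypothesis, so N of this order:
    N = ceil(2 * n * sigma * m0 / eps)
    total = steps * N                         # overall sample complexity
    return m0, N, total
```

Ignoring log factors, total here grows as $n^2|\Sigma|^2 \mu^{-2} \epsilon^{-1}$, matching the bound stated in Section 5.3.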