EduNLP¶

SIF¶

EduNLP.SIF.sif.is_sif(item)[source]¶

Parameters

item –

Returns

when item can not be parsed correctly, raise Error;
when item doesn’t need to be modified, return Ture;
when item needs to be modified, return False;

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$，' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> is_sif(text)
False

EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, safe=True, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]¶

Default to use linear Tokenizer, change the tokenizer by specifying tokenization_params

Parameters

item –
figures –
safe –
symbol –
tokenization –
tokenization_params –
method: which tokenizer to be used, “linear” or “ast” The parameters only useful for “linear”:

The parameters only useful for “ast”:
ord2token: whether to transfer the variables (mathord) and constants (textord) to special tokens. var_numbering: whether to use number suffix to denote different variables
errors – warn raise coerce strict ignore

Returns

When tokenization is False, return SegmentList;
When tokenization is True, return TokenList

Examples

>>> test_item = r"如图所示，则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$，则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$，则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]

EduNLP.SIF.sif.to_sif(item)[source]¶

Parameters: item –
Returns
Return type: item

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$（单位...'

EduNLP.Formula¶

EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]¶

The origin code author is https://github.com/hxwujinze

Parameters

formula (str or List[Dict]) – 公式字符串或通过katex解析得到的结构体
index (int) – 本子树在树上的位置
forest_begin (int) – 本树在森林中的起始位置
father_tree (List[Dict]) – 父亲树
is_str (bool) –

Returns

tree (List[Dict]) – 重新解析形成的特征树
todo (finish all types)

Notes

Some functions are not supportd in katex e.g.,

tag
- \begin{equation} \tag{tagName} F=ma \end{equation}
- \begin{align} \tag{1} y=x+z \end{align}
- \tag*{hi} x+y^{2x}
dddot
- \frac{ \dddot y }{ x }

For more information, refer to katex support table

EduNLP.Formula.ast.get_edges(forest)[source]¶

构造边集合

Parameters: forest (List[Dict]) – 森林
Returns: edges – 边集合
Return type: list of tuple(src,dst,type)

EduNLP.Formula.ast.link_variable(forest)[source]¶

建森林

Parameters: forest (List[Dict]) –
Returns: trees
Return type: List[Dict]

EduNLP.I2V¶

class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
kwargs – the parameters passed to t2v

class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶

Parameters

name –
model_dir –

Returns

i2v model

Return type

I2V

EduNLP.Pretrain¶

class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶

Parameters: symbol – gms fgm

class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]¶

Parameters

symbol – gm fgm gmas fgmas
general – True when item isn’t in standard format, and want to tokenize formulas(except formulas in figure) linearly. False when use ‘ast’ mothed to tokenize formulas instead of ‘linear’.

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']

EduNLP.Tokenizer¶

class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)[source]¶

Examples

>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$，如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
... "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']

class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)[source]¶

Examples

>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$，如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']

EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]¶

Parameters

name (str) –
args –
kwargs –

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']

Vector¶

class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]¶

Examples

>>> model = RNNModel("ELMO", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])