a9410738ae
fix permissions bug when creating directories, need to put in user/group execute bit.
Matt
2015-10-07 08:26:27 -06:00
4600ce0816
fix threads from freezing up just because pthread_create() had an error. need to return the thread stack.
Matt
2015-10-07 07:50:44 -06:00
836aa1756d
fix threads from freezing up just because pthread_create_thread() had errors. need to return the thread stack to the linked list of thread stacks.
Matt
2015-10-07 07:49:34 -06:00
6d90ea2e5f
try to launch threads even if none need cleanup. hopefully fixes thread freeze.
Matt
2015-10-07 07:29:37 -06:00
cee5d8922a
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-10-05 18:48:06 -06:00
5b605624ca
fix core dump
Matt
2015-10-05 18:47:11 -06:00
a77c9be5b8
Merge branch 'diffbot-testing' into diffbot-sam
Matt
2015-10-05 17:32:21 -06:00
df1c7f6e0f
update qa.cpp syntax test to do &n=100 for gbssStatusCode:0 query
Matt
2015-10-05 17:31:35 -06:00
97b9c99bec
correcting facet min=0
sam
2015-10-05 16:09:52 -07:00
49ec9c99fd
Don't restart all items when forcing a list of items into injector.
Zak Betz
2015-10-05 15:33:09 -06:00
9b785a1522
allow more than 2gb of mem to be allocated to hold resulting docids.
Matt
2015-10-05 09:35:44 -07:00
757a44b149
fix facets when doing > 1 split and first split termlist is empty.
Matt
2015-10-05 10:05:08 -06:00
21b71226a6
remove bad fix
Matt
2015-10-05 09:33:02 -06:00
c947252fee
Add gbcapturedate to individual doc's metadata when injecting warcs.
Zak Betz
2015-10-04 01:53:54 -06:00
39214a9dc6
Merge branch 'diffbot-testing' into testing
Matt
2015-10-02 19:26:15 -06:00
42cdd5b382
fix msg20 getsummary core
Matt Wells
2015-10-02 12:34:08 -07:00
e4adc99c0c
fix empty winner tree bug. try to improve rdbcache promotion logic for all caches. -O2 on spider.cpp.
Matt Wells
2015-10-02 12:16:48 -07:00
9daaa4d5af
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-10-01 19:37:12 -07:00
9178d67b2f
fix churn bug in winnerlistcache in spider.cpp: do not add the dolebuf list of spiderrequests back into the cache, just modify the "jump" in the first 4 bytes of the cached record. re-adding it to the cache created too much churn and we'd lose cached records unnecessarily.
Matt Wells
2015-10-01 19:35:34 -07:00
6becb55a2b
Stream warcs instead of downloading them and unzipping them on disk.
Zak Betz
2015-09-30 22:25:59 -06:00
a8e3e4b269
if metadata is already in the old xmldoc::ptr_metadata then do not re-add it.
Matt
2015-09-30 21:46:07 -06:00
e06ae06c23
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-30 15:27:07 -06:00
a31c7f5fc8
exiting msg
Matt
2015-09-30 15:26:58 -06:00
06aea41611
show spiderdb scan progress in spider queue for the collection
Matt Wells
2015-09-30 13:38:22 -07:00
b97546f98c
do not expand about:blank iframes.
Matt Wells
2015-09-30 09:36:04 -07:00
cb4bbe8892
Merge branch 'ia' into ia-zak
Matt
2015-09-30 07:58:31 -06:00
d4c677170f
index metadata on EDOCUNCHANGED errors, and append new meta data to XmlDoc::ptr_metadata.
Matt
2015-09-30 07:57:40 -06:00
67fc339953
prevent out of mem core. actually trying to alloc more than 2GB for search result stuff.
Matt
2015-09-26 21:34:07 -07:00
f0a2f86200
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-25 08:09:25 -07:00
55993e58d5
fix cores on gi #0
Matt
2015-09-25 08:09:05 -07:00
2721256c0d
show ip port of bad host
Matt Wells
2015-09-25 07:47:21 -07:00
68a3679fae
exit if can not load auth/internetarchive.yml file
Matt
2015-09-25 08:33:25 -06:00
83ac18fff4
Merge branch 'master' into testing
Matt
2015-09-25 08:25:19 -06:00
e2fad81227
Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing
Matt
2015-09-25 08:24:54 -06:00
3ce6c7d941
Merge branch 'ia-zak' into testing
Matt
2015-09-25 08:24:12 -06:00
1e8f656d30
Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-25 08:23:42 -06:00
93943a0cab
some pages legitimately have no outlinks, no need to think they were banned.
Matt Wells
2015-09-24 14:01:23 -07:00
6454dad6bf
Revert "ignore real root, just use seeds, to detect if banned."
Matt Wells
2015-09-24 13:29:20 -07:00
cb60f68e72
ignore real root, just use seeds, to detect if banned.
Matt
2015-09-24 14:15:29 -06:00
260864b364
urgent fix for core dumps for some queries that have long termlists.
Matt
2015-09-24 11:49:50 -06:00
268b21d552
reduce log spam
Matt Wells
2015-09-24 11:37:16 -06:00
9be3f9310e
fix annoying core dump for some queries in Posdb.cpp
Matt
2015-09-24 11:34:02 -06:00
8a0461b82f
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-24 09:10:37 -06:00
d92b153090
added 'verify writes' switch to track down data corruption
Matt
2015-09-24 09:10:20 -06:00
16b6e44bd1
Show utf8 url in page results.
Zak Betz
2015-09-21 16:44:40 -06:00
100888d691
fix file/dir creation permissions bugs
Matt
2015-09-21 12:44:41 -06:00
74cde33a3a
just use the user's umask val for all file/dir creation
Matt
2015-09-21 11:33:38 -06:00
ce7b06fc4d
all files made are now group writable. if you don't like that then you can make a special group and set the directory just group writable for that group using chmod g+s <dir>.
Matt
2015-09-21 11:19:34 -06:00
eefbe95ce9
Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-21 10:13:29 -06:00
786ba76d10
Merge branch 'ia-zak' into testing
Matt
2015-09-21 10:12:58 -06:00
5aaa08d81a
Merge branch 'ia-zak' of github.com:gigablast/open-source-search-engine into ia-zak
Matt
2015-09-21 10:12:38 -06:00
83190e3bbc
Make punycoded urls printable.
Zak Betz
2015-09-21 09:17:40 -06:00
5635695666
oom prevention
Matt
2015-09-20 21:42:46 -06:00
d6d5d10a15
prevent core from bad root title rec
Matt Wells
2015-09-20 08:26:00 -07:00
69a3cb0999
fix corrupt tag with corrupt root title buf from coring
Matt Wells
2015-09-17 21:33:58 -07:00
13e0ba7bff
fix bug of having a meta redirect tag in <script> tags. we have to use Xml class to make sure it is a legit refresh tag.
Matt
2015-09-16 11:03:38 -06:00
58e9f56015
never let any diffbot error prevent us from retrying a url in subsequent crawl rounds.
Matt
2015-09-16 10:00:11 -06:00
bcdecc63c6
expose "urlip" injection parm to provide ip of url being injected to save gigablast from an ip lookup if you want.
Matt
2015-09-16 09:43:15 -06:00
d11761cfd9
update graph key
Matt
2015-09-15 15:48:40 -06:00
b2f7e72d8a
fix core from showing graph to two users
Matt
2015-09-15 15:46:38 -06:00
32d7f5cb97
better warc injection load balancing
Matt
2015-09-15 15:04:26 -06:00
28ed7f66af
Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-14 19:16:20 -06:00
f9c4f8fc9a
test with 2 first
Matt
2015-09-14 19:16:07 -06:00
fcd4fe3ff3
Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-14 19:04:15 -06:00
16d90e60d1
allow up to 10 filter threads now
Matt
2015-09-14 19:03:52 -06:00
5caa219c71
Reduce false positives by not counting \0 as a non-ascii char in the url.
Zak Betz
2015-09-14 12:24:50 -06:00
5d724cdcc3
Check for spaces before non-ascii chars to reduce false positives. Also print the position of non-ascii char to aid debugging. We still need to handle utf8 chars in path.
Zak Betz
2015-09-14 11:11:56 -06:00
519b2c4f42
Fix repeating xn--xn-- when there are spaces in the domain. Make gb unittest take a name of the unit test to run.
Zak Betz
2015-09-14 10:24:22 -06:00
519017828c
Enable punycode domains for testing. We still need to display them as utf8 on the front end.
Zak Betz
2015-09-14 09:32:25 -06:00
5622ca47ee
Work on non-ascii domain names. It works on correct inputs but will crash on some incorrect inputs, so it is disabled for now.
Zak Betz
2015-09-14 00:34:44 -06:00
b4bac67fdf
Merge branch 'diffbot-testing' into testing
Matt
2015-09-13 19:40:57 -06:00
710661a0f3
Merge branch 'ia-zak' into ia
Matt
2015-09-13 19:40:21 -06:00
ffa465b942
Merge branch 'ia' of github.com:gigablast/open-source-search-engine into ia
Matt
2015-09-13 19:40:17 -06:00
f1db3aca94
Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-13 19:39:50 -06:00
b8e4046d61
remove unnecessary line
Matt Wells
2015-09-13 17:54:38 -07:00
65613feb4c
fix bug of not using part files when generating map
Matt
2015-09-13 17:52:40 -07:00
cb6ca24c26
Allow nospider and noquery on the same host. Fix punycoding of non-ascii domains.
Zak Betz
2015-09-13 17:15:31 -06:00
3444c67851
exit faster.
Matt
2015-09-13 14:28:35 -07:00